Hello Roland,

I think a nice option would be UIMA [1], which provides a pipeline architecture for analyzing unstructured information. With it you can use CollectionReaders to pull documents from various sources, Annotators to extract metadata from those documents where needed [2], and a Solr CAS Consumer to write everything to Solr [3].
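To make the three stages concrete, here is a rough Python sketch of the reader -> annotators -> consumer shape. It is not the UIMA API (UIMA itself is Java), and every name in it (Document, read_collection, annotate_word_count, to_solr_fields, run_pipeline) is made up for illustration. The consumer only builds a dictionary of fields; actually posting to Solr is left out.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Iterable, Iterator, List


@dataclass
class Document:
    """Hypothetical unit of work passed between stages (loosely, a CAS in UIMA terms)."""
    text: str
    metadata: Dict[str, object] = field(default_factory=dict)


def read_collection(raw_texts: Iterable[str]) -> Iterator[Document]:
    """Reader stage: wrap each raw input in a Document."""
    for text in raw_texts:
        yield Document(text=text)


def annotate_word_count(doc: Document) -> Document:
    """Annotator stage: attach a simple piece of extracted metadata."""
    doc.metadata["word_count"] = len(doc.text.split())
    return doc


def to_solr_fields(doc: Document) -> Dict[str, object]:
    """Consumer helper: flatten a Document into a field dictionary.

    A real consumer would send this to Solr; that step is omitted here.
    """
    fields: Dict[str, object] = {"text": doc.text}
    fields.update(doc.metadata)
    return fields


def run_pipeline(
    raw_texts: Iterable[str],
    annotators: List[Callable[[Document], Document]],
    consume: Callable[[Dict[str, object]], None],
) -> None:
    """Run every document through all annotators, then hand it to the consumer."""
    for doc in read_collection(raw_texts):
        for annotate in annotators:
            doc = annotate(doc)
        consume(to_solr_fields(doc))


if __name__ == "__main__":
    collected: List[Dict[str, object]] = []
    run_pipeline(["Solr and UIMA", "hello"], [annotate_word_count], collected.append)
    for fields in collected:
        print(fields)
```

The point of the shape is that crawling, analysis and routing each stay in their own stage, so all of it can run before anything reaches Solr.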
You could also use the UIMA integration already committed as a dedicated Solr contrib module [4][5], which uses a custom UpdateRequestProcessor.

Hope this helps,
Tommaso

[1] : http://uima.apache.org
[2] : http://uima.apache.org/d/uimaj-2.3.1/overview_and_setup.html#ugr.ovv.conceptual.graduating_to_collection_processing
[3] : http://uima.apache.org/sandbox.html#solrcas.consumer
[4] : http://wiki.apache.org/solr/SolrUIMA
[5] : http://svn.apache.org/repos/asf/lucene/dev/trunk/solr/contrib/uima/

2011/4/18 Roland Villemoes <[email protected]>

> Hi All,
>
> I know this question may have been asked before – but I really did not
> find any usable answers browsing the archives. So I have to try the
> developer list here.
>
> We at Alpha Solutions often need a pipeline for handling crawling,
> analyzing and routing before we hit the UpdateRequestHandler in Solr. I
> know we could actually use the UpdateRequestHandler for this, but often we
> like to perform all these tasks before hitting Solr.
>
> We have been using OpenPipeline, which also offers a GUI that makes it
> rather nice to administer (if you tweak the GUI a bit!). It does seem,
> though, that OpenPipeline will not really get going. Nothing happens, there
> is not really any community around it, and it doesn't seem that the people
> behind it will ever move it further.
>
> So we are looking around for other "pipeline" projects that can work well
> with Solr.
>
> So, do any of you have any ideas on this? Any recommendations? Or any
> plans for this in Solr?
>
> Thanks a lot
>
> Best regards
>
> Roland Villemoes
> Alpha Solutions A/S
> Web: www.alpha-solutions.dk
