Solr newbie

Arne Muller Thu, 26 Jul 2007 05:04:09 -0700

Hello,

I've just started using Lucene to index a file server, and I'm also aiming to
index Lotus Notes and some tables from relational databases.


After some research, I have come to the conclusion (so far) that I'm
re-inventing the wheel, and that it may be better to use Solr or Nutch as
Lucene front-ends.

I was wondering whether the following workflow is more or less state of the
art, or whether you have anything to comment on or add:

Since web crawling is not necessary (my documents are on a file server), I was
thinking of using Solr and implementing a basic program that traverses the
file system as a scheduled task (or cron job). The program will not "crawl",
i.e. it will not follow any references within the documents to continue its
search elsewhere. It just processes each file it finds, and if the file was
modified within the last 24 hours, it adds the file to Solr.
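To make the traversal step concrete, here is a minimal sketch in Python. It is only an illustration of the idea described above, not an existing tool: the function name `recently_modified_files` and its parameters are made up for this example, and it assumes that the file system's modification timestamp is a good enough signal for "changed since the last run".

```python
import os
import time


def recently_modified_files(root, max_age_seconds=24 * 60 * 60, now=None):
    """Yield paths under `root` modified within the last `max_age_seconds`.

    Hypothetical helper for illustration; the name and signature are
    assumptions, not part of Solr, Lucene or Nutch.
    """
    if now is None:
        now = time.time()
    cutoff = now - max_age_seconds
    # os.walk only descends into directories. It never opens a file or
    # follows references inside it, so this is a traversal, not a crawl.
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                mtime = os.path.getmtime(path)
            except OSError:
                continue  # file disappeared or is unreadable; skip it
            if mtime >= cutoff:
                yield path
```

One thing to keep in mind with a fixed 24-hour window: if a scheduled run is skipped or runs late, files changed in the gap are missed. Recording the time of the last successful run and passing the difference as `max_age_seconds` would be one way around that.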

Most of the files are doc, xls, ppt and pdf. My "traverser" will therefore
use Apache POI and PDFBox to extract the contents from the documents, create
an appropriate XML stream with the Solr "add" tag, and send it to Solr.
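As a rough sketch of the XML step, the snippet below builds a Solr "add" message from already-extracted text. The text extraction itself (POI, PDFBox) is not shown. The field names `id` and `text` are assumptions for this example and would have to match whatever fields the Solr schema actually defines; `build_add_xml` is a made-up helper name.

```python
import xml.etree.ElementTree as ET


def build_add_xml(docs):
    """Return a Solr <add> message for a list of {field name: value} dicts.

    Hypothetical helper for illustration. Field names must match the
    fields defined in the target Solr schema.
    """
    add = ET.Element("add")
    for fields in docs:
        doc = ET.SubElement(add, "doc")
        for name, value in fields.items():
            field = ET.SubElement(doc, "field", {"name": name})
            # ElementTree escapes &, < and > when serializing, so raw
            # extracted text can be assigned directly.
            field.text = value
    return ET.tostring(add, encoding="unicode")


# Example: one document with an assumed unique key and a body field.
message = build_add_xml([
    {"id": "/srv/files/report.pdf", "text": "Quarterly R&D report"},
])
```

The resulting string would then be sent to Solr's update handler in an HTTP POST, followed by a `<commit/>` message so that the new documents become searchable.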

I was wondering if there are already tools for this kind of task available
(this time I'd like to avoid reinventing the wheel ;-)?

  thanks a lot for your help,
  kind regards,

 Arne
