Hello, I've just started using Lucene to index a file server, and I'm also aiming to index Lotus Notes and some tables from relational databases.
After some research, I have come (so far) to the conclusion that I'm re-inventing the wheel, and that it may be better to use Solr or Nutch as Lucene front-ends. I was wondering whether the following workflow is more or less state of the art, or whether you have something to comment on or add.

Since web crawling is not necessary (my documents are on a file server), I was thinking of using Solr and implementing a basic program that traverses the file system as a scheduled task (or cron job). The program will not "crawl", i.e. it will not follow any references within the documents to continue its search elsewhere. It just processes each file it finds, and if the file was modified within the last 24 hours, it adds it to Solr.

Most of the files are doc, xls, ppt and pdf. My "traverser" will therefore use Apache POI and PDFBox to extract the contents of the documents, create an appropriate XML stream with the Solr "add" tag, and send it to Solr.

I was wondering whether tools for this kind of task are already available (this time I'd like to avoid reinventing the wheel ;-)?

Thanks a lot for your help,
kind regards,
Arne
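
P.S. To make the traverser idea concrete, here is a rough sketch of what I have in mind. It is only an illustration under several assumptions: the class name `RecentFileTraverser`, the Solr field names `id` and `text`, and the `extractText` placeholder (where Apache POI and PDFBox would be called) are all made up for this example. It prints the "add" XML instead of sending it to Solr, and uses only the Java standard library (Java 11 or later).

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.time.Duration;
import java.time.Instant;
import java.util.List;
import java.util.Locale;
import java.util.stream.Collectors;
import java.util.stream.Stream;

class RecentFileTraverser {

    // File types mentioned above; everything else is skipped.
    private static final List<String> EXTENSIONS = List.of(".doc", ".xls", ".ppt", ".pdf");

    static boolean hasWantedExtension(Path file) {
        String name = file.getFileName().toString().toLowerCase(Locale.ROOT);
        return EXTENSIONS.stream().anyMatch(name::endsWith);
    }

    static boolean modifiedAfter(Path file, Instant cutoff) {
        try {
            return Files.getLastModifiedTime(file).toInstant().isAfter(cutoff);
        } catch (IOException e) {
            return false; // files whose timestamp cannot be read are skipped
        }
    }

    // Walks the tree below root (no link following inside documents, just the
    // directory structure) and returns the wanted files modified after cutoff.
    static List<Path> findRecentFiles(Path root, Instant cutoff) throws IOException {
        try (Stream<Path> paths = Files.walk(root)) {
            return paths
                    .filter(Files::isRegularFile)
                    .filter(RecentFileTraverser::hasWantedExtension)
                    .filter(p -> modifiedAfter(p, cutoff))
                    .sorted()
                    .collect(Collectors.toList());
        }
    }

    static String escapeXml(String s) {
        return s.replace("&", "&amp;")
                .replace("<", "&lt;")
                .replace(">", "&gt;")
                .replace("\"", "&quot;");
    }

    // Builds a Solr XML "add" message. The field names "id" and "text" are
    // assumptions; they must match whatever the Solr schema defines.
    static String toSolrAddXml(String id, String text) {
        return "<add><doc>"
                + "<field name=\"id\">" + escapeXml(id) + "</field>"
                + "<field name=\"text\">" + escapeXml(text) + "</field>"
                + "</doc></add>";
    }

    // Hypothetical placeholder: the real version would call Apache POI
    // (doc, xls, ppt) or PDFBox (pdf) depending on the file type.
    static String extractText(Path file) {
        return "";
    }

    public static void main(String[] args) throws IOException {
        Path root = Path.of(args.length > 0 ? args[0] : ".");
        Instant cutoff = Instant.now().minus(Duration.ofHours(24));
        for (Path file : findRecentFiles(root, cutoff)) {
            // A real run would POST this to Solr's update handler instead.
            System.out.println(toSolrAddXml(file.toAbsolutePath().toString(), extractText(file)));
        }
    }
}
```

The idea would be to run this once a day from cron with the root directory of the file server as its only argument. A fixed 24-hour window would miss files if a run is skipped, so storing the time of the last successful run and using that as the cutoff might be more robust.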