Hello,

Lucene sounds like the way to go here. What's more, Lucene in Action (1st edition) includes a small, simple framework I wrote for file-system indexing. You can define your own parser for your custom file format and the indexer will use it. I think it's in Chapter 7. The source code is freely downloadable from manning.com/hatcher2.
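To make the "define your own parser and the indexer will use it" idea concrete, here is a minimal sketch in plain Java. It is not the framework from the book: the names FileIndexerSketch, FileParser, KeyValueParser, the ".rec" extension, and the "key=value" line format are all hypothetical, chosen only for illustration. It also stops short of Lucene itself and just collects the extracted fields; in a real indexer each field map would become one Lucene Document.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.stream.Collectors;
import java.util.stream.Stream;

/** Sketch of a file-system indexer with pluggable, per-extension parsers. */
class FileIndexerSketch {

    /** Hypothetical plug-in point: turns one file's lines into named fields. */
    interface FileParser {
        Map<String, String> parse(List<String> lines);
    }

    /** Example parser for an assumed "key=value" per-line file format. */
    static class KeyValueParser implements FileParser {
        @Override
        public Map<String, String> parse(List<String> lines) {
            Map<String, String> fields = new LinkedHashMap<>();
            for (String line : lines) {
                int eq = line.indexOf('=');
                if (eq > 0) { // skip lines with no key before '='
                    fields.put(line.substring(0, eq).trim(),
                               line.substring(eq + 1).trim());
                }
            }
            return fields;
        }
    }

    private final Map<String, FileParser> parsersByExtension = new HashMap<>();

    /** Registers a parser for a file extension (case-insensitive, no dot). */
    void register(String extension, FileParser parser) {
        parsersByExtension.put(extension.toLowerCase(Locale.ROOT), parser);
    }

    /** Returns the parser for a file name, or null if none is registered. */
    FileParser parserFor(String fileName) {
        int dot = fileName.lastIndexOf('.');
        if (dot < 0) {
            return null;
        }
        String ext = fileName.substring(dot + 1).toLowerCase(Locale.ROOT);
        return parsersByExtension.get(ext);
    }

    /** Walks root recursively; returns one field map per parseable file. */
    List<Map<String, String>> index(Path root) {
        List<Map<String, String>> docs = new ArrayList<>();
        try (Stream<Path> paths = Files.walk(root)) {
            List<Path> files = paths.filter(Files::isRegularFile)
                                    .collect(Collectors.toList());
            for (Path file : files) {
                FileParser parser = parserFor(file.getFileName().toString());
                if (parser == null) {
                    continue; // no parser registered for this extension
                }
                // Assumes text files readable as UTF-8.
                Map<String, String> fields =
                        new LinkedHashMap<>(parser.parse(Files.readAllLines(file)));
                fields.put("path", file.toString());
                // A real indexer would build a Lucene Document from these
                // fields here and add it to the index.
                docs.add(fields);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return docs;
    }
}
```

An existing directory-walking app could play the role of index() here and only call out to a parser per file; the point of the sketch is that the parser is the one piece specific to the proprietary format.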

Otis
--
Sematext is hiring -- http://sematext.com/about/jobs.html?mls
Lucene, Solr, Nutch, Katta, Hadoop, HBase, UIMA, NLP, NER, IR

----- Original Message ----
> From: "[email protected]" <[email protected]>
> To: [email protected]
> Sent: Monday, July 27, 2009 3:35:36 PM
> Subject: Using Nutch (w/custom plugin) to crawl vs. custom Lucene app
>
> Hi,
>
> I've been familiarizing myself with Nutch, in preparation for putting
> together a proof-of-concept (POC) that we are wanting. Basically, we have
> some files of proprietary file type, and we want to be able to search on
> specific "fields" within these files. The files are physically stored on
> the local filesystem.
>
> Thus far, I've gotten an initial Nutch instance working, and also a 2nd
> Nutch instance, configured for crawling the local filesystem. These test
> instances just use the "out-of-box" Nutch and Nutch plugins, e.g., the PDF
> plugin, just to allow me to get familiar with Nutch software.
>
> Having done that, my original idea was to write some Nutch plugins that
> could be used with a Nutch crawl.
>
> However, we already have some previously-built apps that basically "crawl"
> (e.g., they do a recursive directory search on the local filesystem) the
> local filesystem and finds all of these files. These are Java apps that we
> previously built for various purposes.
>
> So, I'm wondering if it might make more sense (and I think may be easier)
> to take one of those existing apps, and, basically, just enhance them to
> build Lucene indexes, which could then be used by the Nutch web app (as a
> web-based search web app)?
>
> As I said, I'm really new to Nutch, and also to Lucene, but from what I've
> researched so far, it *looks like* it'd be fairly easy to extend some of
> existing apps to generate Lucene indexes, and I have some questions:
>
> - If my custom Java app can be extended to "just" build indexes using
>   Lucene, is that all that it needs to do in order for these to work ok
>   with Nutch web app?
>
> - Am I underestimating the effort needed to build the Lucene indexes that
>   the Nutch web app could use?
>
> I was wondering if anyone here, has had to go through a similar situation
> (Nutch plugin for custom file type vs. custom crawl app to build Lucene
> indexes that the Nutch web app can use)?
>
> Any other thoughts on all of this would be greatly appreciated from the
> Nutch/Lucene experts here!!
>
> Thanks,
> Jim
