Hi,

I've been familiarizing myself with Nutch in preparation for putting together 
a proof-of-concept (POC).  Basically, we have some files of a proprietary 
type, and we want to be able to search on specific "fields" within these 
files.  The files are physically stored on the local filesystem.

Thus far, I've gotten an initial Nutch instance working, and also a second 
Nutch instance configured for crawling the local filesystem.  These test 
instances just use the "out-of-box" Nutch and Nutch plugins (e.g., the PDF 
plugin), just to allow me to get familiar with the Nutch software.

Having done that, my original idea was to write some Nutch plugins that could 
be used with a Nutch crawl.

However, we already have some previously-built apps that basically "crawl" 
the local filesystem (i.e., they do a recursive directory search) and find 
all of these files.  These are Java apps that we previously built for various 
purposes.
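
To give a rough idea of what I mean by "crawl", here is a minimal sketch of 
that kind of recursive search.  It is illustrative only and not our actual 
code: the class name LocalFileFinder, the ".xyz" suffix, and the use of 
java.nio's Files.walk are placeholders/assumptions on my part.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical stand-in for one of the existing "crawl" apps.
public class LocalFileFinder {

    // Recursively collects regular files under root whose names end with suffix.
    static List<Path> findFiles(Path root, String suffix) throws IOException {
        try (Stream<Path> paths = Files.walk(root)) {
            return paths
                    .filter(Files::isRegularFile)
                    .filter(p -> p.getFileName().toString().endsWith(suffix))
                    .sorted()
                    .collect(Collectors.toList());
        }
    }

    public static void main(String[] args) throws IOException {
        // Root directory and suffix are placeholders; ".xyz" is not a real format.
        Path root = Paths.get(args.length > 0 ? args[0] : ".");
        String suffix = args.length > 1 ? args[1] : ".xyz";
        for (Path file : findFiles(root, suffix)) {
            // A per-file indexing step would be added at this point.
            System.out.println(file);
        }
    }
}
```

The loop in main is where I would expect the index-building step to go, one 
call per file found.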

So, I'm wondering if it might make more sense (and I think it may be easier) 
to take one of those existing apps and just enhance it to build Lucene 
indexes, which could then be used by the Nutch web app as a web-based search 
front end?

As I said, I'm really new to Nutch, and also to Lucene, but from what I've 
researched so far, it *looks like* it'd be fairly easy to extend some of our 
existing apps to generate Lucene indexes.  I have some questions:

- If my custom Java app can be extended to "just" build indexes using Lucene, 
is that all that it needs to do in order for these indexes to work OK with 
the Nutch web app?

- Am I underestimating the effort needed to build the Lucene indexes that the 
Nutch web app could use?

I was wondering if anyone here has had to go through a similar situation 
(Nutch plugin for custom file type vs. custom crawl app to build Lucene indexes 
that the Nutch web app can use)?

Any other thoughts on all of this would be greatly appreciated from the 
Nutch/Lucene experts here!!

Thanks,
Jim
