Re: Using Nutch (w/custom plugin) to crawl vs. custom Lucene app

Otis Gospodnetic Sun, 02 Aug 2009 20:26:06 -0700

Hello,

Lucene sounds like the way to go here.  What's more, if you have a copy of 
Lucene in Action (1st edition), I wrote a small and simple framework for 
file-system indexing.  You could define your own parser for your own custom 
file format and the indexer will use it.  I think it's in Chapter 7.  Source 
code is freely downloadable from manning.com/hatcher2 .
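
A minimal sketch of that pluggable-parser idea follows. It is not the code from the book: the class and method names (FileSystemIndexerSketch, FileParser, parseKeyValue) and the "key=value" file format are hypothetical, and it assumes Java 8 or later. To stay self-contained it only collects the extracted fields in memory; a real indexer would turn each field map into a Lucene Document and add it through an IndexWriter at the marked spot.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class FileSystemIndexerSketch {

    /** Hypothetical parser contract: turn one file into named fields. */
    public interface FileParser {
        Map<String, String> parse(Path file) throws IOException;
    }

    // Parsers keyed by lower-case file extension (without the dot).
    private final Map<String, FileParser> parsers = new HashMap<>();

    public void register(String extension, FileParser parser) {
        parsers.put(extension.toLowerCase(), parser);
    }

    /** Walks root recursively and parses every file whose extension has a parser. */
    public List<Map<String, String>> collect(Path root) throws IOException {
        List<Path> files;
        try (Stream<Path> walk = Files.walk(root)) {
            files = walk.filter(Files::isRegularFile).sorted().collect(Collectors.toList());
        }
        List<Map<String, String>> docs = new ArrayList<>();
        for (Path file : files) {
            FileParser parser = parsers.get(extensionOf(file));
            if (parser == null) {
                continue; // no parser registered for this file type, so skip it
            }
            Map<String, String> fields = new LinkedHashMap<>(parser.parse(file));
            fields.put("path", file.toString());
            // A real indexer would build a Lucene Document from 'fields' here
            // and hand it to an IndexWriter instead of keeping it in a list.
            docs.add(fields);
        }
        return docs;
    }

    static String extensionOf(Path file) {
        String name = file.getFileName().toString();
        int dot = name.lastIndexOf('.');
        return dot < 0 ? "" : name.substring(dot + 1).toLowerCase();
    }

    /** Example parser for a made-up format with one "key=value" pair per line. */
    public static Map<String, String> parseKeyValue(Path file) throws IOException {
        Map<String, String> fields = new LinkedHashMap<>();
        for (String line : Files.readAllLines(file, StandardCharsets.UTF_8)) {
            int eq = line.indexOf('=');
            if (eq > 0) {
                fields.put(line.substring(0, eq).trim(), line.substring(eq + 1).trim());
            }
        }
        return fields;
    }
}
```

A custom format would be plugged in with something like indexer.register("rec", FileSystemIndexerSketch::parseKeyValue), where "rec" stands in for whatever extension the proprietary files use. Keeping the parser behind a small interface means the directory walk and the indexing step do not change when a new file type is added.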


Otis
--
Sematext is hiring -- http://sematext.com/about/jobs.html?mls
Lucene, Solr, Nutch, Katta, Hadoop, HBase, UIMA, NLP, NER, IR



----- Original Message ----
> From: "[email protected]" <[email protected]>
> To: [email protected]
> Sent: Monday, July 27, 2009 3:35:36 PM
> Subject: Using Nutch (w/custom plugin) to crawl vs. custom Lucene app
> 
> Hi,
> 
> I've been familiarizing myself with Nutch in preparation for putting together
> a proof-of-concept (POC) that we want.  Basically, we have some files of a
> proprietary file type, and we want to be able to search on specific "fields"
> within these files.  The files are physically stored on the local filesystem.
> 
> Thus far, I've gotten an initial Nutch instance working, and also a 2nd Nutch
> instance, configured for crawling the local filesystem.  These test instances
> just use the "out-of-box" Nutch and Nutch plugins, e.g., the PDF plugin, just
> to allow me to get familiar with the Nutch software.
> 
> Having done that, my original idea was to write some Nutch plugins that could
> be used with a Nutch crawl.
> 
> However, we already have some previously built apps that basically "crawl"
> the local filesystem (e.g., they do a recursive directory search) and find
> all of these files.  These are Java apps that we previously built for
> various purposes.
> 
> So, I'm wondering if it might make more sense (and, I think, be easier) to
> take one of those existing apps and just enhance it to build Lucene indexes,
> which could then be used by the Nutch web app (as a web-based search app)?
> 
> As I said, I'm really new to Nutch, and also to Lucene, but from what I've
> researched so far, it *looks like* it'd be fairly easy to extend some of our
> existing apps to generate Lucene indexes, and I have some questions:
> 
> - If my custom Java app can be extended to "just" build indexes using Lucene,
> is that all that it needs to do in order for these to work OK with the Nutch
> web app?
> 
> - Am I underestimating the effort needed to build the Lucene indexes that the 
> Nutch web app could use?
> 
> I was wondering if anyone here has had to go through a similar situation
> (Nutch plugin for custom file type vs. custom crawl app to build Lucene
> indexes that the Nutch web app can use)?
> 
> Any other thoughts on all of this would be greatly appreciated from the
> Nutch/Lucene experts here!
> 
> Thanks,
> Jim
