Re: Documents in Nutch

dprantzalos Sat, 24 Sep 2005 09:18:34 -0700

But if I do that, then the other fields wouldn't get indexed, would they? What
if I wanted to search for the keywords "sapphire" (which might only appear in
the general description for the merchant) and "Beanie Baby" (which might
appear in the content on one of their pages)? Wouldn't both need to be indexed?


What you're suggesting would only allow me to have a search engine for my
pages, correct? I think what I'm asking is whether I can use Nutch as a search
engine for my pages plus some additional metadata about the content, and if
so, how. Would I have to tag each of the pages with the metadata? That seems
like a lot of redundancy: if I index 50 pages for merchant
www.xyzcollectibles.com, they're all going to have the same name, general
description, etc. in their metadata.

-jim


-------------- Original message -------------- 

> [EMAIL PROTECTED] wrote: 
> 
> >Hello, 
> > 
> >I'm trying to determine if I could use Nutch for a project and am having
> >some conceptual difficulties.
> >
> >It appears that Nutch indexes by page. Each page/URL is a Lucene document,
> >with fields for content, title, url, boost, etc., but I want to have a set
> >of pages represented by a single document. Is that possible?
> >
> >For example, suppose I have a merchant who sells collectibles, and I have
> >some information on that merchant, such as name, general description, store
> >location, hours of operation, contact information, etc., and I also have a
> >URL to his site where more information is available. I want to be able to
> >search for merchants based on keywords found in the name, general
> >description, or any of the pages crawled from the URL they specified. Is
> >this possible?
> >
> >It seems like to do this I'd have to add fields for name, general
> >description, etc. to each of the pages (docs) crawled, which seems like a
> >lot of redundancy. Is there a better way to do this?
> > 
> >Thanks, 
> > 
> >-jim 
> > 
> > 
> Add a unique identifier to the document and use a separate external
> database.
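
For illustration, here is a rough sketch of how the suggestion quoted above
might look, together with the two-keyword search from the top of this message.
It assumes each crawled page carries a merchant id and that the merchant
details live in one separate record per merchant. Plain Java collections stand
in for both the Nutch index and the external database; none of this is Nutch
or Lucene API, and the class, field, and method names (MerchantSearchSketch,
merchantId, search, and so on) are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

class MerchantSearchSketch {

    // Stand-in for one row in the external database (hypothetical fields).
    static final class Merchant {
        final String id, name, description;
        Merchant(String id, String name, String description) {
            this.id = id; this.name = name; this.description = description;
        }
    }

    // Stand-in for one crawled page. merchantId is the unique identifier
    // that would be stored as an extra field on every indexed page.
    static final class Page {
        final String url, merchantId, content;
        Page(String url, String merchantId, String content) {
            this.url = url; this.merchantId = merchantId; this.content = content;
        }
    }

    // Case-insensitive substring match; a real index would tokenize instead.
    static boolean contains(String text, String keyword) {
        return text.toLowerCase(Locale.ROOT).contains(keyword.toLowerCase(Locale.ROOT));
    }

    // Returns the ids of merchants for which every keyword matches either
    // the merchant record or at least one of that merchant's pages.
    static List<String> search(Map<String, Merchant> merchants,
                               List<Page> pages, List<String> keywords) {
        List<String> result = new ArrayList<>();
        for (Merchant m : merchants.values()) {
            boolean allMatched = true;
            for (String kw : keywords) {
                boolean hit = contains(m.name, kw) || contains(m.description, kw);
                for (Page p : pages) {
                    if (hit) break;
                    hit = p.merchantId.equals(m.id) && contains(p.content, kw);
                }
                if (!hit) { allMatched = false; break; }
            }
            if (allMatched) result.add(m.id);
        }
        return result;
    }

    // Made-up sample data: the merchant details are stored once per merchant.
    static Map<String, Merchant> sampleMerchants() {
        Map<String, Merchant> m = new LinkedHashMap<>();
        m.put("xyz", new Merchant("xyz", "XYZ Collectibles",
                "Sapphire jewelry and rare collectibles"));
        m.put("abc", new Merchant("abc", "ABC Toys", "Toys and games"));
        return m;
    }

    // Made-up sample pages: each page only repeats the merchant id.
    static List<Page> samplePages() {
        return Arrays.asList(
            new Page("http://www.xyzcollectibles.com/plush", "xyz",
                    "Retired Beanie Baby bears in stock"),
            new Page("http://abc.example.com/plush", "abc", "Beanie Baby sale"));
    }

    public static void main(String[] args) {
        List<String> hits = search(sampleMerchants(), samplePages(),
                Arrays.asList("sapphire", "beanie baby"));
        System.out.println(hits); // prints [xyz]
    }
}
```

In this sketch only the id is repeated on each page, so the name and general
description are not duplicated across the 50 pages. The cost is that a search
has to consult both stores and combine the matches per merchant.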
