But if I do that, then the other fields wouldn't get indexed, would they? What if I wanted to search for the keywords "sapphire" (which might only appear in the general description for the merchant) and "beanie baby" (which might appear in the content on one of their pages)? Wouldn't both need to be indexed?
What you're suggesting would only allow me to have a search engine for my pages, correct? I think what I'm asking is: can I use Nutch as a search engine for my pages in addition to some additional metadata about the content -- and if so, how? Would I have to tag each of the pages with the metadata? That seems like a lot of redundancy, i.e. if I index 50 pages for merchant www.xyzcollectibles.com, they're all going to have the same name, general description, etc. in their metadata.

-jim

-------------- Original message --------------
> [EMAIL PROTECTED] wrote:
>
> >Hello,
> >
> >I'm trying to determine if I could use Nutch for a project and am having some conceptual difficulties.
> >
> >It appears that Nutch indexes by page. Each page/url is a Lucene document, with fields for content, title, url, boost, etc... but I want to have a set of pages represented by a single document. Is that possible?
> >
> >For example, suppose I have a merchant who sells collectibles, and I have some information on that merchant, such as name, general description, store location, hours of operation, contact information, etc... and I also have a URL to his site where more information is available. I want to be able to search for merchants based on keywords found in the name, general description, or any of the pages crawled by the url they specified. Is this possible?
> >
> >It seems like to do this I'd have to add fields for name, general description, etc... to each of the pages (docs) crawled, which seems like a lot of redundancy. Is there a better way to do this?
> >
> >Thanks,
> >
> >-jim
> >
>
> Add a unique identifier to the document and use a separate external database.
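
For reference, here is roughly how I picture the suggestion quoted above (a unique identifier on each page document, plus a separate external database for the merchant data). This is only a minimal sketch and does not use the Nutch or Lucene APIs: plain Python structures stand in for the page index and the database, and the names MERCHANTS, PAGES, merchant_id, merchants_matching and search_merchants, the sample records, and the toybarn.example URL are all made up for illustration.

```python
# Hypothetical stand-in for the external database: one record per merchant,
# stored once instead of being repeated on every crawled page.
MERCHANTS = {
    "m1": {"name": "XYZ Collectibles",
           "description": "Sapphire jewelry and rare collectibles"},
    "m2": {"name": "Toy Barn",
           "description": "Vintage toys and games"},
}

# Hypothetical stand-in for the page index: one document per crawled page.
# The only merchant data carried by a page is the merchant_id key.
PAGES = [
    {"url": "http://www.xyzcollectibles.com/plush", "merchant_id": "m1",
     "content": "Beanie baby bears in stock"},
    {"url": "http://www.xyzcollectibles.com/about", "merchant_id": "m1",
     "content": "About our store"},
    {"url": "http://www.toybarn.example/plush", "merchant_id": "m2",
     "content": "Beanie baby cats"},
]


def _matches(text, keyword):
    """Case-insensitive substring match (a real index would tokenize)."""
    return keyword.lower() in text.lower()


def merchants_matching(keyword):
    """Ids of merchants whose metadata or any crawled page has the keyword."""
    ids = set()
    # Look in the external merchant records.
    for merchant_id, record in MERCHANTS.items():
        if _matches(record["name"], keyword) or _matches(record["description"], keyword):
            ids.add(merchant_id)
    # Look in the page documents and map each hit back to its merchant.
    for page in PAGES:
        if _matches(page["content"], keyword):
            ids.add(page["merchant_id"])
    return ids


def search_merchants(keywords):
    """Merchants for which every keyword hits somewhere (AND semantics)."""
    result = None
    for keyword in keywords:
        hits = merchants_matching(keyword)
        result = hits if result is None else result & hits
    return sorted(result or [])


print(search_merchants(["sapphire", "beanie baby"]))
```

With this layout "sapphire" is found in the merchant record and "beanie baby" in a page, and the shared merchant_id is what ties the two hits to the same merchant, so neither the name nor the description has to be copied onto all 50 pages.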
