[EMAIL PROTECTED] wrote:

Hello,

I'm trying to determine whether I could use Nutch for a project, and I'm having some 
conceptual difficulties.

It appears that Nutch indexes by page. Each page/url is a Lucene document, with 
fields for content, title, url, boost, etc., but I want to have a set of pages 
represented by a single document. Is that possible?

For example, suppose I have a merchant who sells collectibles, and I have some 
information on that merchant, such as name, general description, store 
location, hours of operation, contact information, etc., and I also have a URL 
to his site where more information is available. I want to be able to search 
for merchants based on keywords found in the name, general description, or any 
of the pages crawled from the URL he specified. Is this possible?

It seems that to do this I'd have to add fields for name, general description, 
etc., to each of the pages (docs) crawled, which seems like a lot of 
redundancy. Is there a better way to do this?

Thanks,

-jim
Add a unique identifier to each document and keep the merchant information in a 
separate external database.
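
One way to picture this suggestion is the minimal Python sketch below. It is not 
Nutch or Lucene code: the PAGES list is a hypothetical stand-in for the per-page 
index, the merchant_id field, the merchant table, and the function names are all 
assumptions made for illustration, and the sample merchants are invented. Each 
page carries only the merchant's identifier, the merchant details live once in an 
external database (SQLite here), and a search groups page hits by identifier 
before looking up the merchant record.

```python
import sqlite3
from collections import defaultdict

# Hypothetical stand-in for the page index: one entry per crawled page.
# The only merchant data stored per page is the identifier.
PAGES = [
    {"url": "http://example.com/a", "merchant_id": 1,
     "content": "vintage baseball cards and comics"},
    {"url": "http://example.com/b", "merchant_id": 1,
     "content": "store news and rare coins"},
    {"url": "http://example.org/", "merchant_id": 2,
     "content": "antique stamps and coins"},
]


def build_merchant_db():
    """Create the external database holding one row per merchant."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE merchant (id INTEGER PRIMARY KEY, name TEXT, description TEXT)"
    )
    conn.executemany(
        "INSERT INTO merchant VALUES (?, ?, ?)",
        [
            (1, "Card Shack", "Sports cards and comics"),
            (2, "Stamp Corner", "Stamps for collectors"),
        ],
    )
    return conn


def search_merchants(conn, keyword):
    """Return (merchant name, matching page URLs) pairs for a keyword."""
    kw = keyword.lower()
    hits = defaultdict(list)

    # 1. Search the crawled pages and group the hits by merchant identifier.
    for page in PAGES:
        if kw in page["content"].lower():
            hits[page["merchant_id"]].append(page["url"])

    # 2. Search the merchant-level fields stored in the external database.
    pattern = f"%{kw}%"
    rows = conn.execute(
        "SELECT id FROM merchant "
        "WHERE lower(name) LIKE ? OR lower(description) LIKE ?",
        (pattern, pattern),
    )
    for (merchant_id,) in rows:
        hits.setdefault(merchant_id, [])

    # 3. Resolve each identifier to its merchant record.
    results = []
    for merchant_id in sorted(hits):
        (name,) = conn.execute(
            "SELECT name FROM merchant WHERE id = ?", (merchant_id,)
        ).fetchone()
        results.append((name, hits[merchant_id]))
    return results
```

With this layout the name, description, and other merchant fields are stored 
once rather than copied into every page document, at the cost of a second lookup 
per result.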
