[EMAIL PROTECTED] wrote:
Hello,
I'm trying to determine if I could use Nutch for a project and am having some
conceptual difficulties.
It appears that Nutch indexes by page. Each page/url is a Lucene document, with
fields for content, title, url, boost, etc... but I want to have a set of pages
represented by a single document. Is that possible?
For example, suppose I have a merchant who sells collectibles, and I have some
information on that merchant, such as name, general description, store
location, hours of operation, contact information, etc... and I also have a URL
to his site where more information is available. I want to be able to search
for merchants based on keywords found in the name, general description, or any
of the pages crawled from the URL they specified. Is this possible?
It seems that to do this I'd have to add fields for name, general description,
etc... to each of the pages (docs) crawled, which seems like a lot of
redundancy. Is there a better way to do this?
Thanks,
-jim
Add a unique identifier (for example, a merchant ID) as a field on each crawled
document, and keep the merchant details in a separate external database keyed
by that identifier.
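
To make the idea concrete, here is a rough sketch in Python rather than against the Nutch or Lucene APIs. Everything in it is an assumption made for illustration: the `merchants` table, the `merchant_id` field, the sample data, and the in-memory `PAGES` list standing in for the page index. The point is only the shape of the lookup: each page carries the identifier and nothing else about the merchant, page hits are collapsed by identifier, and the details come from the database.

```python
import sqlite3

# Hypothetical external database holding one row per merchant.
def build_merchant_db():
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE merchants ("
        "id TEXT PRIMARY KEY, name TEXT, description TEXT, hours TEXT)"
    )
    conn.executemany(
        "INSERT INTO merchants VALUES (?, ?, ?, ?)",
        [
            ("m1", "Jim's Collectibles", "Vintage toys and coins", "9-5"),
            ("m2", "Card Corner", "Sports cards", "10-6"),
        ],
    )
    return conn

# Stand-in for the page index: every crawled page stores only the
# merchant's identifier, not a copy of the merchant's details.
PAGES = [
    {"url": "http://example.com/m1/", "merchant_id": "m1",
     "content": "rare coins and stamps"},
    {"url": "http://example.com/m1/toys", "merchant_id": "m1",
     "content": "tin toys, rare robots"},
    {"url": "http://example.org/m2/", "merchant_id": "m2",
     "content": "baseball cards"},
]

def search_pages(keyword):
    """Naive substring search over page content (placeholder for the index)."""
    kw = keyword.lower()
    return [p for p in PAGES if kw in p["content"].lower()]

def search_merchants(conn, keyword):
    """Return one merchant row per match, from page hits or merchant fields."""
    ids = []
    # Collapse page hits to distinct merchant identifiers.
    for page in search_pages(keyword):
        if page["merchant_id"] not in ids:
            ids.append(page["merchant_id"])
    # Also match the merchant's own name and description in the database.
    pattern = f"%{keyword}%"
    rows = conn.execute(
        "SELECT id FROM merchants "
        "WHERE name LIKE ? OR description LIKE ? ORDER BY id",
        (pattern, pattern),
    )
    for (merchant_id,) in rows:
        if merchant_id not in ids:
            ids.append(merchant_id)
    # Fetch the full details once per merchant.
    results = []
    for merchant_id in ids:
        row = conn.execute(
            "SELECT id, name, description, hours FROM merchants WHERE id = ?",
            (merchant_id,),
        ).fetchone()
        if row is not None:
            results.append(row)
    return results
```

Calling `search_merchants(build_merchant_db(), "rare")` yields a single row for merchant `m1` even though two of its pages match, which is the "set of pages represented by one result" behaviour asked about, without copying the merchant fields onto every page.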