My initial thought was to use Nutch to crawl a site, aggregate all the content into a single string (or file), save it to my database, and later use Lucene to index it as just another field. I think that would work, but I don't know whether there is a better way.
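
To make the idea concrete, here is a rough sketch in plain Python. It is not Nutch or Lucene code: the function names, the field names (name, description, content), and the toy inverted index are all made up for illustration, and the page text is invented sample data. It only shows the shape of what I mean: one record per merchant, with the crawled pages folded into a single "content" field that is indexed alongside the other fields.

```python
from collections import defaultdict

# Hypothetical helper: fold all crawled page text for one merchant
# into a single string (what the crawl step would hand back).
def aggregate_pages(pages):
    return "\n".join(text for _url, text in pages)

# Hypothetical record layout: merchant metadata is stored once,
# and the aggregated crawl output is just another field.
def build_record(merchant_id, name, description, pages):
    return {
        "id": merchant_id,
        "name": name,
        "description": description,
        "content": aggregate_pages(pages),
    }

# Toy stand-in for a real indexer: maps each lowercased token
# from every field to the ids of the records containing it.
def build_index(records):
    index = defaultdict(set)
    for record in records:
        for field in ("name", "description", "content"):
            for token in record[field].lower().split():
                index[token.strip('.,;:!?"()')].add(record["id"])
    return index

# Return the ids of records that contain every term, in any field.
def search(index, *terms):
    hits = [index.get(term.lower(), set()) for term in terms]
    return set.intersection(*hits) if hits else set()

# Invented sample data for the merchant mentioned in this thread.
pages = [
    ("http://www.xyzcollectibles.com/", "Welcome to our shop"),
    ("http://www.xyzcollectibles.com/toys", "Rare beenie baby collection"),
]
record = build_record(
    "xyz", "XYZ Collectibles", "Sapphire jewelry and collectibles", pages
)
index = build_index([record])
matches = search(index, "sapphire", "beenie", "baby")
```

Here "sapphire" only appears in the description and "beenie baby" only in the crawled content, but both resolve to the same merchant record because the fields are indexed together. The trade-off is that a hit no longer points to the individual page it came from.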
-jim

-------------- Original message --------------
> I'm sorry I partially missed the point. I don't know how the internals
> of the indexing work, maybe someone else here can give an explanation?
>
> [EMAIL PROTECTED] wrote:
>
> >But if I do that, then the other fields wouldn't get indexed, would they?
> >What if I wanted to search for the keywords "sapphire" (which might only
> >appear in the general description for the merchant), and "beenie baby"
> >(which might appear in the content on one of their pages). Wouldn't both
> >need to be indexed?
> >
> >What you're suggesting would only allow me to have a search engine for my
> >pages, correct? and I think what I'm asking is can I use Nutch as a search
> >engine for my pages in addition to some additional meta data about the
> >content -- and if so, how? Would I have to tag each of the pages with the
> >meta data? because that seems like a lot of redundancy... i.e. if I index
> >50 pages for merchant www.xyzcollectibles.com they're all going to have
> >the same name, general description, etc... in their metadata.
> >
> >>Add an unique identifier to the document and use a separate external
> >>database.
