What I want is explained exactly in this second post: "How to ignore search results that don't have related keywords in main body?"

> From: mbel...@msn.com
> To: nutch-user@lucene.apache.org
> Subject: RE: indexing just certain content
> Date: Sat, 10 Oct 2009 15:35:31 +0000
>
> yes
>
> MilleBii
>
> I told you before that I created a DublinCore metadata parser and
> indexer... so I parsed my HTML and created fields to get my DC metadata...
> my missing piece is how to delete sections from an HTML page :( If I find
> this piece, the rest will be a piece of cake :)
>
> > Date: Sat, 10 Oct 2009 16:41:44 +0200
> > Subject: Re: indexing just certain content
> > From: mille...@gmail.com
> > To: nutch-user@lucene.apache.org
> >
> > Andrzej,
> >
> > Great !!!
> > I did not realize you could put your own content in ParseData.metadata and
> > read it back in the IndexingFilter... this was my missing piece in the
> > puzzle, for the rest I knew what to do.
> >
> > Thanks,
> >
> > 2009/10/10 Andrzej Bialecki <a...@getopt.org>
> >
> > > MilleBii wrote:
> > >
> > >> Andrzej,
> > >>
> > >> The use case you are thinking of is: at the parsing stage, filter out
> > >> garbage content and index only the rest.
> > >>
> > >> I have a different use case: I want to keep everything as in standard
> > >> indexing _AND_ also extract a part to be indexed in a dedicated field
> > >> (which will be boosted at search time). In my case, certain parts of a
> > >> document have more importance than others.
> > >>
> > >> So I would like either
> > >> 1. to access the HTML representation at indexing time... not possible,
> > >>    or I did not find how
> > >> 2. to create a dual representation of the document: the plain, standard
> > >>    document and a filtered document
> > >>
> > >> I think option 2 is much better because it fits the model better and
> > >> allows for a lot of other use cases.
> > >>
> > >
> > > Actually, creativecommons provides hints how to do this .. but to be more
> > > explicit:
> > >
> > > * in your HtmlParseFilter you need to extract from the DOM tree the parts
> > > that you want, and put them inside ParseData.metadata. This way you will
> > > preserve both the original text and the special parts that you extracted.
> > >
> > > * in your IndexingFilter you will retrieve the parts from
> > > ParseData.metadata and add them as additional index fields (don't forget
> > > to specify indexing backend options).
> > >
> > > * in your QueryFilter plugin.xml you declare that QueryParser should pass
> > > your special fields without treating them as terms, and in the
> > > implementation you create a BooleanClause to be added to the translated
> > > query.
> > >
> > > --
> > > Best regards,
> > > Andrzej Bialecki     <><
> > >  ___. ___ ___ ___ _ _   __________________________________
> > > [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
> > > ___|||__||  \|  ||  |  Embedded Unix, System Integration
> > > http://www.sigram.com  Contact: info at sigram dot com
> >
> > --
> > -MilleBii-
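
The first two steps Andrzej describes (extract the wanted parts from the DOM at parse time, then read them back at index time) can be sketched with plain JDK classes. This is only an illustration of the data flow, not Nutch code: the `Map` stands in for ParseData.metadata, `parseStage` and `indexStage` stand in for an HtmlParseFilter and an IndexingFilter, and the `x-important` key, the `important` field name, and the `class="important"` marker are all hypothetical. It also assumes well-formed XHTML input, which real pages often are not.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

class DualRepresentationSketch {

    // Hypothetical metadata key; any name unique to your plugin would do.
    static final String KEY = "x-important";

    // Stand-in for the parse-filter step: copy the marked section into the
    // metadata map. The DOM itself is left untouched, so the full text
    // remains available for standard indexing.
    static Map<String, String> parseStage(Document dom) {
        Map<String, String> metadata = new HashMap<>();
        NodeList divs = dom.getElementsByTagName("div");
        for (int i = 0; i < divs.getLength(); i++) {
            Element div = (Element) divs.item(i);
            if ("important".equals(div.getAttribute("class"))) {
                metadata.put(KEY, div.getTextContent().trim());
                break; // keep only the first marked section
            }
        }
        return metadata;
    }

    // Stand-in for the indexing-filter step: index the full text as usual
    // and add the extracted part as a separate field that can be boosted.
    static Map<String, String> indexStage(String fullText, Map<String, String> metadata) {
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("content", fullText);
        String important = metadata.get(KEY);
        if (important != null) {
            fields.put("important", important);
        }
        return fields;
    }

    // Helper for the sketch: parse a well-formed XHTML string into a DOM.
    static Document parseXhtml(String xhtml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xhtml.getBytes(StandardCharsets.UTF_8)));
        } catch (Exception e) {
            throw new IllegalStateException("could not parse input", e);
        }
    }

    public static void main(String[] args) {
        Document dom = parseXhtml(
                "<html><body><p>Navigation</p>"
                + "<div class=\"important\">Key findings</div></body></html>");
        Map<String, String> metadata = parseStage(dom);
        String fullText = dom.getDocumentElement().getTextContent();
        System.out.println(indexStage(fullText, metadata));
    }
}
```

Because the parse step only copies text out of the DOM and never removes nodes, the standard content field still holds the whole document, which is the "dual representation" MilleBii asked for. The third step (the QueryFilter) is not covered here.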