Yes, MilleBii,

I told you before that I created a DublinCore metadata parser and indexer, so I parsed my HTML and created fields to hold my DC metadata. My missing piece is how to delete sections from an HTML page :( If I find this piece, the rest will be a piece of cake :)

> Date: Sat, 10 Oct 2009 16:41:44 +0200
> Subject: Re: indexing just certain content
> From: mille...@gmail.com
> To: nutch-user@lucene.apache.org
>
> Andrzej,
>
> Great !!!
> I did not realize you could put your own content in ParseData.metadata and
> read it back in the IndexingFilter... this was my missing piece in the
> puzzle, for the rest I knew what to do.
>
> Thanks,
>
> 2009/10/10 Andrzej Bialecki <a...@getopt.org>
>
> > MilleBii wrote:
> >
> >> Andrzej,
> >>
> >> The use case you are thinking of is: at the parsing stage, filter out
> >> garbage content and index only the rest.
> >>
> >> I have a different use case, I want to keep everything as standard
> >> indexing _AND_ also extract part for being indexed in a dedicated field
> >> (which will be boosted at search time). In a document certain parts have
> >> more importance than others in my case.
> >>
> >> So I would like either
> >> 1. to access html representation at indexing time... not possible or did
> >>    not find how
> >> 2. create a dual representation of the document, plain & standard,
> >>    filtered document
> >>
> >> I think option 2. is much better because it better fits the model and
> >> allows for a lot of different other use cases.
> >>
> >
> > Actually, creativecommons provides hints how to do this .. but to be more
> > explicit:
> >
> > * in your HtmlParseFilter you need to extract from DOM tree the parts that
> >   you want, and put them inside ParseData.metadata. This way you will preserve
> >   both the original text, and your special parts that you extracted.
> >
> > * in your IndexingFilter you will retrieve the parts from
> >   ParseData.metadata and add them as additional index fields (don't forget to
> >   specify indexing backend options).
> >
> > * in your QueryFilter plugin.xml you declare that QueryParser should pass
> >   your special fields without treating them as terms, and in the
> >   implementation you create a BooleanClause to be added to the translated
> >   query.
> >
> > --
> > Best regards,
> > Andrzej Bialecki <><
> > ___. ___ ___ ___ _ _ __________________________________
> > [__ || __|__/|__||\/| Information Retrieval, Semantic Web
> > ___|||__|| \| || | Embedded Unix, System Integration
> > http://www.sigram.com Contact: info at sigram dot com
>
> --
> -MilleBii-
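
For the "delete sections" part, here is a minimal sketch of what I have in mind, written against the plain JDK DOM classes (org.w3c.dom) rather than any Nutch API. The class name SectionStripper, its methods, and the idea of selecting unwanted sections by their id attribute are all my own assumptions for illustration; the same walk should apply to whatever DOM tree the HtmlParseFilter step works on, but I have not tested it inside Nutch.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper (not part of Nutch): prunes unwanted sections from a DOM tree.
class SectionStripper {

    /** Parses well-formed XHTML into a DOM; a stand-in for the tree an HTML parser would build. */
    static Document parse(String xhtml) throws Exception {
        return DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xhtml)));
    }

    /** Removes every element below root whose id attribute equals unwantedId. */
    static void removeById(Node root, String unwantedId) {
        // Collect first, then detach, so the tree is not modified while it is being walked.
        List<Node> doomed = new ArrayList<>();
        collect(root, unwantedId, doomed);
        for (Node n : doomed) {
            n.getParentNode().removeChild(n);
        }
    }

    private static void collect(Node node, String unwantedId, List<Node> out) {
        NodeList children = node.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            Node child = children.item(i);
            if (child instanceof Element
                    && unwantedId.equals(((Element) child).getAttribute("id"))) {
                out.add(child); // the whole subtree is dropped, so there is no need to descend
            } else {
                collect(child, unwantedId, out);
            }
        }
    }

    /** Returns the text that is left, with whitespace runs collapsed to single spaces. */
    static String text(Node root) {
        // getTextContent() is null on a Document node, so start from its root element instead.
        Node start = (root instanceof Document) ? ((Document) root).getDocumentElement() : root;
        return start.getTextContent().replaceAll("\\s+", " ").trim();
    }
}
```

Calling removeById once per unwanted section and then text on what remains gives the cleaned-up content, which could then be stored and indexed the way Andrzej describes in the quoted message. Matching on id is only one possible rule; matching on tag name or class would use the same collect-then-remove pattern.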