RE: indexing just certain content

BELLINI ADAM Sat, 10 Oct 2009 08:36:05 -0700


I told you before that I created a Dublin Core metadata parser and indexer, so 
I parsed my HTML and created fields to hold my DC metadata. My missing piece is 
how to delete sections from an HTML page :( If I find this piece, the rest 
will be a piece of cake :)
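
To make the question concrete, here is a minimal sketch of what I mean by deleting sections: walk a DOM tree and drop every element with an unwanted tag name, together with its subtree. It uses only the JDK's org.w3c.dom API. The class name SectionStripper, the method names, and the tag set are hypothetical, the sample input is well-formed XHTML because the JDK parser is an XML parser, and how to wire this into a Nutch plugin is not shown.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of Nutch.
public class SectionStripper {

    /** Removes every descendant element of node whose tag name is in tags. */
    public static void strip(Node node, Set<String> tags) {
        // Copy the children first: NodeList is live and shifts during removal.
        NodeList live = node.getChildNodes();
        List<Node> children = new ArrayList<>();
        for (int i = 0; i < live.getLength(); i++) {
            children.add(live.item(i));
        }
        for (Node child : children) {
            boolean unwanted = child.getNodeType() == Node.ELEMENT_NODE
                    && tags.contains(child.getNodeName().toLowerCase(Locale.ROOT));
            if (unwanted) {
                node.removeChild(child);   // drops the whole subtree
            } else {
                strip(child, tags);        // keep the node, clean below it
            }
        }
    }

    /** Parses well-formed XHTML, strips the given tags, returns the remaining text. */
    public static String textWithout(String xhtml, Set<String> tags) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));
            strip(doc.getDocumentElement(), tags);
            return doc.getDocumentElement().getTextContent();
        } catch (Exception e) {
            throw new IllegalStateException("could not parse input", e);
        }
    }

    public static void main(String[] args) {
        String page = "<html><body><div>menu</div><p>real content</p>"
                + "<script>x()</script></body></html>";
        System.out.println(textWithout(page, Set.of("div", "script")));
    }
}
```

The strip method takes any org.w3c.dom.Node, so in principle the same walk could run on a DOM produced by a tolerant HTML parser; only the demo parsing step assumes well-formed input.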




> Date: Sat, 10 Oct 2009 16:41:44 +0200
> Subject: Re: indexing just certain content
> From: mille...@gmail.com
> To: nutch-user@lucene.apache.org
> 
> Andrzej,
> 
> Great !!!
> I did not realize you could put your own content in ParseData.metadata and
> read it back in the IndexingFilter... this was my missing piece in the
> puzzle, for the rest I knew what to do.
> 
> Thanks,
> 
> 
> 
> 2009/10/10 Andrzej Bialecki <a...@getopt.org>
> 
> > MilleBii wrote:
> >
> >> Andrzej,
> >>
> >> The use case you are thinking of is: at the parsing stage, filter out
> >> garbage content and index only the rest.
> >>
> >> I have a different use case: I want to keep everything as in standard
> >> indexing _AND_ also extract parts to be indexed in a dedicated field
> >> (which will be boosted at search time). In my case, certain parts of a
> >> document have more importance than others.
> >>
> >> So I would like either
> >> 1. to access the HTML representation at indexing time (not possible, or
> >> I did not find how), or
> >> 2. to create a dual representation of the document: the plain, standard
> >> one and a filtered one.
> >>
> >> I think option 2 is much better because it better fits the model and
> >> allows for a lot of other use cases.
> >>
> >
> > Actually, creativecommons provides hints how to do this .. but to be more
> > explicit:
> >
> > * in your HtmlParseFilter you need to extract from DOM tree the parts that
> > you want, and put them inside ParseData.metadata. This way you will preserve
> > both the original text, and your special parts that you extracted.
> >
> > * in your IndexingFilter you will retrieve the parts from
> > ParseData.metadata and add them as additional index fields (don't forget to
> > specify indexing backend options).
> >
> > * in your QueryFilter plugin.xml you declare that QueryParser should pass
> > your special fields without treating them as terms, and in the
> > implementation you create a BooleanClause to be added to the translated
> > query.
> >
> >
> >
> > --
> > Best regards,
> > Andrzej Bialecki     <><
> >  ___. ___ ___ ___ _ _   __________________________________
> > [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
> > ___|||__||  \|  ||  |  Embedded Unix, System Integration
> > http://www.sigram.com  Contact: info at sigram dot com
> >
> >
> 
> 
> -- 
> -MilleBii-
                                          