RE: indexing just certain content

BELLINI ADAM Sat, 10 Oct 2009 08:43:28 -0700

What I want is exactly what is explained in this second post: "How to ignore
search results that don't have related keywords in main body?"
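
For readers following the thread, the idea discussed in the quoted messages below (drop unwanted sections of the page at parse time, keep an important part in the parse metadata, and read it back when building index fields) could be sketched roughly as follows. This is only an illustration in plain Java with JDK classes; it does not use the Nutch plugin API. The class name, the method names `parseStep` and `indexStep`, the element ids, and the field names `content` and `important` are all hypothetical. It also assumes the page is well-formed XHTML, whereas a crawler would normally hand the filter a DOM built by a lenient HTML parser.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical sketch: none of these names come from Nutch.
class DualRepresentationSketch {

    // Collapse runs of whitespace so extracted text is easy to compare.
    static String normalize(String s) {
        return s.trim().replaceAll("\\s+", " ");
    }

    // "Parse step": remember the important part, delete unwanted sections,
    // and return both results in a metadata-like map.
    static Map<String, String> parseStep(String xhtml, String importantId,
                                         List<String> unwantedIds) {
        Document doc;
        try {
            doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));
        } catch (Exception e) {
            // Assumption: input is well-formed XHTML.
            throw new IllegalArgumentException("input must be well-formed XHTML", e);
        }

        // getElementsByTagName returns a live list, so snapshot it
        // before removing anything from the tree.
        NodeList live = doc.getElementsByTagName("*");
        List<Element> elements = new ArrayList<>();
        for (int i = 0; i < live.getLength(); i++) {
            elements.add((Element) live.item(i));
        }

        Map<String, String> metadata = new LinkedHashMap<>();
        for (Element e : elements) {
            if (e.getAttribute("id").equals(importantId)) {
                metadata.put("important", normalize(e.getTextContent()));
            }
        }
        for (Element e : elements) {
            if (unwantedIds.contains(e.getAttribute("id")) && e.getParentNode() != null) {
                e.getParentNode().removeChild(e);
            }
        }
        metadata.put("content", normalize(doc.getDocumentElement().getTextContent()));
        return metadata;
    }

    // "Index step": copy the stored parts into separate index fields.
    static Map<String, String> indexStep(Map<String, String> metadata) {
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("content", metadata.get("content"));
        if (metadata.get("important") != null) {
            fields.put("important", metadata.get("important"));
        }
        return fields;
    }

    public static void main(String[] args) {
        String page = "<html><body><div id=\"nav\">Home About</div>"
                + "<div id=\"main\"><p id=\"abstract\">Nutch plugins</p> "
                + "<p>Body text here</p></div>"
                + "<div id=\"footer\">Copyright</div></body></html>";
        Map<String, String> parsed =
                parseStep(page, "abstract", Arrays.asList("nav", "footer"));
        System.out.println(indexStep(parsed));
    }
}
```

In an actual plugin the two steps would live in separate classes (an HtmlParseFilter and an IndexingFilter, as described in the quoted reply), with the parse metadata carrying the extracted text between them.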





> From: mbel...@msn.com
> To: nutch-user@lucene.apache.org
> Subject: RE: indexing just certain content
> Date: Sat, 10 Oct 2009 15:35:31 +0000
> 
> 
> Yes MilleBii,
> 
> I told you before that I created a DublinCore metadata parser and
> indexer... so I parsed my HTML and created fields to get my DC metadata... my
> missing piece is how to delete sections from an HTML page :( If I find
> this piece, the rest will be a piece of cake :)
> 
> 
> 
> 
> > Date: Sat, 10 Oct 2009 16:41:44 +0200
> > Subject: Re: indexing just certain content
> > From: mille...@gmail.com
> > To: nutch-user@lucene.apache.org
> > 
> > Andrzej,
> > 
> > Great !!!
> > I did not realize you could put your own content in ParseData.metadata and
> > read it back in the IndexingFilter... this was my missing piece in the
> > puzzle, for the rest I knew what to do.
> > 
> > Thanks,
> > 
> > 
> > 
> > 2009/10/10 Andrzej Bialecki <a...@getopt.org>
> > 
> > > MilleBii wrote:
> > >
> > >> Andrzej,
> > >>
> > >> The use case you are thinking of is: at the parsing stage, filter out
> > >> garbage content and index only the rest.
> > >>
> > >> I have a different use case: I want to keep everything as in standard
> > >> indexing _AND_ also extract a part to be indexed in a dedicated field
> > >> (which will be boosted at search time). In my case, certain parts of a
> > >> document have more importance than others.
> > >>
> > >> So I would like either
> > >> 1. to access the HTML representation at indexing time... not possible,
> > >> or I did not find how
> > >> 2. to create a dual representation of the document: plain & standard,
> > >> and filtered
> > >>
> > >> I think option 2 is much better because it fits the model better and
> > >> allows for a lot of other use cases.
> > >>
> > >
> > > Actually, creativecommons provides hints on how to do this .. but to be
> > > more explicit:
> > >
> > > * in your HtmlParseFilter you need to extract from DOM tree the parts that
> > > you want, and put them inside ParseData.metadata. This way you will
> > > preserve both the original text, and your special parts that you extracted.
> > >
> > > * in your IndexingFilter you will retrieve the parts from
> > > ParseData.metadata and add them as additional index fields (don't forget
> > > to specify indexing backend options).
> > >
> > > * in your QueryFilter plugin.xml you declare that QueryParser should pass
> > > your special fields without treating them as terms, and in the
> > > implementation you create a BooleanClause to be added to the translated
> > > query.
> > >
> > >
> > >
> > > --
> > > Best regards,
> > > Andrzej Bialecki     <><
> > >  ___. ___ ___ ___ _ _   __________________________________
> > > [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
> > > ___|||__||  \|  ||  |  Embedded Unix, System Integration
> > > http://www.sigram.com  Contact: info at sigram dot com
> > >
> > >
> > 
> > 
> > -- 
> > -MilleBii-
                                          
