MilleBii wrote:
Andrzej,
The use case you are thinking of is: at the parsing stage, filter out garbage
content and index only the rest.
I have a different use case: I want to keep everything for standard indexing
_AND_ also extract parts to be indexed in a dedicated field (which will be
boosted at search time). In my case, certain parts of a document have more
importance than others.
So I would like either:
1. to access the HTML representation at indexing time... not possible, or I
did not find out how, or
2. to create a dual representation of the document: the plain, standard one
plus a filtered version.
I think option 2 is much better because it fits the model better and allows
for many other use cases.
Actually, the creativecommons plugin provides hints on how to do this, but to
be more explicit:
* in your HtmlParseFilter you need to extract from the DOM tree the parts
that you want and put them inside ParseData.metadata. This way you will
preserve both the original text and the special parts that you extracted
(first sketch below).
* in your IndexingFilter you retrieve those parts from ParseData.metadata
and add them as additional index fields (don't forget to specify the
indexing backend options) -- second sketch below.
* in your QueryFilter plugin.xml you declare that the QueryParser should
pass your special fields through without treating them as ordinary terms,
and in the implementation you create a BooleanClause to be added to the
translated query (third sketch below).
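
To make the parse side concrete, here is a rough sketch against the 1.0-era
API. The field name "important" and the h1/h2 selection are only placeholders,
pick whatever matters in your pages, and double-check the signatures against
the version you actually run:

import org.apache.hadoop.conf.Configuration;
import org.apache.nutch.parse.HTMLMetaTags;
import org.apache.nutch.parse.HtmlParseFilter;
import org.apache.nutch.parse.Parse;
import org.apache.nutch.parse.ParseResult;
import org.apache.nutch.protocol.Content;
import org.w3c.dom.DocumentFragment;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

public class ImportantPartsParseFilter implements HtmlParseFilter {

  private Configuration conf;

  public ParseResult filter(Content content, ParseResult parseResult,
      HTMLMetaTags metaTags, DocumentFragment doc) {
    StringBuilder sb = new StringBuilder();
    collect(doc, sb);                                   // walk the DOM tree
    Parse parse = parseResult.get(content.getUrl());
    // keep the extracted text next to the normal parse text
    parse.getData().getParseMeta().set("important", sb.toString());
    return parseResult;
  }

  // placeholder rule: keep the text of h1/h2 elements only
  private void collect(Node node, StringBuilder sb) {
    String name = node.getNodeName().toLowerCase();
    if ("h1".equals(name) || "h2".equals(name)) {
      sb.append(node.getTextContent()).append(' ');
      return;
    }
    NodeList kids = node.getChildNodes();
    for (int i = 0; i < kids.getLength(); i++) {
      collect(kids.item(i), sb);
    }
  }

  public void setConf(Configuration conf) { this.conf = conf; }
  public Configuration getConf() { return conf; }
}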
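
The indexing side would then look roughly like this (again 1.0-style; the
addIndexBackendOptions() part is what I meant by the indexing backend
options -- if your version doesn't have that method, configure the backend
the way your other indexing plugins do):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.Text;
import org.apache.nutch.crawl.CrawlDatum;
import org.apache.nutch.crawl.Inlinks;
import org.apache.nutch.indexer.IndexingException;
import org.apache.nutch.indexer.IndexingFilter;
import org.apache.nutch.indexer.NutchDocument;
import org.apache.nutch.indexer.lucene.LuceneWriter;
import org.apache.nutch.parse.Parse;

public class ImportantPartsIndexingFilter implements IndexingFilter {

  private Configuration conf;

  public NutchDocument filter(NutchDocument doc, Parse parse, Text url,
      CrawlDatum datum, Inlinks inlinks) throws IndexingException {
    String important = parse.getData().getParseMeta().get("important");
    if (important != null) {
      doc.add("important", important);    // the field you will boost at search time
    }
    return doc;
  }

  // the "indexing backend options": store and tokenize the new field
  public void addIndexBackendOptions(Configuration conf) {
    LuceneWriter.addFieldOptions("important", LuceneWriter.STORE.YES,
        LuceneWriter.INDEX.TOKENIZED, conf);
  }

  public void setConf(Configuration conf) { this.conf = conf; }
  public Configuration getConf() { return conf; }
}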
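
For the query side the least work is to do what query-site does: extend
RawFieldQueryFilter, which builds the BooleanClause for you, and declare the
field in the plugin.xml of your query plugin. I'm writing the class and
parameter names from memory, so compare with the sources and plugin.xml
shipped with query-site before relying on them (IIRC there is also a
FieldQueryFilter base class that runs the terms through the analyzer, which
may suit a tokenized field better):

import org.apache.hadoop.conf.Configuration;
import org.apache.nutch.searcher.RawFieldQueryFilter;

// turns "important:foo" query clauses into a boosted clause on the
// "important" index field
public class ImportantQueryFilter extends RawFieldQueryFilter {

  private Configuration conf;

  public ImportantQueryFilter() {
    super("important", 5.0f);            // example boost, tune to taste
  }

  public void setConf(Configuration conf) { this.conf = conf; }
  public Configuration getConf() { return conf; }
}

and in plugin.xml something like:

<extension id="org.example.important"
           name="Important Field Query Filter"
           point="org.apache.nutch.searcher.QueryFilter">
  <implementation id="ImportantQueryFilter"
                  class="org.example.ImportantQueryFilter">
    <parameter name="fields" value="important"/>
  </implementation>
</extension>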
--
Best regards,
Andrzej Bialecki <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com