MilleBii wrote:
Andrzej,

The use case you are thinking of is: at the parsing stage, filter out garbage
content and index only the rest.

I have a different use case: I want to keep everything for standard indexing
_AND_ also extract parts to be indexed in a dedicated field (which will
be boosted at search time). In my case, certain parts of a document have more
importance than others.

So I would like to either:
1. access the HTML representation at indexing time... not possible, or I did
not find out how
2. create a dual representation of the document: the plain & standard one,
plus a filtered one

I think option 2 is much better because it fits the model better and allows
for a lot of other use cases.

Actually, the creativecommons plugin provides hints on how to do this, but to be more explicit:

* in your HtmlParseFilter you need to extract from the DOM tree the parts that you want and put them inside ParseData.metadata (see the first sketch below). This way you preserve both the original text and the special parts that you extracted.

* in your IndexingFilter you will retrieve the parts from ParseData.metadata and add them as additional index fields (don't forget to specify the indexing backend options); see the second sketch below.

* in your QueryFilter plugin.xml you declare that the QueryParser should pass your special fields through without treating them as plain query terms, and in the implementation you create a BooleanClause to be added to the translated query; see the third sketch below.
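
To illustrate the first step, here is a rough sketch of such an HtmlParseFilter. It assumes the Nutch 1.x interface (ParseResult-based; older versions pass a single Parse instead), and the class name, package, metadata key and the choice of <h1> elements as the "important" parts are all made up for the example:

// Sketch: copy the text of all <h1> elements into ParseData metadata
// under a custom key, leaving the normal parse text untouched.
package org.example.nutch;                       // hypothetical package

import org.apache.hadoop.conf.Configuration;
import org.apache.nutch.metadata.Metadata;
import org.apache.nutch.parse.HTMLMetaTags;
import org.apache.nutch.parse.HtmlParseFilter;
import org.apache.nutch.parse.Parse;
import org.apache.nutch.parse.ParseResult;
import org.apache.nutch.protocol.Content;
import org.w3c.dom.DocumentFragment;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

public class ImportantPartsParseFilter implements HtmlParseFilter {

  // hypothetical metadata key, shared with the indexing filter
  public static final String META_KEY = "x-important-parts";

  private Configuration conf;

  public ParseResult filter(Content content, ParseResult parseResult,
      HTMLMetaTags metaTags, DocumentFragment doc) {
    Parse parse = parseResult.get(content.getUrl());
    StringBuilder important = new StringBuilder();
    collectHeadings(doc, important);              // walk the DOM tree
    if (important.length() > 0) {
      Metadata meta = parse.getData().getParseMeta();
      meta.add(META_KEY, important.toString());   // original text stays as-is
    }
    return parseResult;
  }

  // Recursively append the text content of every <h1> element.
  private void collectHeadings(Node node, StringBuilder out) {
    if (node.getNodeType() == Node.ELEMENT_NODE
        && "h1".equalsIgnoreCase(node.getNodeName())) {
      out.append(node.getTextContent()).append(' ');
    }
    NodeList children = node.getChildNodes();
    for (int i = 0; i < children.getLength(); i++) {
      collectHeadings(children.item(i), out);
    }
  }

  public void setConf(Configuration conf) { this.conf = conf; }
  public Configuration getConf() { return conf; }
}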
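
For the second step, a matching IndexingFilter sketch, again written against the Nutch 1.0-era interface. The addIndexBackendOptions() / LuceneWriter part is specific to that version's Lucene backend and was dropped in later releases, so check your own tree; the field name and store/index options are just assumptions:

// Sketch: copy the extracted parts from ParseData metadata into a
// dedicated index field, next to the normal "content" field.
package org.example.nutch;                       // hypothetical package

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.Text;
import org.apache.nutch.crawl.CrawlDatum;
import org.apache.nutch.crawl.Inlinks;
import org.apache.nutch.indexer.IndexingException;
import org.apache.nutch.indexer.IndexingFilter;
import org.apache.nutch.indexer.NutchDocument;
import org.apache.nutch.indexer.lucene.LuceneWriter;
import org.apache.nutch.parse.Parse;

public class ImportantPartsIndexingFilter implements IndexingFilter {

  public static final String FIELD = "importantparts"; // hypothetical field

  private Configuration conf;

  public NutchDocument filter(NutchDocument doc, Parse parse, Text url,
      CrawlDatum datum, Inlinks inlinks) throws IndexingException {
    // key written by the parse filter sketch above
    String parts = parse.getData().getParseMeta().get("x-important-parts");
    if (parts != null) {
      doc.add(FIELD, parts);
    }
    return doc;
  }

  // Nutch 1.0-era hook: tell the Lucene backend how to handle the field.
  public void addIndexBackendOptions(Configuration conf) {
    LuceneWriter.addFieldOptions(FIELD, LuceneWriter.STORE.NO,
        LuceneWriter.INDEX.TOKENIZED, conf);
  }

  public void setConf(Configuration conf) { this.conf = conf; }
  public Configuration getConf() { return conf; }
}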
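
For the query side, a minimal sketch assuming the legacy Nutch searcher: extending FieldQueryFilter is the easiest way to get the BooleanClause (with a boost) added to the translated query, and the plugin.xml fragment registers the field with the query parser. Plugin id, field name and boost value are invented; compare with the bundled query-basic / query-site (or creativecommons) plugins for the exact parameter names:

<!-- fragment of a hypothetical plugin.xml -->
<extension id="org.example.nutch.importantparts"
           name="Important Parts Query Filter"
           point="org.apache.nutch.searcher.QueryFilter">
  <implementation id="ImportantPartsQueryFilter"
                  class="org.example.nutch.ImportantPartsQueryFilter">
    <!-- lets the parser accept "importantparts:foo" as a field query -->
    <parameter name="fields" value="importantparts"/>
  </implementation>
</extension>

// Sketch: FieldQueryFilter already turns matching clauses into
// BooleanClauses on the translated Lucene query, applying the boost.
package org.example.nutch;                       // hypothetical package

import org.apache.nutch.searcher.FieldQueryFilter;

public class ImportantPartsQueryFilter extends FieldQueryFilter {
  public ImportantPartsQueryFilter() {
    super("importantparts", 3.0f);               // hypothetical boost
  }
}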


--
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
