MilleBii wrote:
Andrzej,
The use case you are thinking of is: at the parsing stage, filter out garbage
content and index only the rest.
I have a different use case: I want to keep everything for standard indexing
_AND_ also extract parts to be indexed in a dedicated field (which will be
boosted at search time). In my case, certain parts of a document have more
importance than others.
So I would like either:
1. to access the HTML representation at indexing time... not possible, or I
did not find out how, or
2. to create a dual representation of the document: the plain, standard one
plus a filtered version.
I think option 2 is much better because it fits the model better and allows
for many other use cases.
Actually, the creativecommons plugin provides hints on how to do this, but to
be more explicit:
* in your HtmlParseFilter you need to extract from the DOM tree the parts
that you want and put them inside ParseData.metadata. This way you will
preserve both the original text and the special parts that you extracted
(first sketch below).
* in your IndexingFilter you retrieve those parts from ParseData.metadata
and add them as additional index fields (don't forget to specify the
indexing backend options) -- second sketch below.
* in your QueryFilter plugin.xml you declare that the QueryParser should
pass your special fields through without treating them as ordinary terms,
and in the implementation you create a BooleanClause to be added to the
translated query (third sketch below).
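
To make the parse side concrete, here is a rough sketch against the 1.0-era
API. The field name "important" and the h1/h2 selection are only placeholders,
pick whatever matters in your pages, and double-check the signatures against
the version you actually run:

import org.apache.hadoop.conf.Configuration;
import org.apache.nutch.parse.HTMLMetaTags;
import org.apache.nutch.parse.HtmlParseFilter;
import org.apache.nutch.parse.Parse;
import org.apache.nutch.parse.ParseResult;
import org.apache.nutch.protocol.Content;
import org.w3c.dom.DocumentFragment;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

public class ImportantPartsParseFilter implements HtmlParseFilter {

  private Configuration conf;

  public ParseResult filter(Content content, ParseResult parseResult,
      HTMLMetaTags metaTags, DocumentFragment doc) {
    StringBuilder sb = new StringBuilder();
    collect(doc, sb);                                   // walk the DOM tree
    Parse parse = parseResult.get(content.getUrl());
    // keep the extracted text next to the normal parse text
    parse.getData().getParseMeta().set("important", sb.toString());
    return parseResult;
  }

  // placeholder rule: keep the text of h1/h2 elements only
  private void collect(Node node, StringBuilder sb) {
    String name = node.getNodeName().toLowerCase();
    if ("h1".equals(name) || "h2".equals(name)) {
      sb.append(node.getTextContent()).append(' ');
      return;
    }
    NodeList kids = node.getChildNodes();
    for (int i = 0; i < kids.getLength(); i++) {
      collect(kids.item(i), sb);
    }
  }

  public void setConf(Configuration conf) { this.conf = conf; }
  public Configuration getConf() { return conf; }
}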
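
The indexing side would then look roughly like this (again 1.0-style; the
addIndexBackendOptions() part is what I meant by the indexing backend
options -- if your version doesn't have that method, configure the backend
the way your other indexing plugins do):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.Text;
import org.apache.nutch.crawl.CrawlDatum;
import org.apache.nutch.crawl.Inlinks;
import org.apache.nutch.indexer.IndexingException;
import org.apache.nutch.indexer.IndexingFilter;
import org.apache.nutch.indexer.NutchDocument;
import org.apache.nutch.indexer.lucene.LuceneWriter;
import org.apache.nutch.parse.Parse;

public class ImportantPartsIndexingFilter implements IndexingFilter {

  private Configuration conf;

  public NutchDocument filter(NutchDocument doc, Parse parse, Text url,
      CrawlDatum datum, Inlinks inlinks) throws IndexingException {
    String important = parse.getData().getParseMeta().get("important");
    if (important != null) {
      doc.add("important", important);    // the field you will boost at search time
    }
    return doc;
  }

  // the "indexing backend options": store and tokenize the new field
  public void addIndexBackendOptions(Configuration conf) {
    LuceneWriter.addFieldOptions("important", LuceneWriter.STORE.YES,
        LuceneWriter.INDEX.TOKENIZED, conf);
  }

  public void setConf(Configuration conf) { this.conf = conf; }
  public Configuration getConf() { return conf; }
}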
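
For the query side the least work is to do what query-site does: extend
RawFieldQueryFilter, which builds the BooleanClause for you, and declare the
field in the plugin.xml of your query plugin. I'm writing the class and
parameter names from memory, so compare with the sources and plugin.xml
shipped with query-site before relying on them (IIRC there is also a
FieldQueryFilter base class that runs the terms through the analyzer, which
may suit a tokenized field better):

import org.apache.hadoop.conf.Configuration;
import org.apache.nutch.searcher.RawFieldQueryFilter;

// turns "important:foo" query clauses into a boosted clause on the
// "important" index field
public class ImportantQueryFilter extends RawFieldQueryFilter {

  private Configuration conf;

  public ImportantQueryFilter() {
    super("important", 5.0f);            // example boost, tune to taste
  }

  public void setConf(Configuration conf) { this.conf = conf; }
  public Configuration getConf() { return conf; }
}

and in plugin.xml something like:

<extension id="org.example.important"
           name="Important Field Query Filter"
           point="org.apache.nutch.searcher.QueryFilter">
  <implementation id="ImportantQueryFilter"
                  class="org.example.ImportantQueryFilter">
    <parameter name="fields" value="important"/>
  </implementation>
</extension>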
--
Best regards,
Andrzej Bialecki <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com