Andrzej,

The use case you have in mind is: at the parsing stage, filter out garbage
content and index only the rest.

I have a different use case: I want to keep everything for standard indexing
_AND_ also extract some parts to be indexed in a dedicated field (which will
be boosted at search time). In my case, certain parts of a document are more
important than others.

So I would like to either:
1. access the HTML representation at indexing time (not possible, or I did
not find how), or
2. create a dual representation of the document: the plain, standard one
and a filtered one

I think option 2 is much better because it fits the model better and allows
for many other use cases.
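
A minimal sketch of what option 2 could look like, using only the JDK's DOM
classes. It is not Nutch code: the class name DualRepresentation, its fields
and the extract/parse helpers are hypothetical names made up for illustration,
and the sample input is assumed to be well-formed XHTML. In a real plugin the
DOM would presumably be handed over by the HTML parser instead of being parsed
by hand. The idea is one pass over the tree that fills two buffers: the full
text for standard indexing, and the text found inside "important" tags for the
dedicated, boosted field.

```java
import java.io.StringReader;
import java.util.Locale;
import java.util.Set;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical holder for the two views of one document (not a Nutch class).
class DualRepresentation {
    final String fullText;     // everything, as in standard indexing
    final String boostedText;  // only text inside the important tags

    DualRepresentation(String fullText, String boostedText) {
        this.fullText = fullText;
        this.boostedText = boostedText;
    }

    // Walk the DOM once and build both representations.
    static DualRepresentation extract(Node root, Set<String> importantTags) {
        StringBuilder full = new StringBuilder();
        StringBuilder boosted = new StringBuilder();
        walk(root, importantTags, false, full, boosted);
        return new DualRepresentation(normalize(full), normalize(boosted));
    }

    private static void walk(Node node, Set<String> tags, boolean inside,
                             StringBuilder full, StringBuilder boosted) {
        if (node.getNodeType() == Node.TEXT_NODE) {
            full.append(node.getNodeValue()).append(' ');
            if (inside) {
                boosted.append(node.getNodeValue()).append(' ');
            }
            return;
        }
        // Once inside an important element, all descendants count as important.
        boolean nowInside = inside
                || (node.getNodeType() == Node.ELEMENT_NODE
                    && tags.contains(node.getNodeName().toLowerCase(Locale.ROOT)));
        for (Node c = node.getFirstChild(); c != null; c = c.getNextSibling()) {
            walk(c, tags, nowInside, full, boosted);
        }
    }

    // Collapse runs of whitespace and trim the ends.
    private static String normalize(CharSequence s) {
        return s.toString().trim().replaceAll("\\s+", " ");
    }

    // Helper for the demo only: parses well-formed XHTML into a DOM.
    static Document parse(String xhtml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("input is not well-formed", e);
        }
    }

    public static void main(String[] args) {
        Document doc = parse(
            "<html><body><h1>Nutch tips</h1><p>Plain body text.</p></body></html>");
        DualRepresentation r = extract(doc, Set.of("h1"));
        System.out.println(r.fullText);     // Nutch tips Plain body text.
        System.out.println(r.boostedText);  // Nutch tips
    }
}
```

The filtered text could then be carried along as parse metadata and copied
into its own field at indexing time, while the full text is indexed as usual.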

best regards,


2009/10/9 Andrzej Bialecki <a...@getopt.org>

> BELLINI ADAM wrote:
>
>> Hi,
>>
>> Thanks for your detailed answer... you saved me a lot of time. I was
>> thinking of starting to write an HTML tag filter class.
>> Maybe I can create my own HTML parser, as I do for parsing and indexing
>> DublinCore metadata. It sounds possible, don't you think?
>>
>> I just have to create, or find, a class that can filter an HTML
>> page and delete certain tags from it.
>>
>
> Guys, please take a look at how HtmlParseFilters are implemented - for
> example the creativecommons plugin. I believe that's exactly the
> functionality that you are looking for.
>
>
> --
> Best regards,
> Andrzej Bialecki     <><
>  ___. ___ ___ ___ _ _   __________________________________
> [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
> ___|||__||  \|  ||  |  Embedded Unix, System Integration
> http://www.sigram.com  Contact: info at sigram dot com
>
>


-- 
-MilleBii-
