Andrzej, the use case you are thinking of is: at the parsing stage, filter out garbage content and index only the rest.
I have a different use case: I want to keep everything for standard indexing _AND_ also extract some parts to be indexed in a dedicated field (which will be boosted at search time). In my case, certain parts of a document are more important than others.

So I would like either:

1. to access the HTML representation at indexing time... which is not possible, or I did not find how
2. to create a dual representation of the document: the plain, standard one and a filtered one

I think option 2 is much better because it fits the model better and allows for a lot of other use cases.

Best regards,

2009/10/9 Andrzej Bialecki <a...@getopt.org>

> BELLINI ADAM wrote:
>
>> HI
>>
>> hI THX FOR YOUR DETAILED ANSWER...you make me save lotofftime , i was
>> thinking to start to create an HTML tag filter class.
>> mabe i can create my own HTML parser ! as i do for parsing and indexing
>> DublinCore metadata...it sounds possible don't you think so ?
>>
>> i just hv to create also or to find a class which could filter an HTML
>> pages and delete certain tag from it
>>
>
> Guys, please take a look at how HtmlParseFilters are implemented - for
> example the creativecommons plugin. I believe that's exactly the
> functionality that you are looking for.
>
>
> --
> Best regards,
> Andrzej Bialecki <><
> ___. ___ ___ ___ _ _ __________________________________
> [__ || __|__/|__||\/| Information Retrieval, Semantic Web
> ___|||__|| \| || | Embedded Unix, System Integration
> http://www.sigram.com Contact: info at sigram dot com
>
>

--
-MilleBii-
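
P.S. To make option 2 a bit more concrete, here is a rough sketch of what I mean by a dual representation. It is plain Python using only the standard library, not Nutch code, and it does not use the HtmlParseFilter API. The tag list and the field names `content` and `important` are made up for illustration; in practice the "important" parts would be whatever matters in your documents.

```python
from html.parser import HTMLParser

# Hypothetical choice: tags whose text counts as "important".
IMPORTANT_TAGS = {"h1", "h2"}


class DualTextExtractor(HTMLParser):
    """Collects all text and, separately, text found inside IMPORTANT_TAGS."""

    def __init__(self):
        super().__init__()
        self.all_text = []
        self.important_text = []
        self._depth = 0  # number of important tags currently open

    def handle_starttag(self, tag, attrs):
        if tag in IMPORTANT_TAGS:
            self._depth += 1

    def handle_endtag(self, tag):
        if tag in IMPORTANT_TAGS and self._depth > 0:
            self._depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        # Every piece of text goes into the standard representation.
        self.all_text.append(text)
        # Text inside an important tag also goes into the filtered one.
        if self._depth > 0:
            self.important_text.append(text)


def dual_representation(html):
    """Return both representations as two hypothetical index fields."""
    parser = DualTextExtractor()
    parser.feed(html)
    parser.close()
    return {
        "content": " ".join(parser.all_text),
        "important": " ".join(parser.important_text),
    }
```

The full text stays untouched in one field, and the filtered text goes into a second field that could then be boosted at search time.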