On Fri, 9 Oct 2009 18:00:41 +0200
MilleBii <mille...@gmail.com> wrote:

> Don't think it will work because at the indexing filter stage all
> the HTML tags are gone from the text.
> 
> I think you need to modify the HTML parser to filter out the tags
> you want to get rid of.
> 
> In one of my use cases I would like to perform 'intelligent
> indexing', i.e., use the tag information to extract specific fields
> to be indexed along with the main text. The reverse of your case.
> To date I have not found a way to do it.
> So if you find a solution I'm with you.
[...]

This is something that we would also be interested in. Actually,
we even have a working solution to extract content from between
start/stop tags, written by our colleagues from a partner company.

There are a couple of things that we would like to fix with this
solution:
(a) It directly modifies HtmlParser.java, which is obviously
    unmaintainable.
(b) It is a solution for specific tags, rather than picking them
    up from configuration parameters.
(c) We have not yet traced the complete execution path for Nutch,
    i.e., when the parser is called, when the filters are called, etc.
    Is there a document anywhere about this? We were thinking of a
    filter, but from what you say above, that is the wrong stage.
(d) Ideally, whatever solution we come up with would be contributed
    back to Nutch, which also helps us from a maintenance
    standpoint. Is there a defined process for getting external
    plugins accepted into Nutch?
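
To make point (b) concrete, here is a minimal sketch of what
configuration-driven extraction between start/stop markers might look
like. It is plain Java and does not use any Nutch API. The class name
MarkerExtractor and the marker strings are hypothetical and are not
taken from our partner's code. It also works on the raw page text, so
it would have to run before the tags are stripped.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: collects the text found between a start marker
// and a stop marker, both supplied by the caller instead of hard-coded.
final class MarkerExtractor {
    private final Pattern pattern;

    // startMarker and stopMarker are assumed to come from configuration
    // (for example, two string properties read at startup).
    MarkerExtractor(String startMarker, String stopMarker) {
        // Pattern.quote makes the markers match literally.
        // The reluctant group (.*?) ends at the first stop marker,
        // and DOTALL lets the captured content span several lines.
        this.pattern = Pattern.compile(
            Pattern.quote(startMarker) + "(.*?)" + Pattern.quote(stopMarker),
            Pattern.DOTALL);
    }

    // Returns every marked section of the raw page, in document order,
    // with surrounding whitespace trimmed.
    List<String> extract(String rawHtml) {
        List<String> sections = new ArrayList<>();
        Matcher m = pattern.matcher(rawHtml);
        while (m.find()) {
            sections.add(m.group(1).trim());
        }
        return sections;
    }
}
```

Constructing it as new MarkerExtractor("<!-- start -->", "<!-- stop -->")
and calling extract() on a page would return only the text between each
pair of those comments. A regular expression is used here only to keep
the sketch short; a real plugin would more likely walk the parsed DOM.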

We are willing to put some time into this, starting this coming
week. Where can we start a brainstorming Wiki for this? Is the
Nutch Wiki the right place?

Regards,
Gora
