Hi all,

 

I wrote my own parse plugin, which implements HtmlParseFilter to parse the
HTML content. In the 'filter' method I add some additional parsed data to
the content metadata. All of this works correctly when I run 'nutch crawl'.
The problem occurs when I re-parse the content: I delete the crawl_parse,
parse_data and parse_text folders and then run 'nutch parse'. After that I
find duplicated data in the content metadata. The segment and digest
entries in the content metadata are not duplicated; only the values added
by my custom parse plugin are.
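
For reference, the re-parse steps I run look roughly like the sketch below.
The segment path is a placeholder, and NUTCH_BIN and reparse_segment are
just names used here for illustration, not anything from Nutch itself.

```shell
#!/bin/sh
# Sketch of the re-parse steps described above.
# NUTCH_BIN and the segment path are placeholders (assumptions), adjust to your install.
NUTCH_BIN="${NUTCH_BIN:-bin/nutch}"

reparse_segment() {
    segment="$1"
    # Remove the output of the previous parse; content/ and the other
    # segment folders are left untouched.
    rm -rf "$segment/crawl_parse" "$segment/parse_data" "$segment/parse_text"
    # Parse the stored content of the segment again.
    "$NUTCH_BIN" parse "$segment"
}

# Usage: reparse_segment crawl/segments/<segment-name>
```

Only the three parse output folders are deleted; the fetched content itself
is reused as-is by the second parse.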

 

Did I write the parse plugin the wrong way, or am I re-parsing the content
incorrectly?

         

Can someone help me?

Thanks a lot.
