Hi all,
I wrote my own parse plugin that implements HtmlParseFilter to parse the HTML content. In the 'filter' method I add some additional parsed data to the content metadata. All of this works correctly when I run 'nutch/crawl'.

The problem occurs when I want to re-parse the content. I delete the crawl_parse, parse_data, and parse_text folders and then run 'nutch parse'. Afterwards I find duplicated data in the content metadata. The segment and digest entries in the content metadata are not duplicated; only the data I added in my custom parse plugin is duplicated.

Did I write the parse plugin the wrong way, or am I not re-parsing the content correctly? Can someone help me? Thanks a lot.