HTML Support for jsoup-extractor in Nutch 2.x?

Michael Chen Wed, 02 Aug 2017 14:42:59 -0700

Hi,

I'm trying to use the new jsoup-extractor in Nutch 2.x but it gives "The markup in the document following the root element must be well-formed" error when I hand it HTML. I re-read the descriptions in NUTCH-2389 and it seems that it's designed to parse XML only.

I'm still quite new to Nutch so I wanted some opinions on this: should I try to implement HTML DOM building for jsoup-extractor, or is it too much work/not feasible in Nutch 2.x? Any suggestions would be greatly appreciated!


Go Nutch!

Michael
