[ https://issues.apache.org/jira/browse/NUTCH-1898?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14243194#comment-14243194 ]
Sebastian Nagel commented on NUTCH-1898:
----------------------------------------
The raw document can also be viewed with the following command (other protocol
implementations work similarly):
{noformat}
bin/nutch plugin protocol-http org.apache.nutch.protocol.http.Http <url>
{noformat}
But agreed: when tracking down problems it's always useful to look at the raw
HTML, and it's easier to work with a few powerful debugging tools.
Or does "raw" mean the serialized DOM, which (1) is also available for binary,
non-HTML document formats, (2) may look slightly different, and (3) isn't
easily viewed with any other tool?
> Add -dumpRawHTML parameter to parsechecker tool
> -----------------------------------------------
>
> Key: NUTCH-1898
> URL: https://issues.apache.org/jira/browse/NUTCH-1898
> Project: Nutch
> Issue Type: Bug
> Components: parser
> Affects Versions: 1.9, 2.2.1
> Reporter: Lewis John McGibbney
> Priority: Minor
> Fix For: 2.4, 1.10
>
>
> The ability to obtain raw HTML alongside all of the other parse data we get
> within the existing parsechecker would complement the tool.
> This issue should merely append the raw HTML markup to the existing output.
> It should be an optional parameter, same as -dumpText.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)