[ https://issues.apache.org/jira/browse/NUTCH-2191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15193511#comment-15193511 ]
Markus Jelsma commented on NUTCH-2191:
--------------------------------------
Karanjeet, thanks for your contribution :)
Some notes on the patch before it gets committed:
* obvious System.out calls need to be removed
* we need to consider whether we actually need
NicelyResynchronizingAjaxController
* timeouts in HttpResponse need to be configurable at least
* CSS and JavaScript support should be configurable: CSS disabled by default,
JavaScript enabled by default
* the plugin needs to be listed in default.properties and build.xml
* writing a screenshot directly to disk is odd when running on Hadoop/HDFS; it
would be better to use Hadoop's IO so we can write it to local disk or HDFS
transparently
* finally, we still need to address the redirect problem I described in
HttpResponse.
{quote}
+ // Do not follow redirects so we can check response from outlinks
+ // If we don't follow redirects, we get an exception FailingHttpStatusCodeException: 302 Moved Temporarily
+ client.getOptions().setRedirectEnabled(true); // If we disable this, the referenced hyperlinks are not followed, causing trouble loading JS, assets and stuff, but this also allows the input URL to be redirected without Nutch knowing it
{quote}
This is a serious problem and must be addressed, as it would completely mess up
the crawldb: content fetched from a redirect target would end up recorded under
the original input URL. This should be doable by overriding
WebClient.download(), IIRC.
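To make the intended behaviour concrete, here is a minimal sketch of the decision I have in mind. It uses no HtmlUnit classes at all; RedirectPolicy, Action and decide() are hypothetical names for illustration only, and where exactly this check would hook into HtmlUnit is still open. The idea is that a redirect on the input URL itself is handed back to Nutch, while redirects on referenced resources (JS, assets) may be followed internally.

```java
import java.net.URI;

// Hypothetical helper, not part of the patch or of HtmlUnit.
class RedirectPolicy {

    /** What the protocol plugin should do with a response. */
    enum Action { RETURN_TO_NUTCH, FOLLOW_INTERNALLY, PROCESS }

    /** True for the HTTP status codes that signal a redirect. */
    static boolean isRedirect(int status) {
        return status == 301 || status == 302 || status == 303
            || status == 307 || status == 308;
    }

    /**
     * inputUrl   - the URL Nutch asked the plugin to fetch
     * requestUrl - the URL of the request that produced this response
     * status     - the HTTP status code of the response
     */
    static Action decide(String inputUrl, String requestUrl, int status) {
        if (!isRedirect(status)) {
            return Action.PROCESS;
        }
        URI input = URI.create(inputUrl).normalize();
        URI request = URI.create(requestUrl).normalize();
        // A redirect of the input URL must be visible to Nutch so the
        // crawldb records it; redirects of sub-resources are harmless.
        return request.equals(input) ? Action.RETURN_TO_NUTCH
                                     : Action.FOLLOW_INTERNALLY;
    }

    public static void main(String[] args) {
        System.out.println(decide("http://example.com/a", "http://example.com/a", 302));
        System.out.println(decide("http://example.com/a", "http://example.com/app.js", 302));
        System.out.println(decide("http://example.com/a", "http://example.com/a", 200));
    }
}
```

With something like this in place, setRedirectEnabled(true) could stay on for sub-resources without hiding redirects of the input URL from Nutch.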
And there are probably some other things I overlooked :)
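Regarding the configurability notes in the list above, a rough sketch of the defaults I would expect. It is plain Java on top of java.util.Properties so it stands alone; in the plugin the values would instead be read from the Hadoop Configuration (conf.getBoolean(), conf.getInt()) and applied to the WebClient options via setCssEnabled(), setJavaScriptEnabled() and setTimeout(). The property names and the 30 second default timeout are assumptions of mine, not something the patch defines.

```java
import java.util.Properties;

// Hypothetical settings holder; property names are illustrative only.
class HtmlUnitSettings {

    static final String CSS_ENABLED = "htmlunit.css.enabled";
    static final String JS_ENABLED = "htmlunit.javascript.enabled";
    static final String TIMEOUT_MS = "htmlunit.timeout.ms";

    final boolean cssEnabled;
    final boolean javaScriptEnabled;
    final int timeoutMs;

    HtmlUnitSettings(Properties props) {
        // CSS disabled by default, JavaScript enabled by default.
        cssEnabled = Boolean.parseBoolean(props.getProperty(CSS_ENABLED, "false"));
        javaScriptEnabled = Boolean.parseBoolean(props.getProperty(JS_ENABLED, "true"));
        // Assumed default of 30000 ms; the right value is open for discussion.
        timeoutMs = Integer.parseInt(props.getProperty(TIMEOUT_MS, "30000"));
    }

    public static void main(String[] args) {
        Properties props = new Properties();
        props.setProperty(TIMEOUT_MS, "5000"); // override a single value
        HtmlUnitSettings s = new HtmlUnitSettings(props);
        System.out.println("css=" + s.cssEnabled
            + " js=" + s.javaScriptEnabled
            + " timeoutMs=" + s.timeoutMs);
    }
}
```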
> Add protocol-htmlunit
> ---------------------
>
> Key: NUTCH-2191
> URL: https://issues.apache.org/jira/browse/NUTCH-2191
> Project: Nutch
> Issue Type: New Feature
> Components: protocol
> Affects Versions: 1.11
> Reporter: Markus Jelsma
> Assignee: Chris A. Mattmann
> Fix For: 1.12
>
> Attachments: NUTCH-2191.patch, NUTCH-2191.patch, NUTCH-2191.patch
>
>
> HtmlUnit is, as opposed to other JavaScript-enabled headless browsers, a
> portable library and should therefore be better suited for very large-scale
> crawls. This issue is an attempt to implement protocol-htmlunit.