[ 
https://issues.apache.org/jira/browse/NUTCH-2191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15193511#comment-15193511
 ] 

Markus Jelsma commented on NUTCH-2191:
--------------------------------------

Karanjeet, thanks for your contribution :)

Some notes on the patch before it gets committed:

* obvious System.out calls need to be removed
* we need to consider whether we actually need 
NicelyResynchronizingAjaxController
* timeouts in HttpResponse need to be configurable at least
* CSS and JavaScript support should each be configurable, CSS disabled by 
default, JavaScript enabled by default
* plugin needs to be listed in default.properties and build.xml
* writing a screenshot directly to disk is odd when running on Hadoop/HDFS, it 
would be better to use Hadoop's IO so we can write to local disk or HDFS 
transparently
* finally, we still need to address the redirect problem I described in 
HttpResponse.

{quote}
+    // Do not follow redirects so we can check response from outlinks
+    // If we don't follow redirects, we get an exception 
FailingHttpStatusCodeException: 302 Moved Temporarily
+    client.getOptions().setRedirectEnabled(true);  // If we disable this, the 
referenced hyperlinks are not followed, causing trouble loading JS, assets and 
stuff, but this also allows the input URL to be redirected without Nutch 
knowing it
{quote}

This is a serious problem and must be addressed, as silently redirecting the 
input URL would completely mess up the crawldb. This should be doable by 
overriding WebClient.download(), iirc.
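
To make the intent concrete (this is not the WebClient.download() override itself): the response for the input URL could be checked on its own, so a 3xx is handed back to Nutch instead of being followed, while redirects for JS and other assets stay enabled. A rough sketch in plain Java; RedirectCheck and redirectTarget are made-up names for illustration, and nothing here is HtmlUnit or Nutch API:

```java
import java.net.URI;
import java.util.Optional;

/** Hypothetical helper, not part of the patch, HtmlUnit or Nutch. */
final class RedirectCheck {
  private RedirectCheck() {}

  /**
   * Returns the absolute redirect target when the response to the input URL
   * is a redirect with a usable Location header, otherwise empty. The caller
   * would report the target to Nutch rather than fetch it transparently.
   */
  static Optional<URI> redirectTarget(URI requested, int status, String location) {
    boolean isRedirect = status == 301 || status == 302 || status == 303
        || status == 307 || status == 308;
    if (!isRedirect || location == null || location.trim().isEmpty()) {
      return Optional.empty();
    }
    // Location may be relative, so resolve it against the requested URL.
    // URI.resolve(String) throws IllegalArgumentException on a malformed value.
    return Optional.of(requested.resolve(location.trim()));
  }
}
```

With something like this, a 302 on the input URL becomes a value the plugin can turn into a redirect status for Nutch, rather than a page silently stored under the wrong URL.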

And probably some other things I overlooked :)
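
For the configurability notes above, a minimal sketch of what I mean. It uses java.util.Properties as a stand-in so it runs on its own; in the plugin the values would presumably come from the Hadoop Configuration (getInt / getBoolean). The property names and the 10000 ms default are assumptions for illustration, not existing Nutch settings:

```java
import java.util.Properties;

/** Hypothetical settings holder; property names are illustrative only. */
final class HtmlUnitSettings {
  final int timeoutMillis;
  final boolean cssEnabled;
  final boolean javaScriptEnabled;

  HtmlUnitSettings(Properties conf) {
    // Assumed default timeout; the patch currently hard-codes its timeouts.
    timeoutMillis = Integer.parseInt(conf.getProperty("htmlunit.timeout.ms", "10000"));
    // CSS disabled by default, JavaScript enabled by default, as suggested above.
    cssEnabled = Boolean.parseBoolean(conf.getProperty("htmlunit.css.enabled", "false"));
    javaScriptEnabled =
        Boolean.parseBoolean(conf.getProperty("htmlunit.javascript.enabled", "true"));
  }
}
```

The resulting values would then be applied to the WebClient options when the client is created.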

> Add protocol-htmlunit
> ---------------------
>
>                 Key: NUTCH-2191
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2191
>             Project: Nutch
>          Issue Type: New Feature
>          Components: protocol
>    Affects Versions: 1.11
>            Reporter: Markus Jelsma
>            Assignee: Chris A. Mattmann
>             Fix For: 1.12
>
>         Attachments: NUTCH-2191.patch, NUTCH-2191.patch, NUTCH-2191.patch
>
>
> HtmlUnit is, as opposed to other JavaScript-enabled headless browsers, a 
> portable library and should therefore be better suited for very large scale 
> crawls. This issue is an attempt to implement protocol-htmlunit.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
