[
https://issues.apache.org/jira/browse/NUTCH-1270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13245259#comment-13245259
]
behnam nikbakht commented on NUTCH-1270:
For example, with this site:
http://www.noormags.com/view/fa/default
when I fetch the first page and dump the segment, I see that there is a problem
with the fetch.
When I replace
byte[] content = DeflateUtils.inflateBestEffort(compressed, getMaxContent());
with
byte[] content = DeflateUtils.inflateBestEffort(compressed, 999);
it works.
Some Deflate-encoded pages are not fetched
-
Key: NUTCH-1270
URL: https://issues.apache.org/jira/browse/NUTCH-1270
Project: Nutch
Issue Type: Bug
Components: fetcher
Affects Versions: 1.4
Environment: software
Reporter: behnam nikbakht
Labels: fetch, processDeflateEncoded
Attachments: NUTCH-1270.patch
Some web pages are fetched, but their content cannot be retrieved.
After the following change, this error is fixed.
We changed lib-http/src/java/org/apache/nutch/protocol/http/api/HttpBase.java:
public byte[] processDeflateEncoded(byte[] compressed, URL url) throws IOException {
  if (LOGGER.isTraceEnabled()) { LOGGER.trace("inflating"); }
  byte[] content = DeflateUtils.inflateBestEffort(compressed, getMaxContent());
+ if (content == null)
+   content = DeflateUtils.inflateBestEffort(compressed, 20);
  if (content == null)
    throw new IOException("inflateBestEffort returned null");
  if (LOGGER.isTraceEnabled()) {
    LOGGER.trace("fetched " + compressed.length
        + " bytes of compressed content (expanded to "
        + content.length + " bytes) from " + url);
  }
  return content;
}
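To make the role of the size limit argument easier to reason about, here is a minimal, self-contained sketch of a best-effort inflate built only on java.util.zip. It is not the Nutch DeflateUtils implementation: the class InflateSketch, its methods, and the choice of raw Deflate data (nowrap = true) are assumptions made for illustration. It shows how a limit caps the amount of decoded output and how a truncated stream still yields the bytes decoded so far.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

// Hypothetical helper class, not part of Nutch.
final class InflateSketch {

  // Inflates at most sizeLimit bytes. On malformed or truncated input it
  // returns whatever was decoded before the problem was hit.
  static byte[] inflateBestEffort(byte[] in, boolean nowrap, int sizeLimit) {
    Inflater inflater = new Inflater(nowrap);
    inflater.setInput(in);
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    byte[] buf = new byte[1024];
    try {
      while (!inflater.finished() && out.size() < sizeLimit) {
        int want = Math.min(buf.length, sizeLimit - out.size());
        int n = inflater.inflate(buf, 0, want);
        if (n == 0 && !inflater.finished()) {
          break; // no progress: input is truncated or needs a dictionary
        }
        out.write(buf, 0, n);
      }
    } catch (DataFormatException e) {
      // Best effort: keep the bytes decoded before the error.
    } finally {
      inflater.end();
    }
    return out.toByteArray();
  }

  // Compresses data so the sketch can be exercised without a network fetch.
  static byte[] deflate(byte[] in, boolean nowrap) {
    Deflater deflater = new Deflater(Deflater.DEFAULT_COMPRESSION, nowrap);
    deflater.setInput(in);
    deflater.finish();
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    byte[] buf = new byte[1024];
    while (!deflater.finished()) {
      int n = deflater.deflate(buf);
      out.write(buf, 0, n);
    }
    deflater.end();
    return out.toByteArray();
  }

  public static void main(String[] args) {
    StringBuilder sb = new StringBuilder();
    for (int i = 0; i < 200; i++) {
      sb.append("line ").append(i).append('\n');
    }
    byte[] page = sb.toString().getBytes(StandardCharsets.UTF_8);
    byte[] compressed = deflate(page, true);

    byte[] full = inflateBestEffort(compressed, true, 64 * 1024);
    byte[] capped = inflateBestEffort(compressed, true, 20);
    byte[] partial = inflateBestEffort(
        Arrays.copyOf(compressed, compressed.length / 2), true, 64 * 1024);

    System.out.println("original bytes:  " + page.length);
    System.out.println("full inflate:    " + full.length);
    System.out.println("limit of 20:     " + capped.length);
    System.out.println("truncated input: " + partial.length);
  }
}
```

In this sketch a small limit such as 20 returns only the first 20 decoded bytes, so a fallback with a tiny limit would recover a prefix of the page rather than the whole page. Whether the same holds for DeflateUtils.inflateBestEffort depends on its actual implementation and on the value getMaxContent() returns.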