[jira] [Commented] (NUTCH-1785) Ability to index raw content

2015-07-30 Thread Chris A. Mattmann (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1785?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14648294#comment-14648294
 ] 

Chris A. Mattmann commented on NUTCH-1785:
--

+1 to commit from me.

 Ability to index raw content
 

 Key: NUTCH-1785
 URL: https://issues.apache.org/jira/browse/NUTCH-1785
 Project: Nutch
  Issue Type: New Feature
  Components: indexer
Reporter: Markus Jelsma
Assignee: Markus Jelsma
Priority: Minor
 Fix For: 1.11

 Attachments: NUTCH-1785-trunk.patch, NUTCH-1785-trunk.patch, 
 NUTCH-1785-trunk.patch, NUTCH-1785-trunk.patch, NUTCH-1785-trunkv2.patch


 Some use-cases require Nutch to actually write the raw content to a configured 
 indexing back-end. Since Content is never read, a plugin is out of the 
 question and therefore we need to force IndexJob to process Content as well.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (NUTCH-1785) Ability to index raw content

2015-07-30 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1785?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney resolved NUTCH-1785.
-
Resolution: Fixed

Committed revision 1693507



[jira] [Updated] (NUTCH-1785) Ability to index raw content

2015-07-30 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1785?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-1785:

Attachment: NUTCH-1785-trunkv2.patch

This works perfectly for me locally. I would like to commit EoB today if no 
objections.



[jira] [Comment Edited] (NUTCH-1785) Ability to index raw content

2015-07-30 Thread Lewis John McGibbney (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1785?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14648249#comment-14648249
 ] 

Lewis John McGibbney edited comment on NUTCH-1785 at 7/30/15 8:21 PM:
--

This works perfectly for me locally. I would like to commit EoB today if no 
objections. Excellent work [~markus17]



was (Author: lewismc):
This works perfectly for me locally. I would like to commit EoB today if no 
objections.



[jira] [Commented] (NUTCH-1785) Ability to index raw content

2015-07-30 Thread Thad Guidry (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1785?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14648278#comment-14648278
 ] 

Thad Guidry commented on NUTCH-1785:


[~lewismc] No objections. It also worked perfectly for me as well.  Have been 
using it for a few months now pushing into ElasticSearch.



[jira] [Commented] (NUTCH-1785) Ability to index raw content

2015-07-30 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1785?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14648380#comment-14648380
 ] 

Hudson commented on NUTCH-1785:
---

SUCCESS: Integrated in Nutch-trunk #3233 (See 
[https://builds.apache.org/job/Nutch-trunk/3233/])
NUTCH-1785 Ability to index raw content (lewismc: 
http://svn.apache.org/viewvc/nutch/trunk/?view=rev&rev=1693507)
* /nutch/trunk/CHANGES.txt
* /nutch/trunk/conf/schema-solr4.xml
* /nutch/trunk/conf/schema.xml
* /nutch/trunk/ivy/ivy.xml
* /nutch/trunk/src/java/org/apache/nutch/indexer/IndexerMapReduce.java
* /nutch/trunk/src/java/org/apache/nutch/indexer/IndexingJob.java




[jira] [Updated] (NUTCH-2071) A parser failure on a single document may fail crawling job

2015-07-30 Thread Arkadi Kosmynin (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2071?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Arkadi Kosmynin updated NUTCH-2071:
---
Attachment: NUTCH-2071.diff

  A parser failure on a single document may fail crawling job
 

 Key: NUTCH-2071
 URL: https://issues.apache.org/jira/browse/NUTCH-2071
 Project: Nutch
  Issue Type: Bug
  Components: parser
Reporter: Arkadi Kosmynin
 Attachments: NUTCH-2071.diff


 java.io.IOException: Job failed!
   at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1357)
   at org.apache.nutch.parse.ParseSegment.parse(ParseSegment.java:213)
   ...
 Caused by: java.lang.IncompatibleClassChangeError: class org.apache.tika.parser.asm.XHTMLClassVisitor has interface org.objectweb.asm.ClassVisitor as super class
   at java.lang.ClassLoader.defineClass1(Native Method)
   at java.lang.ClassLoader.defineClass(ClassLoader.java:760)
   at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
   at java.net.URLClassLoader.defineClass(URLClassLoader.java:467)
   at java.net.URLClassLoader.access$100(URLClassLoader.java:73)
   at java.net.URLClassLoader$1.run(URLClassLoader.java:368)
   at java.net.URLClassLoader$1.run(URLClassLoader.java:362)
   at java.security.AccessController.doPrivileged(Native Method)
   at java.net.URLClassLoader.findClass(URLClassLoader.java:361)
   at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
   at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
   at org.apache.tika.parser.asm.ClassParser.parse(ClassParser.java:51)
   at org.apache.nutch.parse.tika.TikaParser.getParse(TikaParser.java:98)
   at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:103)

 Suggested fix in ParseUtil:

 Replace

 if (maxParseTime != -1)
   parseResult = runParser(parsers[i], content);
 else
   parseResult = parsers[i].getParse(content);

 with

 try
 {
   if (maxParseTime != -1)
     parseResult = runParser(parsers[i], content);
   else
     parseResult = parsers[i].getParse(content);
 } catch (Throwable e)
 {
   LOG.warn("Parsing " + content.getUrl() + " with "
       + parsers[i].getClass().getName() + " failed: " + e.getMessage());
   parseResult = null;
 }
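
 Note on the trace above: the message "has interface org.objectweb.asm.ClassVisitor as super class" usually points to an ASM version conflict. ASM 3.x defines ClassVisitor as an interface, while ASM 4.0 and later define it as an abstract class, so a Tika class compiled against the newer ASM fails to load when an older ASM jar wins on the classpath. Whether that is the cause in this report is an assumption; the following minimal sketch (the class name ClassOriginCheck and the method describe are made up for illustration) shows one way to check which variant is visible at runtime and where it was loaded from.

```java
import java.security.CodeSource;

class ClassOriginCheck {

    /** Reports whether the named type is an interface or a class, and its origin. */
    static String describe(String className) {
        try {
            // Load without initializing, so static initializers are not triggered.
            Class<?> c = Class.forName(className, false,
                    ClassOriginCheck.class.getClassLoader());
            CodeSource src = c.getProtectionDomain().getCodeSource();
            String origin = (src == null || src.getLocation() == null)
                    ? "bootstrap/unknown"
                    : src.getLocation().toString();
            return className + " is "
                    + (c.isInterface() ? "an interface" : "a class")
                    + ", loaded from " + origin;
        } catch (ClassNotFoundException | LinkageError e) {
            return className + " could not be loaded: " + e;
        }
    }

    public static void main(String[] args) {
        // On a classpath with ASM 3.x this should report "an interface".
        System.out.println(describe("org.objectweb.asm.ClassVisitor"));
    }
}
```

 Running this with the same classpath as the failing job would show which ASM jar is actually picked up.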





[jira] [Updated] (NUTCH-2071) A parser failure on a single document may fail crawling job

2015-07-30 Thread Arkadi Kosmynin (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2071?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Arkadi Kosmynin updated NUTCH-2071:
---
 Flags: Patch
Patch Info: Patch Available



[jira] [Created] (NUTCH-2071) A parser failure on a single document may fail crawling job

2015-07-30 Thread Arkadi Kosmynin (JIRA)
Arkadi Kosmynin created NUTCH-2071:
--

 Summary: A parser failure on a single document may fail crawling job
 Key: NUTCH-2071
 URL: https://issues.apache.org/jira/browse/NUTCH-2071
 Project: Nutch
  Issue Type: Bug
  Components: parser
Reporter: Arkadi Kosmynin


java.io.IOException: Job failed!
  at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1357)
  at org.apache.nutch.parse.ParseSegment.parse(ParseSegment.java:213)
  ...
Caused by: java.lang.IncompatibleClassChangeError: class org.apache.tika.parser.asm.XHTMLClassVisitor has interface org.objectweb.asm.ClassVisitor as super class
  at java.lang.ClassLoader.defineClass1(Native Method)
  at java.lang.ClassLoader.defineClass(ClassLoader.java:760)
  at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
  at java.net.URLClassLoader.defineClass(URLClassLoader.java:467)
  at java.net.URLClassLoader.access$100(URLClassLoader.java:73)
  at java.net.URLClassLoader$1.run(URLClassLoader.java:368)
  at java.net.URLClassLoader$1.run(URLClassLoader.java:362)
  at java.security.AccessController.doPrivileged(Native Method)
  at java.net.URLClassLoader.findClass(URLClassLoader.java:361)
  at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
  at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
  at org.apache.tika.parser.asm.ClassParser.parse(ClassParser.java:51)
  at org.apache.nutch.parse.tika.TikaParser.getParse(TikaParser.java:98)
  at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:103)

Suggested fix in ParseUtil:

Replace

if (maxParseTime != -1)
  parseResult = runParser(parsers[i], content);
else
  parseResult = parsers[i].getParse(content);

with

try
{
  if (maxParseTime != -1)
    parseResult = runParser(parsers[i], content);
  else
    parseResult = parsers[i].getParse(content);
} catch (Throwable e)
{
  LOG.warn("Parsing " + content.getUrl() + " with "
      + parsers[i].getClass().getName() + " failed: " + e.getMessage());
  parseResult = null;
}
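
The suggested fix catches Throwable rather than Exception. That matters here because IncompatibleClassChangeError is a subclass of Error, so a catch (Exception e) block would not intercept it. Below is a minimal, self-contained sketch of the same isolation idea outside Nutch. The names ParseIsolationSketch and parseWithFallback are hypothetical, and parsers are modelled as plain functions from String to String rather than Nutch Parser instances.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

class ParseIsolationSketch {

    /**
     * Tries each parser in order and returns the first non-null result.
     * A parser that throws anything, including an Error, is recorded in
     * warnings and skipped instead of aborting the whole run.
     */
    static String parseWithFallback(List<Function<String, String>> parsers,
                                    String content, List<String> warnings) {
        for (Function<String, String> parser : parsers) {
            try {
                String result = parser.apply(content);
                if (result != null) {
                    return result;
                }
            } catch (Throwable t) {
                warnings.add("Parsing with " + parser.getClass().getName()
                        + " failed: " + t);
            }
        }
        return null; // every parser failed or returned nothing
    }

    public static void main(String[] args) {
        List<Function<String, String>> parsers = new ArrayList<>();
        // Simulates the linkage failure from the stack trace.
        parsers.add(c -> { throw new IncompatibleClassChangeError("simulated"); });
        parsers.add(c -> c.toUpperCase());

        List<String> warnings = new ArrayList<>();
        System.out.println(parseWithFallback(parsers, "doc", warnings)); // DOC
        System.out.println(warnings.size());                             // 1
    }
}
```

One trade-off to keep in mind: catching Throwable also swallows errors such as OutOfMemoryError, so logging the full throwable rather than only its message may be worth considering.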






[jira] [Commented] (NUTCH-2069) Ignore external links based on domain

2015-07-30 Thread Julien Nioche (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14647467#comment-14647467
 ] 

Julien Nioche commented on NUTCH-2069:
--

Hi [~wastl-nagel] and [~markus17].  BTW did not mean to be short in my previous 
message but was typing from my phone ;-)
I know the difficulties of enforcing the code formatting systematically, but I 
thought I might as well fix it while I was working on that part of the code. 
Feel free to remove the bits from the patch that are about the formatting only.

bq. we could define this as two properties `db.ignore.external.links` + 
`db.ignore.external.links.mode`. The latter can be host or domain, similar 
to other properties (partition.url.mode, generator.count.mode, 
fetcher.queue.mode). That would be extensible and can make the code leaner.

yes that would be more elegant

on vacation for the next few weeks as of today, will update the code  based on 
your suggestion when I am back unless one of you beats me to it of course.

J.  



 Ignore external links based on domain
 -

 Key: NUTCH-2069
 URL: https://issues.apache.org/jira/browse/NUTCH-2069
 Project: Nutch
  Issue Type: Improvement
  Components: fetcher, parser
Affects Versions: 1.10
Reporter: Julien Nioche
 Fix For: 1.11

 Attachments: NUTCH-2069.patch


 We currently have `db.ignore.external.links` which is a nice way of 
 restricting the crawl based on the hostname. This adds a new parameter 
 'db.ignore.external.links.domain' to do the same based on the domain.
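
 To illustrate the difference between host-based and domain-based filtering discussed in this issue, here is a minimal sketch. It is not the code from NUTCH-2069.patch: the class ExternalLinkFilter, the Mode enum and the helper naiveDomain are hypothetical, and the domain is approximated as the last two host labels, which is wrong for suffixes such as co.uk.

```java
import java.net.URI;
import java.util.Locale;

class ExternalLinkFilter {

    /** Stand-in for a host/domain mode setting; hypothetical. */
    enum Mode { HOST, DOMAIN }

    static String host(String url) {
        String h = URI.create(url).getHost();
        return h == null ? "" : h.toLowerCase(Locale.ROOT);
    }

    /** Naive domain: the last two labels of the host (assumption, see above). */
    static String naiveDomain(String host) {
        String[] labels = host.split("\\.");
        int n = labels.length;
        return n <= 2 ? host : labels[n - 2] + "." + labels[n - 1];
    }

    /** True if toUrl points outside the host (or domain) of fromUrl. */
    static boolean isExternal(String fromUrl, String toUrl, Mode mode) {
        String from = host(fromUrl);
        String to = host(toUrl);
        if (mode == Mode.DOMAIN) {
            from = naiveDomain(from);
            to = naiveDomain(to);
        }
        return !from.equals(to);
    }

    public static void main(String[] args) {
        String from = "http://www.example.com/a";
        String to = "http://blog.example.com/b";
        System.out.println(isExternal(from, to, Mode.HOST));   // true
        System.out.println(isExternal(from, to, Mode.DOMAIN)); // false
    }
}
```

 With a single mode value, as suggested in the comment above, the comparison logic stays in one place and only the key extraction changes.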





[jira] [Updated] (NUTCH-2072) Deflate encoding support is broken when http.content.limit is set to -1

2015-07-30 Thread Tanguy Moal (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2072?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tanguy Moal updated NUTCH-2072:
---
Description: 
The method {{DeflateUtils.inflateBestEffort(byte[] in, int sizeLimit)}} is not 
designed to have sizeLimit set to a negative value.

The fix can be simply to mimic what's done with gzip encoding: if 
{{getMaxContent() < 0}} then use {{Integer.MAX_VALUE}} for the {{sizeLimit}} 
argument.

  was:
The method {{DeflateUtils.inflateBestEffort(byte[] in, int sizeLimit)}} is 
designed to have sizeLimit set to a negative value.

The fix can be simply to mimic what's done with gzip encoding: if 
{{getMaxContent() < 0}} then use {{Integer.MAX_VALUE}} for the {{sizeLimit}} 
argument.


 Deflate encoding support is broken when http.content.limit is set to -1
 ---

 Key: NUTCH-2072
 URL: https://issues.apache.org/jira/browse/NUTCH-2072
 Project: Nutch
  Issue Type: Bug
  Components: plugin, protocol
Reporter: Tanguy Moal
Priority: Minor

 The method {{DeflateUtils.inflateBestEffort(byte[] in, int sizeLimit)}} is 
 not designed to have sizeLimit set to a negative value.
 The fix can be simply to mimic what's done with gzip encoding: if 
 {{getMaxContent() < 0}} then use {{Integer.MAX_VALUE}} for the {{sizeLimit}} 
 argument.
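
 The proposed mapping of a negative limit to Integer.MAX_VALUE can be sketched as follows. This is an illustration built on java.util.zip, not the Nutch DeflateUtils or HttpBase code: the class DeflateLimitSketch and its methods effectiveLimit, inflateBestEffort and deflate are written for this example, and the loop shape is an assumption about how a size-limited inflate might look.

```java
import java.io.ByteArrayOutputStream;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

class DeflateLimitSketch {

    /** Maps a negative "unlimited" setting to the largest usable limit. */
    static int effectiveLimit(int maxContent) {
        return maxContent < 0 ? Integer.MAX_VALUE : maxContent;
    }

    /** Inflates zlib-wrapped data, returning at most sizeLimit bytes. */
    static byte[] inflateBestEffort(byte[] in, int sizeLimit) {
        Inflater inflater = new Inflater();
        inflater.setInput(in);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[1024];
        try {
            // With a negative sizeLimit this condition is false at once,
            // so nothing is decoded: the failure mode the issue describes.
            while (!inflater.finished() && out.size() < sizeLimit) {
                int want = Math.min(buf.length, sizeLimit - out.size());
                int n = inflater.inflate(buf, 0, want);
                if (n == 0 && (inflater.needsInput() || inflater.needsDictionary())) {
                    break; // truncated input: keep what we have
                }
                out.write(buf, 0, n);
            }
        } catch (DataFormatException e) {
            // Best effort: return the bytes decoded before the error.
        } finally {
            inflater.end();
        }
        return out.toByteArray();
    }

    /** Helper used only to produce test input. */
    static byte[] deflate(byte[] data) {
        Deflater deflater = new Deflater();
        deflater.setInput(data);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[1024];
        while (!deflater.finished()) {
            out.write(buf, 0, deflater.deflate(buf));
        }
        deflater.end();
        return out.toByteArray();
    }

    public static void main(String[] args) {
        byte[] packed = deflate("hello hello hello".getBytes());
        System.out.println(inflateBestEffort(packed, -1).length);                 // 0
        System.out.println(inflateBestEffort(packed, effectiveLimit(-1)).length); // 17
    }
}
```

 In this sketch, passing -1 straight through yields an empty result, while routing it through effectiveLimit restores the full content.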





[jira] [Created] (NUTCH-2072) Deflate encoding support is broken when http.content.limit is set to -1

2015-07-30 Thread Tanguy Moal (JIRA)
Tanguy Moal created NUTCH-2072:
--

 Summary: Deflate encoding support is broken when http.content.limit is set to -1
 Key: NUTCH-2072
 URL: https://issues.apache.org/jira/browse/NUTCH-2072
 Project: Nutch
  Issue Type: Bug
  Components: plugin, protocol
Reporter: Tanguy Moal
Priority: Minor


The method {{DeflateUtils.inflateBestEffort(byte[] in, int sizeLimit)}} is 
designed to have sizeLimit set to a negative value.

The fix can be simply to mimic what's done with gzip encoding: if 
{{getMaxContent() < 0}} then use {{Integer.MAX_VALUE}} for the {{sizeLimit}} 
argument.





[jira] [Commented] (NUTCH-2072) Deflate encoding support is broken when http.content.limit is set to -1

2015-07-30 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2072?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14647385#comment-14647385
 ] 

ASF GitHub Bot commented on NUTCH-2072:
---

GitHub user tuxnco opened a pull request:

https://github.com/apache/nutch/pull/48

Fix for NUTCH-2072

{{HttpBase}} : mimic the behaviour of {{processGzipEncoded}} in 
{{processDeflateEncoded}} regarding the handling of the {{http.content.limit}} 
especially when it's negative (unlimited).

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/cogniteev/nutch trunk

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/nutch/pull/48.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #48


commit e5a0a0943b91a64ee0cd71314546f0876df7789b
Author: Tanguy Moal tan...@cogniteev.com
Date:   2015-07-30T09:08:40Z

HttpBase: fix bug when http.content.limit is set to -1 and remote server 
uses deflate encoding






[jira] [Commented] (NUTCH-2072) Deflate encoding support is broken when http.content.limit is set to -1

2015-07-30 Thread Tanguy Moal (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2072?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14647388#comment-14647388
 ] 

Tanguy Moal commented on NUTCH-2072:


I provided a dumb fix there: https://github.com/apache/nutch/pull/48 .

I couldn't find any test regarding handling of HTTP compression and 
{{http.content.limit}} parameter, and setting those seems tedious. Feel free to 
guide me if we want to make that part more robust.



[GitHub] nutch pull request: Fix for NUTCH-2072

2015-07-30 Thread tuxnco
GitHub user tuxnco opened a pull request:

https://github.com/apache/nutch/pull/48

Fix for NUTCH-2072

{{HttpBase}} : mimic the behaviour of {{processGzipEncoded}} in 
{{processDeflateEncoded}} regarding the handling of the {{http.content.limit}} 
especially when it's negative (unlimited).

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/cogniteev/nutch trunk

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/nutch/pull/48.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #48


commit e5a0a0943b91a64ee0cd71314546f0876df7789b
Author: Tanguy Moal tan...@cogniteev.com
Date:   2015-07-30T09:08:40Z

HttpBase: fix bug when http.content.limit is set to -1 and remote server 
uses deflate encoding



