Content-Length limit, URL filter and a few minor issues
-------------------------------------------------------

                 Key: NUTCH-950
                 URL: https://issues.apache.org/jira/browse/NUTCH-950
             Project: Nutch
          Issue Type: Bug
    Affects Versions: 2.0
            Reporter: Alexis


1. crawl command (nutch1.patch)

The class was renamed to Crawler but the references to it were not updated.


2. URL filter (nutch2.patch)

This avoids an NPE on bogus URLs whose host does not have a suffix.
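
The kind of guard involved can be sketched as follows. This is only an 
illustration, not the content of nutch2.patch: the class SuffixGuardSketch and 
its methods hostSuffix and filter are hypothetical names, and the only Nutch 
convention assumed is that a URL filter returns the URL when it is accepted 
and null when it is rejected.

```java
import java.net.MalformedURLException;
import java.net.URL;
import java.util.Locale;
import java.util.Set;

/** Hypothetical stand-in for a suffix-based URL filter; not the actual Nutch class. */
class SuffixGuardSketch {

    /** Returns the text after the last dot of the host, or null if the host has no suffix. */
    static String hostSuffix(String url) {
        try {
            String host = new URL(url).getHost();
            int dot = host.lastIndexOf('.');
            // Hosts such as "localhost" or "example." have no usable suffix.
            if (dot < 0 || dot == host.length() - 1) {
                return null;
            }
            return host.substring(dot + 1).toLowerCase(Locale.ROOT);
        } catch (MalformedURLException e) {
            // Bogus URLs are treated the same way as hosts without a suffix.
            return null;
        }
    }

    /** Returns the URL if its host suffix is allowed, or null to reject it. */
    static String filter(String url, Set<String> allowedSuffixes) {
        String suffix = hostSuffix(url);
        if (suffix == null) {
            // Guard: reject early instead of dereferencing a null suffix later.
            return null;
        }
        return allowedSuffixes.contains(suffix) ? url : null;
    }
}
```

With this guard, a URL such as http://localhost/ is simply rejected rather 
than raising an exception.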


3. Content-Length limit (nutch3.patch)

This is related to NUTCH-899.
The patch prevents the entire flush operation on the Gora datastore from 
crashing when the MySQL blob limit is exceeded by a few bytes. Both the 
protocol-http and protocol-httpclient plugins were affected.
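
The general idea of capping fetched content before it reaches the datastore 
can be sketched as below. This is a minimal illustration, not the code in 
nutch3.patch: ContentLimitSketch and capContent are hypothetical names, the 
limit is passed in as a plain parameter, and treating a negative limit as "no 
truncation" is an assumption made for this sketch.

```java
import java.util.Arrays;

/** Hypothetical helper illustrating a hard cap on stored content size. */
class ContentLimitSketch {

    /**
     * Returns the content unchanged if it fits within maxBytes,
     * otherwise a copy truncated to exactly maxBytes.
     * A negative maxBytes disables truncation (assumption for this sketch).
     */
    static byte[] capContent(byte[] content, int maxBytes) {
        if (maxBytes < 0 || content.length <= maxBytes) {
            return content;
        }
        // Enforce the cap on the bytes actually read, so that a few extra
        // bytes past the limit can never reach the datastore.
        return Arrays.copyOf(content, maxBytes);
    }
}
```

Applying the cap to the bytes actually read, rather than trusting the 
Content-Length header, keeps the stored value within the limit even when the 
server sends more than it announced.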


4. Ivy configuration (nutch4.patch)
- Change the xercesImpl and restlet versions. Both changes are required: the 
current xercesImpl version makes a JUnit test crash, and the current restlet 
version is missing from the default Maven repository.

- Add gora-hbase and zookeeper (an HBase dependency), plus the MySQL connector. 
These jars are necessary to run Gora with the HBase or MySQL datastores. (This 
is more a suggestion than a requirement.)

- Add com.jcraft/jsch, which is a protocol-sftp plugin dependency. 


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.