[jira] [Commented] (NUTCH-1785) Ability to index raw content
[ https://issues.apache.org/jira/browse/NUTCH-1785?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14648294#comment-14648294 ]

Chris A. Mattmann commented on NUTCH-1785:
------------------------------------------

+1 to commit from me.

> Ability to index raw content
> ----------------------------
>                 Key: NUTCH-1785
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1785
>             Project: Nutch
>          Issue Type: New Feature
>          Components: indexer
>            Reporter: Markus Jelsma
>            Assignee: Markus Jelsma
>            Priority: Minor
>             Fix For: 1.11
>         Attachments: NUTCH-1785-trunk.patch, NUTCH-1785-trunk.patch, NUTCH-1785-trunk.patch, NUTCH-1785-trunk.patch, NUTCH-1785-trunkv2.patch
>
> Some use-cases require Nutch to actually write the raw content to a configured indexing back-end. Since Content is never read, a plugin is out of the question and therefore we need to force IndexJob to process Content as well.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
[jira] [Resolved] (NUTCH-1785) Ability to index raw content
[ https://issues.apache.org/jira/browse/NUTCH-1785?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Lewis John McGibbney resolved NUTCH-1785.
-----------------------------------------
    Resolution: Fixed

Committed revision 1693507.
[jira] [Updated] (NUTCH-1785) Ability to index raw content
[ https://issues.apache.org/jira/browse/NUTCH-1785?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Lewis John McGibbney updated NUTCH-1785:
----------------------------------------
    Attachment: NUTCH-1785-trunkv2.patch

This works perfectly for me locally. I would like to commit EoB today if no objections.
[jira] [Comment Edited] (NUTCH-1785) Ability to index raw content
[ https://issues.apache.org/jira/browse/NUTCH-1785?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14648249#comment-14648249 ]

Lewis John McGibbney edited comment on NUTCH-1785 at 7/30/15 8:21 PM:
----------------------------------------------------------------------

This works perfectly for me locally. I would like to commit EoB today if no objections. Excellent work [~markus17]

was (Author: lewismc):
This works perfectly for me locally. I would like to commit EoB today if no objections.
[jira] [Commented] (NUTCH-1785) Ability to index raw content
[ https://issues.apache.org/jira/browse/NUTCH-1785?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14648278#comment-14648278 ]

Thad Guidry commented on NUTCH-1785:
------------------------------------

[~lewismc] No objections. It worked perfectly for me as well. I have been using it for a few months now, pushing into ElasticSearch.
[jira] [Commented] (NUTCH-1785) Ability to index raw content
[ https://issues.apache.org/jira/browse/NUTCH-1785?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14648380#comment-14648380 ]

Hudson commented on NUTCH-1785:
-------------------------------

SUCCESS: Integrated in Nutch-trunk #3233 (See [https://builds.apache.org/job/Nutch-trunk/3233/])
NUTCH-1785 Ability to index raw content (lewismc: http://svn.apache.org/viewvc/nutch/trunk/?view=rev&rev=1693507)
* /nutch/trunk/CHANGES.txt
* /nutch/trunk/conf/schema-solr4.xml
* /nutch/trunk/conf/schema.xml
* /nutch/trunk/ivy/ivy.xml
* /nutch/trunk/src/java/org/apache/nutch/indexer/IndexerMapReduce.java
* /nutch/trunk/src/java/org/apache/nutch/indexer/IndexingJob.java
[jira] [Updated] (NUTCH-2071) A parser failure on a single document may fail crawling job
[ https://issues.apache.org/jira/browse/NUTCH-2071?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Arkadi Kosmynin updated NUTCH-2071:
-----------------------------------
    Attachment: NUTCH-2071.diff

> A parser failure on a single document may fail crawling job
> -----------------------------------------------------------
>                 Key: NUTCH-2071
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2071
>             Project: Nutch
>          Issue Type: Bug
>          Components: parser
>            Reporter: Arkadi Kosmynin
>         Attachments: NUTCH-2071.diff
>
> java.io.IOException: Job failed!
>     at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1357)
>     at org.apache.nutch.parse.ParseSegment.parse(ParseSegment.java:213)
>     ...
> Caused by: java.lang.IncompatibleClassChangeError: class org.apache.tika.parser.asm.XHTMLClassVisitor has interface org.objectweb.asm.ClassVisitor as super class
>     at java.lang.ClassLoader.defineClass1(Native Method)
>     at java.lang.ClassLoader.defineClass(ClassLoader.java:760)
>     at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
>     at java.net.URLClassLoader.defineClass(URLClassLoader.java:467)
>     at java.net.URLClassLoader.access$100(URLClassLoader.java:73)
>     at java.net.URLClassLoader$1.run(URLClassLoader.java:368)
>     at java.net.URLClassLoader$1.run(URLClassLoader.java:362)
>     at java.security.AccessController.doPrivileged(Native Method)
>     at java.net.URLClassLoader.findClass(URLClassLoader.java:361)
>     at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
>     at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
>     at org.apache.tika.parser.asm.ClassParser.parse(ClassParser.java:51)
>     at org.apache.nutch.parse.tika.TikaParser.getParse(TikaParser.java:98)
>     at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:103)
>
> Suggested fix in ParseUtil: Replace
>
>     if (maxParseTime != -1)
>       parseResult = runParser(parsers[i], content);
>     else
>       parseResult = parsers[i].getParse(content);
>
> with
>
>     try {
>       if (maxParseTime != -1)
>         parseResult = runParser(parsers[i], content);
>       else
>         parseResult = parsers[i].getParse(content);
>     } catch (Throwable e) {
>       LOG.warn("Parsing " + content.getUrl() + " with " +
>           parsers[i].getClass().getName() + " failed: " + e.getMessage());
>       parseResult = null;
>     }
[jira] [Updated] (NUTCH-2071) A parser failure on a single document may fail crawling job
[ https://issues.apache.org/jira/browse/NUTCH-2071?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Arkadi Kosmynin updated NUTCH-2071:
-----------------------------------
         Flags: Patch
    Patch Info: Patch Available
[jira] [Created] (NUTCH-2071) A parser failure on a single document may fail crawling job
Arkadi Kosmynin created NUTCH-2071:
-----------------------------------

             Summary: A parser failure on a single document may fail crawling job
                 Key: NUTCH-2071
                 URL: https://issues.apache.org/jira/browse/NUTCH-2071
             Project: Nutch
          Issue Type: Bug
          Components: parser
            Reporter: Arkadi Kosmynin
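The fix suggested for NUTCH-2071 wraps each parser invocation in a try/catch on Throwable, so a single bad document produces a warning and a null parse result instead of failing the whole job (Throwable matters here because the failure is an Error, IncompatibleClassChangeError, not an Exception). A minimal standalone sketch of that pattern; the Parser interface and names below are simplified stand-ins, not Nutch's actual classes:

```java
public class SafeParse {
    // Simplified stand-in for a Nutch Parser: takes a URL, returns parsed text.
    interface Parser {
        String getParse(String url) throws Exception;
    }

    // Wraps a parser call so that any Throwable (including Errors such as
    // IncompatibleClassChangeError) is logged and turned into a null result,
    // instead of propagating and killing the whole parse job.
    static String parseSafely(Parser parser, String url) {
        try {
            return parser.getParse(url);
        } catch (Throwable e) {
            System.err.println("Parsing " + url + " with "
                    + parser.getClass().getName() + " failed: " + e.getMessage());
            return null;
        }
    }

    public static void main(String[] args) {
        Parser ok = u -> "parsed text";
        Parser broken = u -> { throw new IncompatibleClassChangeError("boom"); };

        System.out.println(parseSafely(ok, "http://example.com/"));      // parsed text
        System.out.println(parseSafely(broken, "http://example.com/"));  // null
    }
}
```

Catching Throwable is usually discouraged, but in a batch job where one document must not abort thousands of others, logging and continuing is the pragmatic trade-off the patch makes.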
[jira] [Commented] (NUTCH-2069) Ignore external links based on domain
[ https://issues.apache.org/jira/browse/NUTCH-2069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14647467#comment-14647467 ]

Julien Nioche commented on NUTCH-2069:
--------------------------------------

Hi [~wastl-nagel] and [~markus17]. BTW, I did not mean to be short in my previous message, but I was typing from my phone ;-) I know the difficulties of enforcing the code formatting systematically, but I thought I might as well fix it while I was working on that part of the code. Feel free to remove the bits from the patch that are about the formatting only.

bq. we could define this as two properties `db.ignore.external.links` + `db.ignore.external.links.mode`. The latter can be host or domain, similar to other properties (partition.url.mode, generator.count.mode, fetcher.queue.mode). That would be extensible and can make the code leaner.

Yes, that would be more elegant. I am on vacation for the next few weeks as of today and will update the code based on your suggestion when I am back, unless one of you beats me to it of course.

J.

> Ignore external links based on domain
> -------------------------------------
>                 Key: NUTCH-2069
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2069
>             Project: Nutch
>          Issue Type: Improvement
>          Components: fetcher, parser
>    Affects Versions: 1.10
>            Reporter: Julien Nioche
>             Fix For: 1.11
>         Attachments: NUTCH-2069.patch
>
> We currently have `db.ignore.external.links` which is a nice way of restricting the crawl based on the hostname. This adds a new parameter 'db.ignore.external.links.domain' to do the same based on the domain.
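The two-property scheme discussed in the comment above would look roughly like this in nutch-site.xml. This is a sketch of the proposal, not a shipped configuration: the `db.ignore.external.links.mode` property name and the `byHost`/`byDomain` values are taken from the suggestion in the thread and from the naming style of the analogous properties (generator.count.mode, fetcher.queue.mode):

```xml
<!-- Sketch of the proposed configuration: one flag to ignore external
     links, one mode property to choose the scope of "external". -->
<property>
  <name>db.ignore.external.links</name>
  <value>true</value>
</property>
<property>
  <name>db.ignore.external.links.mode</name>
  <!-- proposed values: byHost (current behaviour) or byDomain -->
  <value>byDomain</value>
</property>
```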
[jira] [Updated] (NUTCH-2072) Deflate encoding support is broken when http.content.limit is set to -1
[ https://issues.apache.org/jira/browse/NUTCH-2072?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tanguy Moal updated NUTCH-2072:
-------------------------------
    Description: 
The method {{DeflateUtils.inflateBestEffort(byte[] in, int sizeLimit)}} is not designed to have sizeLimit set to a negative value.
The fix can simply be to mimic what's done with gzip encoding: if {{getMaxContent() < 0}} then use {{Integer.MAX_VALUE}} for the {{sizeLimit}} argument.

  was:
The method {{DeflateUtils.inflateBestEffort(byte[] in, int sizeLimit)}} is designed to have sizeLimit set to a negative value.
The fix can simply be to mimic what's done with gzip encoding: if {{getMaxContent() < 0}} then use {{Integer.MAX_VALUE}} for the {{sizeLimit}} argument.

> Deflate encoding support is broken when http.content.limit is set to -1
> -----------------------------------------------------------------------
>                 Key: NUTCH-2072
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2072
>             Project: Nutch
>          Issue Type: Bug
>          Components: plugin, protocol
>            Reporter: Tanguy Moal
>            Priority: Minor
>
> The method {{DeflateUtils.inflateBestEffort(byte[] in, int sizeLimit)}} is not designed to have sizeLimit set to a negative value.
> The fix can simply be to mimic what's done with gzip encoding: if {{getMaxContent() < 0}} then use {{Integer.MAX_VALUE}} for the {{sizeLimit}} argument.
[jira] [Created] (NUTCH-2072) Deflate encoding support is broken when http.content.limit is set to -1
Tanguy Moal created NUTCH-2072:
-------------------------------

             Summary: Deflate encoding support is broken when http.content.limit is set to -1
                 Key: NUTCH-2072
                 URL: https://issues.apache.org/jira/browse/NUTCH-2072
             Project: Nutch
          Issue Type: Bug
          Components: plugin, protocol
            Reporter: Tanguy Moal
            Priority: Minor

The method {{DeflateUtils.inflateBestEffort(byte[] in, int sizeLimit)}} is designed to have sizeLimit set to a negative value.
The fix can simply be to mimic what's done with gzip encoding: if {{getMaxContent() < 0}} then use {{Integer.MAX_VALUE}} for the {{sizeLimit}} argument.
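The fix described for NUTCH-2072 is a one-line normalization: map a negative http.content.limit to Integer.MAX_VALUE before handing it to the best-effort inflater, as the gzip path already does. A standalone sketch of the idea; the names `maxContent`, `effectiveSizeLimit` and this `inflateBestEffort` are simplified stand-ins, not the actual HttpBase/DeflateUtils code:

```java
import java.io.ByteArrayOutputStream;
import java.util.Arrays;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

public class DeflateLimit {
    // Stand-in for http.content.limit; -1 means "unlimited".
    static int maxContent = -1;

    // The fix: mimic the gzip path by mapping a negative limit to
    // Integer.MAX_VALUE instead of passing it through unchanged.
    static int effectiveSizeLimit() {
        return maxContent < 0 ? Integer.MAX_VALUE : maxContent;
    }

    // Best-effort inflate, truncating output at sizeLimit bytes.
    static byte[] inflateBestEffort(byte[] in, int sizeLimit) throws Exception {
        Inflater inflater = new Inflater();
        inflater.setInput(in);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[4096];
        while (!inflater.finished() && out.size() < sizeLimit) {
            int n = inflater.inflate(buf);
            if (n == 0) break; // needs more input or done
            out.write(buf, 0, Math.min(n, sizeLimit - out.size()));
        }
        return out.toByteArray();
    }

    public static void main(String[] args) throws Exception {
        // Compress a sample payload with zlib/deflate.
        byte[] data = "hello deflate".getBytes("UTF-8");
        Deflater deflater = new Deflater();
        deflater.setInput(data);
        deflater.finish();
        byte[] compressed = new byte[256];
        int len = deflater.deflate(compressed);

        // With maxContent = -1, the normalized limit lets the payload through
        // instead of truncating everything to an (effectively) negative size.
        byte[] out = inflateBestEffort(Arrays.copyOf(compressed, len),
                                       effectiveSizeLimit());
        System.out.println(new String(out, "UTF-8")); // hello deflate
    }
}
```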
[jira] [Commented] (NUTCH-2072) Deflate encoding support is broken when http.content.limit is set to -1
[ https://issues.apache.org/jira/browse/NUTCH-2072?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14647385#comment-14647385 ]

ASF GitHub Bot commented on NUTCH-2072:
---------------------------------------

GitHub user tuxnco opened a pull request:

    https://github.com/apache/nutch/pull/48

    Fix for NUTCH-2072

    {{HttpBase}}: mimic the behaviour of {{processGzipEncoded}} in {{processDeflateEncoded}} regarding the handling of {{http.content.limit}}, especially when it's negative (unlimited).

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/cogniteev/nutch trunk

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/nutch/pull/48.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #48

commit e5a0a0943b91a64ee0cd71314546f0876df7789b
Author: Tanguy Moal <tan...@cogniteev.com>
Date:   2015-07-30T09:08:40Z

    HttpBase: fix bug when http.content.limit is set to -1 and remote server uses deflate encoding
[jira] [Commented] (NUTCH-2072) Deflate encoding support is broken when http.content.limit is set to -1
[ https://issues.apache.org/jira/browse/NUTCH-2072?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14647388#comment-14647388 ]

Tanguy Moal commented on NUTCH-2072:
------------------------------------

I provided a dumb fix there: https://github.com/apache/nutch/pull/48 . I couldn't find any tests covering the handling of HTTP compression and the {{http.content.limit}} parameter, and setting those up seems tedious. Feel free to guide me if we want to make that part more robust.
[GitHub] nutch pull request: Fix for NUTCH-2072
GitHub user tuxnco opened a pull request:

    https://github.com/apache/nutch/pull/48

    Fix for NUTCH-2072

    {{HttpBase}}: mimic the behaviour of {{processGzipEncoded}} in {{processDeflateEncoded}} regarding the handling of {{http.content.limit}}, especially when it's negative (unlimited).

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/cogniteev/nutch trunk

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/nutch/pull/48.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #48

commit e5a0a0943b91a64ee0cd71314546f0876df7789b
Author: Tanguy Moal <tan...@cogniteev.com>
Date:   2015-07-30T09:08:40Z

    HttpBase: fix bug when http.content.limit is set to -1 and remote server uses deflate encoding

---
If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA.
---