Re: Where happens the inject of Redirects and outlinks?
Hi,

I found it in updatedb. Exactly here: svn.apache.org/viewvc/nutch/branches/2.x/src/java/org/apache/nutch/crawl/DbUpdateReducer.java?view=markup#l207

What was happening is that I had db.update.additions.allowed=false, and that also filters out the redirects :\ (the error appeared after upgrading from 2.0 to 2.3-SNAPSHOT). In my opinion, redirects should not be treated the same as outlinks... :\

Anyway, solved :)

Thanks!

Alfonso

2014-11-19 16:34 GMT+01:00 Alfonso Nishikawa alfonso.nishik...@gmail.com:

Hi, Lewis,

For Nutch 2.3-SNAPSHOT (in the 2.x branch, if I am not wrong).

Many thanks! :)

Alfonso

Hi Alfonso,

On Tue, Nov 18, 2014 at 9:27 AM, dev-digest-h...@nutch.apache.org wrote:

I am going mad searching in plugins and everywhere :( Surely someone here can just point me in a second to a class or a folder (that would be enough).

For which codebase?

Thanks
Lewis

2014-11-18 18:26 GMT+01:00 Alfonso Nishikawa alfonso.nishik...@gmail.com:

Hi,

After https://issues.apache.org/jira/browse/NUTCH-1448 ("Redirected urls should be handled more cleanly (more like an outlink url)"), redirects are treated as outlinks. Where do those outlinks get injected again into the webpage (and specifically the redirects, although there is no difference)? I am going mad searching in plugins and everywhere :( Surely someone here can just point me in a second to a class or a folder (that would be enough).

Thanks!

Alfonso
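The behaviour Alfonso describes can be pictured with a toy model. This is only a sketch under assumptions, not the DbUpdateReducer code: the function name update_db, the (url, kind) tuples, and the example URLs are all hypothetical. It shows why a single "additions allowed" switch would drop redirect targets once redirects arrive through the same path as outlinks.

```python
from typing import Iterable, Set, Tuple


def update_db(existing: Set[str],
              discovered: Iterable[Tuple[str, str]],
              additions_allowed: bool) -> Set[str]:
    """Toy model of an update step (hypothetical, not Nutch code).

    `discovered` holds (url, kind) pairs, where kind is "outlink" or
    "redirect". The kind is deliberately ignored: once redirects are
    handled like outlinks, both go through the same check.
    """
    result = set(existing)
    for url, _kind in discovered:
        if url in result:
            continue  # already known, nothing to add
        if not additions_allowed:
            continue  # new URL dropped, whether outlink or redirect
        result.add(url)
    return result


if __name__ == "__main__":
    db = {"http://a.example/"}
    found = [("http://a.example/next", "outlink"),
             ("http://b.example/", "redirect")]
    print(sorted(update_db(db, found, additions_allowed=False)))
    print(sorted(update_db(db, found, additions_allowed=True)))
```

With additions disabled, the redirect target never reaches the set, which matches the symptom reported above.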
Re: [nsf-polar-usc-students] ExceptionInInitializerError caused by NPE
Great... maybe this is a bug in the Tika codebase!

On Thu, Nov 20, 2014 at 10:02 AM, MengYing Wang mengyingwa...@gmail.com wrote:

Dear Lewis,

Problem solved by replacing rome-1.0.jar with rome-0.9.jar in parse-tika. Same idea as for the feed parser in https://issues.apache.org/jira/browse/NUTCH-1494.

Thanks.

Best,
Mengying (Angela) Wang

On Wed, Nov 19, 2014 at 9:08 PM, Lewis John Mcgibbney lewis.mcgibb...@gmail.com wrote:

Try removing 0.9 from that directory (copy it elsewhere) and attempt to re-parse the directory.

Thanks

On Wed, Nov 19, 2014 at 8:36 PM, MengYing Wang mengyingwa...@gmail.com wrote:

Dear Lewis,

In feed, it is rome-0.9 (http://svn.apache.org/repos/asf/nutch/trunk/src/plugin/feed/ivy.xml), while in parse-tika it is rome-1.0 (http://svn.apache.org/repos/asf/nutch/trunk/src/plugin/parse-tika/plugin.xml). I have enabled both feed and parse-tika in nutch-site.xml.

Thanks.

Best,
Mengying (Angela) Wang

On Wed, Nov 19, 2014 at 8:42 AM, Lewis John Mcgibbney lewis.mcgibb...@gmail.com wrote:

Which version of the Rome feed parser is in your classpath? It may be activated via the Nutch 'feed' plugin, or it may also come in via the Nutch 'parse-tika' plugin. Please determine which version(s) are on the classpath and which are being used.

On Wednesday, November 19, 2014, MengYing Wang mengyingwa...@gmail.com wrote:

Hi everyone,

In the Nutch parse step, I received the following error. Does anyone know how to solve the problem? I appreciate your help!

$ /cygdrive/d/nutch_trunk/runtime/local/bin/nutch parse -D mapred.reduce.tasks=2 -D mapred.child.java.opts=-Xmx1000m -D mapred.reduce.tasks.speculative.execution=false -D mapred.map.tasks.speculative.execution=false -D mapred.compress.map.output=true -D mapred.skip.attempts.to.start.skipping=2 -D mapred.skip.map.max.skip.records=1 crawlId/segments/20141118235323
java.lang.ExceptionInInitializerError
    at com.sun.syndication.io.SyndFeedInput.build(SyndFeedInput.java:136)
    at org.apache.tika.parser.feed.FeedParser.parse(FeedParser.java:70)
    at org.apache.nutch.parse.tika.TikaParser.getParse(TikaParser.java:103)
    at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:95)
    at org.apache.nutch.parse.ParseSegment.map(ParseSegment.java:101)
    at org.apache.nutch.parse.ParseSegment.map(ParseSegment.java:44)
    at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
    at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:358)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
    at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:177)
Caused by: java.lang.NullPointerException
    at java.util.Properties$LineReader.readLine(Properties.java:434)
    at java.util.Properties.load0(Properties.java:353)
    at java.util.Properties.load(Properties.java:341)
    at com.sun.syndication.io.impl.PropertiesLoader.init(PropertiesLoader.java:74)
    at com.sun.syndication.io.impl.PropertiesLoader.getPropertiesLoader(PropertiesLoader.java:46)
    at com.sun.syndication.io.impl.PluginManager.init(PluginManager.java:54)
    at com.sun.syndication.io.impl.PluginManager.init(PluginManager.java:46)
    at com.sun.syndication.feed.synd.impl.Converters.init(Converters.java:40)
    at com.sun.syndication.feed.synd.SyndFeedImpl.clinit(SyndFeedImpl.java:59)
    ... 10 more

--
Best,
Mengying (Angela) Wang

--
You received this message because you are subscribed to the Google Groups nsf-polar-usc-students group.
To unsubscribe from this group and stop receiving emails from it, send an email to nsf-polar-usc-students+unsubscr...@googlegroups.com. To post to this group, send email to nsf-polar-usc-stude...@googlegroups.com. Visit this group at http://groups.google.com/group/nsf-polar-usc-students. To view this discussion on the web visit https://groups.google.com/d/msgid/nsf-polar-usc-students/CAJX%3DLAuzcTtYe61Avq1EthNRYN6M-%2BGk%2B7PntdOYvQ4ZkrEJKw%40mail.gmail.com. For more options, visit https://groups.google.com/d/optout.
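Lewis's request in this thread, to determine which Rome version(s) are present, can be approximated by scanning a directory tree for jar files. The sketch below is an assumption-laden helper, not part of Nutch: the function name jar_versions and the name-version.jar filename pattern are choices made for illustration, and the directory you pass in (for example a plugins folder) is up to you. It only lists files on disk; it does not tell you which jar a running JVM actually loaded.

```python
import re
from collections import defaultdict
from pathlib import Path
from typing import Dict, List

# Assumes jars are named <artifact>-<version>.jar, e.g. rome-0.9.jar.
JAR_RE = re.compile(r"^(?P<name>.+?)-(?P<version>\d[\w.]*)\.jar$")


def jar_versions(root: str, artifact: str) -> Dict[str, List[str]]:
    """Map each version of `artifact` found under `root` to its jar paths."""
    found: Dict[str, List[str]] = defaultdict(list)
    for jar in sorted(Path(root).rglob("*.jar")):
        match = JAR_RE.match(jar.name)
        if match and match.group("name") == artifact:
            found[match.group("version")].append(str(jar))
    return dict(found)


if __name__ == "__main__":
    import sys

    # Usage: python jar_versions.py <directory> <artifact>
    root_dir, artifact_name = sys.argv[1], sys.argv[2]
    for version, paths in jar_versions(root_dir, artifact_name).items():
        print(version)
        for path in paths:
            print("   ", path)
```

If more than one version is printed for the same artifact, that is the kind of rome-0.9 versus rome-1.0 mismatch discussed in the thread.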
Nutch in Windows: Failed to set permissions of path
Hi everyone,

If you run Nutch on Windows under Cygwin, it may fail due to a permission error.

$ ./crawl urls crawlId http://localhost:8983/solr/collection1 2
2014-11-17 15:39:25,041 ERROR security.UserGroupInformation - PriviledgedActionException as:YangLu cause:java.io.IOException: Failed to set permissions of path: \tmp\hadoop-YangLu\mapred\staging\YangLu534937598\.staging to 0700
2014-11-17 15:39:25,046 ERROR crawl.Injector - Injector: java.io.IOException: Failed to set permissions of path: \tmp\hadoop-YangLu\mapred\staging\YangLu534937598\.staging to 0700
    at org.apache.hadoop.fs.FileUtil.checkReturnValue(FileUtil.java:691)
    at org.apache.hadoop.fs.FileUtil.setPermission(FileUtil.java:664)
    at org.apache.hadoop.fs.RawLocalFileSystem.setPermission(RawLocalFileSystem.java:514)
    at org.apache.hadoop.fs.RawLocalFileSystem.mkdirs(RawLocalFileSystem.java:349)
    at org.apache.hadoop.fs.FilterFileSystem.mkdirs(FilterFileSystem.java:193)
    at org.apache.hadoop.mapreduce.JobSubmissionFiles.getStagingDir(JobSubmissionFiles.java:126)
    at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:942)
    at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:936)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:415)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1190)
    at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:936)
    at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:910)
    at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1353)
    at org.apache.nutch.crawl.Injector.inject(Injector.java:324)
    at org.apache.nutch.crawl.Injector.run(Injector.java:380)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
    at org.apache.nutch.crawl.Injector.main(Injector.java:370)

To solve the problem, download Hadoop Core 0.20.2 (http://mvnrepository.com/artifact/org.apache.hadoop/hadoop-core/0.20.2) from the MVN repository into your (nutch-home)/lib directory. For details, please refer to http://stackoverflow.com/questions/15188050/nutch-in-windows-failed-to-set-permissions-of-path.

Thanks.

--
Best,
Mengying (Angela) Wang
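The stack trace shows the failure coming from FileUtil.checkReturnValue while the staging directory is being set to 0700. A rough way to see whether a given environment honours such a request is sketched below. It is an illustration in Python, not Hadoop's logic: the helper name can_apply_mode is hypothetical, and the check simply sets a mode and reads it back. On Windows, Python's os.chmod only toggles the read-only flag, so the read-back mode is not expected to equal 0700 there.

```python
import os
import stat
import tempfile


def can_apply_mode(path: str, mode: int = 0o700) -> bool:
    """Return True if `mode` can be set on `path` and reads back unchanged."""
    try:
        os.chmod(path, mode)
    except OSError:
        return False  # the platform refused the permission change
    return stat.S_IMODE(os.stat(path).st_mode) == mode


if __name__ == "__main__":
    # A throwaway directory stands in for the job staging directory.
    with tempfile.TemporaryDirectory() as staging:
        print("0700 applied:", can_apply_mode(staging))
```

A False result would suggest that the platform, rather than Nutch itself, is what rejects the 0700 request.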
Re: [nsf-polar-usc-students] Nutch in Windows: Failed to set permissions of path
This is not a good workaround at all. There are many reasons why this is not a good idea. If I were you, I would seriously suggest that you download VirtualBox and work on a Linux image. It will make your life so much easier, and the barrier to entry is very low these days.

Lewis
Re: [nsf-polar-usc-students] ExceptionInInitializerError caused by NPE
Great, can you attach a patch for this?

Chris Mattmann
chris.mattm...@gmail.com
Re: [nsf-polar-usc-students] ExceptionInInitializerError caused by NPE
Dear Prof. Mattmann,

Yes, I will create a JIRA issue and attach the patch. One more thing: do you happen to know how to modify the parse-tika configuration files so that rome-0.9.jar is downloaded automatically instead of rome-1.0.jar? Currently, if you run the ant -f ./build-ivy.xml command in the parse-tika folder, rome-1.0.jar is downloaded. I have to manually download the rome-0.9.jar file into the src/plugin/parse-tika/lib directory and then modify the src/plugin/parse-tika/plugin.xml file to use rome-0.9.jar instead of rome-1.0.jar, which is not very convenient.

Thanks for your help!

Best,
Mengying (Angela) Wang
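The manual plugin.xml edit described in this message (pointing the plugin at rome-0.9.jar instead of rome-1.0.jar) could be scripted. The following is a minimal sketch, assuming the file lists its jars as library elements with a name attribute; the function swap_library and the SAMPLE document are hypothetical and are not copied from the real parse-tika plugin.xml. It does not address the automatic download question raised above.

```python
import xml.etree.ElementTree as ET


def swap_library(plugin_xml: str, old: str, new: str) -> str:
    """Rename every <library name="old"/> entry to `new` and return the XML."""
    root = ET.fromstring(plugin_xml)
    for lib in root.iter("library"):
        if lib.get("name") == old:
            lib.set("name", new)
    return ET.tostring(root, encoding="unicode")


# Hypothetical stand-in for a plugin descriptor, for illustration only.
SAMPLE = """<plugin id="parse-tika">
  <runtime>
    <library name="parse-tika.jar"/>
    <library name="rome-1.0.jar"/>
  </runtime>
</plugin>"""

if __name__ == "__main__":
    print(swap_library(SAMPLE, "rome-1.0.jar", "rome-0.9.jar"))
```

Note that ElementTree drops XML comments by default, so a header comment in a real file would be lost; for a one-line change, editing by hand may remain the safer option.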