[ https://issues.apache.org/jira/browse/NUTCH-1087?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13410322#comment-13410322 ]
Markus Jelsma commented on NUTCH-1087:
--------------------------------------
Works nicely, but it cannot be run from the runtime/local directory, which is
where the wiki usually tells users to run commands from.
{code}$ bin/crawl urls/ crawl/crawldb http://localhost:8983/solr 2
bin/crawl: line 89: ./nutch: No such file or directory{code}
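The error above is consistent with the script calling ./nutch relative to the current working directory. One possible fix, sketched here on the assumption that the nutch launcher sits in the same directory as the crawl script, is to resolve the script's own directory first. The function name resolve_script_dir is hypothetical and not taken from the attached patch.

```shell
#!/bin/sh
# Sketch only: print the absolute directory that contains a script,
# so sibling commands can be called regardless of the caller's cwd.
# resolve_script_dir is a hypothetical helper, not part of NUTCH-1087.patch.
resolve_script_dir() {
  # $1: path to the script as it was invoked (typically "$0")
  ( cd "$(dirname "$1")" && pwd )
}

# Hypothetical usage inside bin/crawl:
#   bin=$(resolve_script_dir "$0")
#   "$bin/nutch" inject ...
```

With this in place the script could be started as bin/crawl from runtime/local or as ./crawl from runtime/local/bin, since the launcher is no longer looked up relative to the working directory.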
When run from runtime/local/bin instead, all goes well until invertlinks:
{code}LinkDb: starting at 2012-07-10 15:09:12
LinkDb: linkdb: ../crawl/crawldb/linkdb
LinkDb: URL normalize: true
LinkDb: URL filter: true
LinkDb: internal links will be ignored.
LinkDb: adding segment: 20120710150834
LinkDb: org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: file:/home/markus/trunk/runtime/local/bin/20120710150834/parse_data
at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:197)
at org.apache.hadoop.mapred.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:40)
at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:208)
at org.apache.hadoop.mapred.JobClient.writeOldSplits(JobClient.java:989)
at org.apache.hadoop.mapred.JobClient.writeSplits(JobClient.java:981)
at org.apache.hadoop.mapred.JobClient.access$600(JobClient.java:174)
at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:897)
at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:850)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1121)
at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:850)
at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:824)
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1261)
at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:180)
at org.apache.nutch.crawl.LinkDb.run(LinkDb.java:295)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
at org.apache.nutch.crawl.LinkDb.main(LinkDb.java:260){code}
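The trace suggests that the bare segment name (20120710150834) reaches invertlinks without its parent directory, so Hadoop resolves it against the working directory. A minimal sketch of one way to avoid that is to always build the full segment path before passing it on. The helper latest_segment_path and the variable CRAWL_PATH are hypothetical names used for illustration, and the sketch assumes segment directories are named by timestamp as in the log above.

```shell
#!/bin/sh
# Sketch only: print the full path of the newest segment under a
# segments directory. Assumes timestamp-named segment directories,
# so a numeric sort puts the most recent one last.
# latest_segment_path is a hypothetical helper, not part of the patch.
latest_segment_path() {
  segments_dir=$1
  segment=$(ls "$segments_dir" | sort -n | tail -n 1)
  printf '%s/%s\n' "$segments_dir" "$segment"
}

# Hypothetical usage inside the crawl script (CRAWL_PATH is assumed):
#   "$bin/nutch" invertlinks "$CRAWL_PATH/linkdb" \
#       "$(latest_segment_path "$CRAWL_PATH/segments")"
```

Passing the complete path keeps the step independent of the directory the script was started from.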
I also think a 2 GB heap for child tasks is far too much for common installations.
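One way to address the heap concern would be to make the child heap size overridable instead of hard-coding it. The following is only a sketch: NUTCH_CHILD_HEAP_MB and child_java_opts are hypothetical names, and the 1000 MB default is an arbitrary placeholder rather than a recommended value. mapred.child.java.opts is the Hadoop property that carries JVM options for child tasks.

```shell
#!/bin/sh
# Sketch only: build the -D option that sets the child task heap size,
# letting the user override it through an environment variable.
# NUTCH_CHILD_HEAP_MB is a hypothetical variable; 1000 is a placeholder default.
child_java_opts() {
  heap_mb=${NUTCH_CHILD_HEAP_MB:-1000}
  printf '%s\n' "-D mapred.child.java.opts=-Xmx${heap_mb}m"
}

# Hypothetical usage inside the crawl script:
#   commonOptions="$(child_java_opts)"
#   "$bin/nutch" generate $commonOptions ...
```

Small installations could then lower the value without editing the script, while larger ones could raise it.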
> Deprecate crawl command and replace with example script
> -------------------------------------------------------
>
> Key: NUTCH-1087
> URL: https://issues.apache.org/jira/browse/NUTCH-1087
> Project: Nutch
> Issue Type: Task
> Affects Versions: 1.4
> Reporter: Markus Jelsma
> Assignee: Julien Nioche
> Priority: Minor
> Fix For: 1.6
>
> Attachments: NUTCH-1087.patch
>
>
> * remove the crawl command
> * add basic crawl shell script
> See thread:
> http://www.mail-archive.com/[email protected]/msg03848.html