Re: Nutch 1.5 Deploy Mode Doesn't Work like Nutch 1.4 Deploy Mode

2012-06-19 Thread Julien Nioche
Good catch. Could you please open an issue on JIRA? On 19 June 2012 00:09, sidbatra siddharthaba...@gmail.com wrote: This turns out to be a genuine bug with an easy fix. build.xml is configured to generate a job file titled apache-nutch-1.5.job but the deploy binary is still looking for

Re: Nutch 1.5 Deploy Mode Doesn't Work like Nutch 1.4 Deploy Mode

2012-06-19 Thread Julien Nioche
Alternatively, modify the bin/nutch script to make it more robust:

    # NUTCH_JOB
    if [ -f ${NUTCH_HOME}/*nutch*.job ]; then
      local=false
      for f in $NUTCH_HOME/*nutch*.job; do
        NUTCH_JOB=$f;
      done
    fi

On 19 June 2012 00:09, sidbatra siddharthaba...@gmail.com wrote: This turns out to be a
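The glob-based lookup suggested in this reply can be sketched a little more defensively. The following is only an illustration, not the actual bin/nutch code: `find_nutch_job` is a hypothetical helper name, and the fallback of `NUTCH_HOME` to the current directory is an assumption made so the sketch runs on its own. It avoids passing a glob to `[ -f ... ]`, which fails when more than one file matches, and, like the loop in the reply, keeps the last matching file.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of bin/nutch): print the last file matching
# *nutch*.job in the given directory; return non-zero if there is none.
find_nutch_job() {
  local dir="$1" f found=""
  for f in "$dir"/*nutch*.job; do
    # An unmatched glob stays literal, so confirm the file really exists.
    if [ -f "$f" ]; then
      found="$f"
    fi
  done
  [ -n "$found" ] && printf '%s\n' "$found"
}

# Assumption: fall back to the current directory when NUTCH_HOME is unset.
NUTCH_HOME="${NUTCH_HOME:-.}"

# Switch to deploy mode only when a job file was actually found.
if NUTCH_JOB="$(find_nutch_job "$NUTCH_HOME")"; then
  local=false
fi
```

With this shape the check no longer depends on the exact job file name, so a build that produces apache-nutch-1.5.job is picked up in the same way as any other *nutch*.job file.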

Re: nutch-2.0 updatedb and parse commands

2012-06-19 Thread Lewis John Mcgibbney
Hi Alex, On Mon, Jun 18, 2012 at 8:11 PM, alx...@aim.com wrote: Hello, It seems to me that all options to updatedb command that nutch 1.4 has, have been removed in nutch-2.0. I would like to know if this was done purposefully or they will be added later? As you have noticed there are a

Re: nutch-2.0 updatedb and parse commands

2012-06-19 Thread alxsss
Hi Lewis, In the 1.X version there is a -noAdditions option to the updatedb command and an -adddays option to the generate command. How can something similar be done in the 2.X version? Here, http://wiki.apache.org/nutch/Nutch2Roadmap, it is stated: Modify code so that parser can generate multiple documents

Deleting file: urls from crawldb that give 404 status

2012-06-19 Thread webdev1977
I am having an issue with removing deleted file: urls on subsequent crawls. It stays with a status of db_unfetched and doesn't seem to want to use the 404 (db_gone) status. This means that I can't run solrclean to get rid of the old file: urls. I poked around in the protocol-file code and

Re: Unable to fetch contents from this particular URL

2012-06-19 Thread Sebastian Nagel
Hi Sandeep, "It just fetches text Analytical Cytometry." It looks like the property http.content.limit is still on its default (64 kB), which causes the document to be truncated right after "Analytical Cytometry". Unfortunately, truncated content is not logged to make it easier to locate the reason,

Re: Unable to fetch contents from this particular URL

2012-06-19 Thread Sandeep C R
Hi Sebastian, You are right. After setting it to -1 it worked. I am able to get all the text. Thank you. It would be really helpful if you or others could guide me on the relative URLs and the regular expression problem that I mentioned in the main post. Regards, Sandeep On Tue, Jun 19, 2012 at 4:28
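The truncation discussed in this thread can be illustrated with a small size check. This is a sketch only and not Nutch code: `would_truncate` is a hypothetical helper that takes a local file and a byte limit, and it treats a negative limit as "no limit", mirroring the -1 setting reported above. The value 65536 is used for the 64 kB default mentioned by Sebastian.

```shell
#!/usr/bin/env bash
# Hypothetical helper, not part of Nutch: succeed (exit 0) when a file is
# larger than the given byte limit. A negative limit means "no limit".
would_truncate() {
  local file="$1" limit="$2" size
  # Arithmetic expansion strips any padding that wc may print.
  size=$(( $(wc -c < "$file") ))
  [ "$limit" -ge 0 ] && [ "$size" -gt "$limit" ]
}
```

For a saved copy of a page, `would_truncate page.html 65536` succeeding would suggest the 64 kB default is too small for that document, while `would_truncate page.html -1` never succeeds.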

Re: Unable to fetch contents from this particular URL

2012-06-19 Thread Sebastian Nagel
Hi Sandeep, However, there is just relative url like this /research/cancerresearch/sharedresources/ac/expertise/pages/index.aspx You don't have to care about relative URLs. They are converted by Nutch to absolute URLs and URL filters operate exclusively on absolute URLs. all the pages which

Re: Unable to fetch contents from this particular URL

2012-06-19 Thread Sandeep C R
Hi Sebastian, Worked perfectly. Thank you again. Regards, Sandeep On Tue, Jun 19, 2012 at 5:20 PM, Sebastian Nagel wastl.na...@googlemail.com wrote: Hi Sandeep, However, there is just relative url like this /research/cancerresearch/sharedresources/ac/expertise/pages/index.aspx You

Nutch 1.5 - Error: Java heap space during MAP step of CrawlDb update

2012-06-19 Thread sidbatra
I'm using Nutch 1.5 to crawl 30 sites in deploy mode on Amazon Elastic Map Reduce with 30 m1.small machines with the following settings:

    Parameter                     Value
    HADOOP_JOBTRACKER_HEAPSIZE    512
    HADOOP_NAMENODE_HEAPSIZE      512
    HADOOP_TASKTRACKER_HEAPSIZE   256
    HADOOP_DATANODE_HEAPSIZE

Re: Nutch 1.5 - Error: Java heap space during MAP step of CrawlDb update

2012-06-19 Thread 谢彦博
Can you tell me how to unsubscribe from this mailing list? I get a lot of mail from nutch.apache.org, which is tiresome. On 2012-06-20, sidbatra siddharthaba...@gmail.com wrote: -Original Message- From: sidbatra siddharthaba...@gmail.com Sent: Wednesday, 2012-06-20 To: user user@nutch.apache.org Subject: Nutch 1.5 - Error: Java heap space