Good catch. Could you please open an issue on JIRA?
On 19 June 2012 00:09, sidbatra siddharthaba...@gmail.com wrote:
This turns out to be a genuine bug with an easy fix.
build.xml is configured to generate a job file titled
apache-nutch-1.5.job
but the deploy binary is still looking for
Alternatively, modify the bin/nutch script to make it more robust:
# NUTCH_JOB
# Use whichever *nutch*.job file exists (the last match wins).
for f in "$NUTCH_HOME"/*nutch*.job; do
  if [ -f "$f" ]; then
    local=false
    NUTCH_JOB=$f
  fi
done
On 19 June 2012 00:09, sidbatra siddharthaba...@gmail.com wrote:
This turns out to be a
Hi Alex,
On Mon, Jun 18, 2012 at 8:11 PM, alx...@aim.com wrote:
Hello,
It seems to me that all the options the updatedb command has in Nutch 1.4 have
been removed in Nutch 2.0. I would like to know whether this was done on purpose
or whether they will be added later.
As you have noticed there are a
Hi Lewis,
In the 1.x versions there is a -noAdditions option to the updatedb command and an
-adddays option to the generate command. How can something similar be done in the 2.x version?
Here, http://wiki.apache.org/nutch/Nutch2Roadmap it is stated
Modify code so that parser can generate multiple documents
I am having an issue with removing deleted file: URLs on subsequent crawls.
Such a URL stays at status db_unfetched and never moves to the
404 (db_gone) status. This means that I can't run solrclean to get rid of
the old file: URLs.
I poked around in the protocol-file code and
Hi Sandeep,
It just fetches text Analytical Cytometry.
It looks like the property http.content.limit
is still on its default (64kB) which causes the
document to be truncated right after Analytical
Cytometry.
Unfortunately, truncated content is not logged
to make it easier to locate the reason,
Hi Sebastian,
You are right. After setting it to -1 it worked. I am able to get all the
text. Thank you.
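For reference, the override discussed above could look roughly like the output of this sketch. It assumes the usual Nutch convention of overriding defaults in conf/nutch-site.xml inside the <configuration> element. The helper name content_limit_override is made up for illustration and is not part of Nutch.

```shell
#!/bin/sh
# Illustration only: print a property block that overrides http.content.limit.
# The helper name is hypothetical; paste the output inside <configuration>
# in conf/nutch-site.xml (assumed location of local overrides).

content_limit_override() {
  limit=${1:--1}   # default to -1, the value reported above to stop the truncation
  cat <<EOF
<property>
  <name>http.content.limit</name>
  <value>$limit</value>
</property>
EOF
}

content_limit_override -1
```

A negative value disables truncation entirely, so very large documents are then fetched in full. A larger positive byte limit may be a safer choice on crawls with big files.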
It would be really helpful if you or others could guide me on the relative URLs
and regular expression problem that I mentioned in the main post.
Regards,
Sandeep
On Tue, Jun 19, 2012 at 4:28
Hi Sandeep,
However, there is just relative url like this
/research/cancerresearch/sharedresources/ac/expertise/pages/index.aspx
You don't have to care about relative URLs. They are converted by Nutch
to absolute URLs and URL filters operate exclusively on absolute URLs.
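The point about filters seeing only absolute URLs can be illustrated with a small shell sketch. This is not Nutch code: the host www.example.org, the helper names resolve and accept, and the filter pattern are assumptions made up for illustration, and grep -E only approximates the Java regular expressions used by Nutch's regex URL filter. The sketch handles absolute and root-relative links only.

```shell
#!/bin/sh
# Illustration only: resolve a link against the page it was found on,
# then apply a filter pattern to the resulting absolute URL.

# resolve BASE LINK: prints the absolute form of LINK.
# Handles absolute links and root-relative links (starting with "/") only.
resolve() {
  base=$1
  link=$2
  case $link in
    *://*)
      printf '%s\n' "$link"
      ;;
    /*)
      scheme=${base%%://*}   # e.g. "http"
      rest=${base#*://}      # e.g. "www.example.org/dir/page.aspx"
      host=${rest%%/*}       # e.g. "www.example.org"
      printf '%s://%s%s\n' "$scheme" "$host" "$link"
      ;;
    *)
      echo "unsupported link form: $link" >&2
      return 1
      ;;
  esac
}

# accept URL: succeeds if the absolute URL matches the (hypothetical) pattern.
accept() {
  printf '%s\n' "$1" | grep -Eq '^http://www\.example\.org/research/'
}

page='http://www.example.org/some/dir/page.aspx'   # hypothetical page URL
link='/research/cancerresearch/sharedresources/ac/expertise/pages/index.aspx'

url=$(resolve "$page" "$link")
if accept "$url"; then
  echo "accepted: $url"
else
  echo "rejected: $url"
fi
```

The pattern is anchored on the scheme and host, so it would never match the bare relative path. That is why filter rules should be written against the absolute form.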
all the pages which
Hi Sebastian,
Worked perfectly. Thank you again.
Regards,
Sandeep
On Tue, Jun 19, 2012 at 5:20 PM, Sebastian Nagel wastl.na...@googlemail.com
wrote:
Hi Sandeep,
However, there is just relative url like this
/research/cancerresearch/sharedresources/ac/expertise/pages/index.aspx
You
I'm using Nutch 1.5 to crawl 30 sites in deploy mode on Amazon Elastic MapReduce
with 30 m1.small machines and the following settings:
Parameter                    Value
HADOOP_JOBTRACKER_HEAPSIZE   512
HADOOP_NAMENODE_HEAPSIZE     512
HADOOP_TASKTRACKER_HEAPSIZE  256
HADOOP_DATANODE_HEAPSIZE
Can you tell me how to unsubscribe from this mailing list?
I get a lot of mail from nutch.apache.org, which is tedious.
On 2012-06-20, sidbatra siddharthaba...@gmail.com wrote: -Original Message-
From: sidbatra siddharthaba...@gmail.com
Sent: Wednesday, June 20, 2012
To: user user@nutch.apache.org
Subject: Nutch 1.5 - Error: Java heap space