Hi:
I guess the URLs you mentioned are all directed to the same JSP or servlet;
apparently they all begin with
http://app02.laopdr.gov.la/ePortal/news/detail.action?id=10110&from=ePortal_NewsDetail_FromHome.
The difference is only the request_locale parameter.
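If those locale variants should be treated as a single page, one option is a
URL normalization rule. A minimal sketch for conf/regex-normalize.xml,
assuming the urlnormalizer-regex plugin is enabled (the pattern is
illustrative, not tested):
<regex>
  <!-- strip the request_locale query parameter so locale variants
       normalize to the same URL; a leftover ? or & would still need cleanup -->
  <pattern>([?&amp;])request_locale=[^&amp;]*&amp;?</pattern>
  <substitution>$1</substitution>
</regex>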
Hi,
I am using the trunk version of Nutch in a cluster of 5 EC2 nodes to crawl
the Internet. Each node has 7GB of memory and I have
configured mapred.child.java.opts to -Xmx3000m in hadoop-site.xml. When I
tried to update the crawldb of about 20M URLs with a crawl segment of
5M fetched URLs
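For reference, that heap setting looks like this in conf/hadoop-site.xml (a
sketch restating the value above; adjust -Xmx to your nodes):
<property>
  <name>mapred.child.java.opts</name>
  <value>-Xmx3000m</value>
  <description>JVM options for each map/reduce child task.</description>
</property>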
Define the following property in $nutch_home/conf/hadoop-site.xml in order to
change the tmp folder path:
<property>
  <name>hadoop.tmp.dir</name>
  <value>any-path/hadoop-${user.name}</value>
  <description>Hadoop temp directory</description>
</property>
2009/3/19 Armando Gonçalves mandinho...@gmail.com
When I try to
Hi all,
I'd like to run an intranet crawl with my own plugin on the domain
www.wikicfp.com
(http://www.wikicfp.com/cfp/call?conference=artificial%20intelligence&skip=1).
The problem is that Nutch doesn't find the important URLs, so Nutch can't
crawl further...
Hi,
I'm testing Nutch and until now everything works fine (OK, some hours spent
reading, testing, testing and testing, but that's normal).
I have a noob question: I have to crawl websites only within a ccTLD.
In crawl-urlfilter.txt, should I write it like this:
# accept hosts in MY.DOMAIN.NAME
I don't get it... why should this solve the problem?
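For illustration, the default accept rule in crawl-urlfilter.txt is
host-based; a hedged sketch of widening it to a whole ccTLD (.la is used
purely as an example TLD, untested):
# accept hosts in MY.DOMAIN.NAME (default rule)
+^http://([a-z0-9]*\.)*MY.DOMAIN.NAME/
# sketch: accept any host under the .la ccTLD instead
+^http://([a-z0-9-]+\.)+la/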
The current configuration is:
<property>
  <name>hadoop.tmp.dir</name>
  <value>/tmp/hadoop-${user.name}</value>
  <description>A base for other temporary directories.</description>
</property>
On Thu, Mar 19, 2009 at 7:35 AM, vishal vachhani
Comment out this line, -[?*!@=], in crawl-urlfilter.txt.
Alex.
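For clarity, that is the default rule in conf/crawl-urlfilter.txt which skips
URLs containing probable query-string characters; commenting it out lets
query URLs like the wikicfp one through. A sketch of the edited file:
# skip URLs containing certain characters as probable queries, etc.
# -[?*!@=]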
-Original Message-
From: MyD myd.ro...@googlemail.com
To: nutch-user@lucene.apache.org
Sent: Thu, 19 Mar 2009 6:14 am
Subject: Re: Nutch doesn't find all urls.. Any suggestion?
I may have to say that in the
Hi Edwin,
I had a similar issue, which I solved by capping the number of incoming links
taken into account when scoring a document. Another option is to use
the patch I submitted on JIRA (NUTCH-702), which does lazy instantiation of
metadata; that should save a lot of RAM (and CPU).
HTH
Thanks Julien.
I looked into nutch-default.xml and I can't find a directive that controls
the number of incoming links taken into account when scoring a document. I
can find db.max.inlinks, but it looks like it controls the invertlinks
process only. Could you tell me how to do it?
Thanks.
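For reference, overriding a property like this goes in conf/nutch-site.xml; a
sketch using db.max.inlinks purely as an illustration, with the caveat above
that it may only affect the invertlinks step:
<property>
  <name>db.max.inlinks</name>
  <value>100</value>
  <description>Cap on inlinks kept per URL; the value here is illustrative.</description>
</property>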
the URL is http://www.laopdr.gov.la/...
depth 15, topN 1200 ...
it seems I must put
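For reference, a crawl with those parameters would typically be launched like
this (a sketch of the classic crawl command; the urls seed directory and
output directory names are assumptions):
bin/nutch crawl urls -dir crawl -depth 15 -topN 1200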
That should work, but it seems weird. You know, from the seed URL you gave,
Nutch crawls outward from the seed, so the whole set of crawled pages is
effectively a tree whose root node is the seed URL. If you cannot reach those
two URLs from the seed URL yourself, Nutch cannot either.
yanky
2009/3/20 陈琛