Re: index web

2009-03-19 Thread yanky young
Hi: I guess the URLs you mentioned all point to the same JSP or servlet; apparently they all begin with http://app02.laopdr.gov.la/ePortal/news/detail.action (e.g. http://app02.laopdr.gov.la/ePortal/news/detail.action?id=10110&from=ePortal_NewsDetail_FromHome). The difference is the request_locale
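A sketch of what such URL pairs might look like, differing only in request_locale (the values "en" and "lo" below are assumptions for illustration, not taken from the thread):

    http://app02.laopdr.gov.la/ePortal/news/detail.action?id=10110&from=ePortal_NewsDetail_FromHome&request_locale=en
    http://app02.laopdr.gov.la/ePortal/news/detail.action?id=10110&from=ePortal_NewsDetail_FromHome&request_locale=lo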

Updatedb job failed with OutOfMemoryError

2009-03-19 Thread Edwin Chu
Hi, I am using the trunk version of Nutch on a cluster of 5 EC2 nodes to crawl the Internet. Each node has 7GB of memory and I have configured mapred.child.java.opts to -Xmx3000m in hadoop-site.xml. When I tried to update the crawldb of about 20M URLs with a crawl segment of 5M fetched
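For reference, the heap setting Edwin describes would look roughly like this in hadoop-site.xml (a sketch of the configuration he quotes, not his exact file):

    <property>
      <name>mapred.child.java.opts</name>
      <value>-Xmx3000m</value>
      <description>Java options passed to map/reduce child tasks (per-task heap cap).</description>
    </property>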

Re: MergeSegments Error.

2009-03-19 Thread vishal vachhani
Define the following property in $nutch_home/conf/hadoop-site.xml in order to change the tmp folder path:

    <property>
      <name>hadoop.tmp.dir</name>
      <value>any-path/hadoop-${user.name}</value>
      <description>Hadoop temp directory</description>
    </property>

2009/3/19 Armando Gonçalves mandinho...@gmail.com When I try to

Nutch doesn't find all urls.. Any suggestion?

2009-03-19 Thread MyD
Hi @ all, I'd like to run an intranet crawl with my own plugin on the domain www.wikicfp.com (e.g. http://www.wikicfp.com/cfp/call?conference=artificial%20intelligence&skip=1). The problem is that Nutch doesn't find the important URLs, so Nutch can't crawl further...

Crawling a ccTLD

2009-03-19 Thread Mauro Vignati
Hi, I'm testing Nutch and until now everything works fine (OK, some hours spent reading, testing, testing and testing, but that's normal). I have a noob question: I have to crawl websites only within a ccTLD. In crawl-urlfilter.txt should I write something like this: # accept hosts in MY.DOMAIN.NAME
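A minimal sketch of what such a filter might look like in crawl-urlfilter.txt, assuming a hypothetical ccTLD .xx (the pattern is illustrative, not from the thread; adjust to your TLD):

    # accept hosts in the .xx ccTLD
    +^http://([a-z0-9-]+\.)+xx/
    # skip everything else
    -.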

Re: MergeSegments Error.

2009-03-19 Thread Armando Gonçalves
I didn't get it... why should this solve the problem? The current configuration is:

    <property>
      <name>hadoop.tmp.dir</name>
      <value>/tmp/hadoop-${user.name}</value>
      <description>A base for other temporary directories.</description>
    </property>

On Thu, Mar 19, 2009 at 7:35 AM, vishal vachhani

Re: Nutch doesn't find all urls.. Any suggestion?

2009-03-19 Thread alxsss
Comment out this line, -[?*!@=], in crawl-urlfilter.txt (see the sketch below). Alex. -Original Message- From: MyD myd.ro...@googlemail.com To: nutch-user@lucene.apache.org Sent: Thu, 19 Mar 2009 6:14 am Subject: Re: Nutch doesn't find all urls.. Any suggestion? I may have to say that in the
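For context, the line Alex refers to is the default character filter shipped in crawl-urlfilter.txt; a sketch of the stock lines and the commented-out version that lets query-string URLs through (verify against your own copy):

    # skip URLs containing certain characters as probable queries, etc.
    -[?*!@=]

    # commented out to allow URLs with query strings:
    # -[?*!@=]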

Re: Updatedb job failed with OutOfMemoryError

2009-03-19 Thread Julien Nioche
Hi Edwin, I had a similar issue, which I solved by capping the number of incoming links taken into account when scoring a document. Another option is to use the patch I submitted on JIRA (NUTCH-702), which does lazy instantiation of metadata; that should save a lot of RAM (and CPU). HTH

Re: Updatedb job failed with OutOfMemoryError

2009-03-19 Thread Edwin Chu
Thanks Julien. I looked into nutch-default.xml and I can't find a directive that controls the number of incoming links taken into account when scoring a document. I can find db.max.inlinks, but it looks like it controls the invertlinks process only. Could you tell me how to do it?
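For reference, the property Edwin found looks roughly like this in nutch-default.xml (the value shown is the stock default as I recall it; verify against your copy):

    <property>
      <name>db.max.inlinks</name>
      <value>10000</value>
      <description>Maximum number of inlinks per URL to be kept in the LinkDb.</description>
    </property>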

Re: index web

2009-03-19 Thread 陈琛
Thanks.. the URL is http://www.laopdr.gov.la/... depth 15, topN 1200 ... it seems I must put
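A sketch of the one-shot crawl command those settings belong to (the urls and crawl directory names here are placeholders, not from the thread):

    bin/nutch crawl urls -dir crawl -depth 15 -topN 1200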

Re: index web

2009-03-19 Thread yanky young
That should work, but it seems weird. You know, starting from the seed URL you gave, Nutch crawls outward, and the whole set of crawled pages is actually a tree whose root node is the seed URL. If you cannot reach those two URLs from the seed URL yourself, Nutch cannot either. yanky 2009/3/20 陈琛