would use two Nutch directories and of course two different crawl directories,
because with Hadoop they will end up on the same HDFS (assuming you run in
distributed or pseudo-distributed mode).
2010/3/9, Pravin Karne :
>
> Can we share a Hadoop cluster between two Nutch instances?
> So there will be two Nutch ins
To: nutch-user@lucene.apache.org
Subject: Re: Two Nutch parallel crawl with two conf folder.
How parallel is parallel in your case?
Don't forget that Hadoop in distributed mode will serialize your jobs anyhow.
For the rest, why don't you create two Nutch directories and run things
totally independently?
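For example, a minimal sketch of two fully independent copies on the same cluster (the install paths and seed/output directory names below are only illustrative; the crawl options are the usual Nutch 1.0 ones):

    # first copy, its own conf/ and its own output directory on HDFS
    cd /opt/nutch-A
    bin/nutch crawl urls-A -dir crawl-A -depth 10 -topN 1000

    # second copy, different conf/ and output directory, same cluster
    cd /opt/nutch-B
    bin/nutch crawl urls-B -dir crawl-B -depth 10 -topN 1000

Both submit to the same JobTracker, so the jobs will still run one after the other.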
2010/3/
Hi guys, any pointers on the following?
Your help will be highly appreciated.
Thanks
-Pravin
-----Original Message-----
From: Pravin Karne
Sent: Friday, March 05, 2010 12:57 PM
To: nutch-user@lucene.apache.org
Subject: Two Nutch parallel crawl with two conf folder.
Hi,
I want to do two parallel Nutch crawls with two conf folders.
I am using the crawl command to do this. I have two separate conf folders; all
files in conf are the same except crawl-urlfilter.txt, which contains
different filters (domain filters) in each.
E.g. the 1st conf has -
+.^http
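For illustration only (these domains are made up, not the poster's actual filters), the two files could differ just in the accept rule, following the pattern used in the stock crawl-urlfilter.txt:

    # conf1/crawl-urlfilter.txt
    +^http://([a-z0-9]*\.)*example.com/

    # conf2/crawl-urlfilter.txt
    +^http://([a-z0-9]*\.)*example.org/

Each crawl then has to be started with its own conf directory on the classpath.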
Hi,
Sitemap.xml contains URL info such as "updatefrequency" and "lastmodify".
So, while fetching the URLs, can we update the CrawlDatum with the above values?
Then a long-running crawl would have updated information every time, with no need
to re-crawl for updated links.
By default this value is 30 days (my underst
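The 30-day figure comes from db.fetch.interval.default in nutch-default.xml (the value is in seconds), which can be overridden per instance in nutch-site.xml, e.g.:

    <property>
      <name>db.fetch.interval.default</name>
      <value>2592000</value>
      <description>The default number of seconds between re-fetches of a page (30 days).</description>
    </property>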
Hi,
I am trying cookie-based authentication for Nutch fetching.
I want to fetch one page which requires login credentials.
I have valid cookies for these credentials. If I use this cookie in my stand-alone
application I get an authenticated response (i.e. the required web page).
But when I am add
Hi,
Nutch is not crawling all outlinks even with the following property set:

  db.max.outlinks.per.page = -1
  (The maximum number of outlinks that we'll process for a page.
  If this value is nonnegative (>=0), at most db.max.outlinks.per.page outlinks
  will be processed for a page; otherwise, all outlinks will be processed.)
Hi,
I am using Nutch to crawl a particular site, but I found that Nutch is not
crawling all links from every page.
Is there any tuning parameter for Nutch to crawl all links?
Thanks in advance
-Pravin
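As a hedged starting point (the property names are the standard ones from nutch-default.xml; the values are just an illustration), check the outlink cap and the URL filters in the conf you are crawling with:

    <property>
      <name>db.max.outlinks.per.page</name>
      <value>-1</value>   <!-- default is 100; -1 means process all outlinks -->
    </property>
    <property>
      <name>db.ignore.external.links</name>
      <value>false</value>   <!-- true silently drops links leaving the seed host -->
    </property>

Links can also be dropped by crawl-urlfilter.txt / regex-urlfilter.txt before they ever reach the crawl db.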
Hi, I have the same problem.
I have the following regex-urlfilter.txt:
# skip file: ftp: and mailto: urls
-^(file|ftp|mailto):
# skip image and other suffixes we can't yet parse
-\.(xd|AXD|bmp|BMP|class|CLASS|css|CSS|csv|CSV|dmg|DMG|doc|DOC|eps|EPS|exe|EXE|gif|GIF|gz|GZ|ico|ICO|ics|ICS|jpeg|JPEG|jpg|JPG|j
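The lines that usually matter for "missing" outlinks come after the suffix rule; in the stock regex-urlfilter.txt they look roughly like this (shown here as a reference, not as a copy of the poster's truncated file):

    # skip URLs containing certain characters as probable queries, etc.
    -[?*!@=]

    # skip URLs with slash-delimited segment that repeats 3+ times, to break loops
    -.*(/[^/]+)/[^/]+\1/[^/]+\1/

    # accept anything else
    +.

If the -[?*!@=] rule is present, every outlink with a query string is silently dropped, which is a very common reason for Nutch appearing to skip links.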
Hi,
I am using Nutch 1.0 with a 10-node cluster.
I have crawled 1000 sites to a depth of 10.
I got the following cluster summary:
5538 files and directories, 4556 blocks = 10094 total. Heap Size is 50 MB /
888.94 MB (5%)
Configured Capacity : 140 TB
DFS Used :