RE: Two parallel Nutch crawls with two conf folders.

2010-03-09 Thread Pravin Karne
I would use two Nutch directories and, of course, two different crawl directories, because with Hadoop they will end up on the same HDFS (assuming you run in distributed or pseudo-distributed mode). 2010/3/9, Pravin Karne: > Can we share a Hadoop cluster between two Nutch instances? > So there will be two Nutch ins...
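
A minimal sketch of that setup, assuming two separate Nutch installs driven by the classic bin/nutch crawl command (all paths and seed directories below are hypothetical):

    # Two independent installs, each with its own conf/, sharing one Hadoop cluster
    (cd /opt/nutch-a && bin/nutch crawl urls-a -dir crawl-a -depth 10) &
    (cd /opt/nutch-b && bin/nutch crawl urls-b -dir crawl-b -depth 10) &
    wait

Since both jobs land on the same HDFS, the -dir arguments must differ, or the two crawls will clobber each other's crawldb and segments.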

RE: Two parallel Nutch crawls with two conf folders.

2010-03-08 Thread Pravin Karne
To: nutch-user@lucene.apache.org Subject: Re: Two parallel Nutch crawls with two conf folders. How parallel is parallel in your case? Don't forget that Hadoop in distributed mode will serialize your jobs anyhow. For the rest, why don't you create two Nutch directories and run things totally independently? 2010/3/...

RE: Two parallel Nutch crawls with two conf folders.

2010-03-08 Thread Pravin Karne
Hi guys, any pointers on the following? Your help will be highly appreciated. Thanks, -Pravin -----Original Message----- From: Pravin Karne Sent: Friday, March 05, 2010 12:57 PM To: nutch-user@lucene.apache.org Subject: Two parallel Nutch crawls with two conf folders. Hi, I want to do two parallel Nutch crawls...

Two parallel Nutch crawls with two conf folders.

2010-03-04 Thread Pravin Karne
Hi, I want to do two parallel Nutch crawls with two conf folders. I am using the crawl command to do this. I have two separate conf folders; all files in conf are the same except crawl-urlfilter.txt. In this file we have different filters (domain filters), e.g. the 1st conf has +^http...
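
One way this might be wired up from a single install, assuming the NUTCH_CONF_DIR variable that the bin/nutch script reads (directory and domain names are placeholders):

    # conf1/crawl-urlfilter.txt accepts only the first domain:
    #   +^http://([a-z0-9]*\.)*domain-one.com/
    # conf2/crawl-urlfilter.txt accepts only the second:
    #   +^http://([a-z0-9]*\.)*domain-two.com/
    NUTCH_CONF_DIR=conf1 bin/nutch crawl urls1 -dir crawl1 -depth 10
    NUTCH_CONF_DIR=conf2 bin/nutch crawl urls2 -dir crawl2 -depth 10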

How to add sitemap attributes to the crawldb while fetching

2010-02-18 Thread Pravin Karne
Hi, sitemap.xml contains per-URL info on update frequency ("changefreq") and last modification time ("lastmod"). So, while fetching the URLs, can we update the CrawlDatum with those values? Then a long-running crawl would have up-to-date information every time, with no need to re-crawl for updated links. By default this value is 30 days (my understanding)...
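
The 30-day figure being recalled is most likely db.fetch.interval.default, which is expressed in seconds (2592000 s = 30 days) and can be overridden inside the <configuration> element of conf/nutch-site.xml; a minimal sketch:

    <property>
      <name>db.fetch.interval.default</name>
      <value>2592000</value>
      <description>Default interval between re-fetches of a page, in seconds.</description>
    </property>

As far as the 1.0 codebase goes, the stock scheduler does not consult sitemap change-frequency values; honouring them per URL would take a custom FetchSchedule implementation.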

Cookies issue in Nutch...

2010-02-15 Thread Pravin Karne
Hi, I am trying cookie-based authentication for Nutch fetching. I want to fetch one page that requires login credentials, and I have valid cookies for those credentials. If I use this cookie in my standalone application, I get an authenticated response (the required web page). But when I add...
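
A quick sanity check outside Nutch, assuming a curl-style test (the URL and cookie value are placeholders):

    # Replay the saved cookie and inspect the response
    curl -v -b "JSESSIONID=abc123" http://example.com/protected/page.html

If this returns the authenticated page while Nutch's fetch does not, the problem lies in how the Cookie header is injected into Nutch's HTTP plugin, not in the cookie itself.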

Nutch is not crawling all outlinks

2009-09-22 Thread Pravin Karne
Hi, Nutch is not crawling all outlinks even with the following property: db.max.outlinks.per.page = -1 ("The maximum number of outlinks that we'll process for a page. If this value is nonnegative (>=0), at most db.max.outlinks.per.page outlinks will be processed for a page; otherwise, all outlinks..."
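
For reference, the override has to live in the configuration the job actually reads, normally inside <configuration> in conf/nutch-site.xml (nutch-default.xml only supplies defaults):

    <property>
      <name>db.max.outlinks.per.page</name>
      <value>-1</value>
      <description>-1 means process every outlink found on a page.</description>
    </property>

If the value is already -1 and links are still missing, the page may simply be truncated at fetch time; see http.content.limit under the next thread.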

Why Nutch is not crawling all links from a web page

2009-09-22 Thread Pravin Karne
Hi, I am using Nutch to crawl a particular site, but I found that Nutch is not crawling all links from every page. Is there any tuning parameter for Nutch to crawl all links? Thanks in advance, -Pravin
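
One common culprit, offered as a guess rather than a diagnosis: pages larger than http.content.limit (65536 bytes by default in this era of Nutch) are truncated before parsing, so any outlinks in the cut-off portion are never seen. A sketch of the override for conf/nutch-site.xml:

    <property>
      <name>http.content.limit</name>
      <value>-1</value>
      <description>-1 disables truncation of fetched content.</description>
    </property>

The URL filters (crawl-urlfilter.txt for the crawl command) and db.max.outlinks.per.page are the other usual suspects.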

RE: A few questions about crawl-urlfilter.txt

2009-07-16 Thread Pravin Karne
Hi, I have the same problem. I have the following regex-urlfilter.txt:

    # skip file: ftp: and mailto: urls
    -^(file|ftp|mailto):
    # skip image and other suffixes we can't yet parse
    -\.(xd|AXD|bmp|BMP|class|CLASS|css|CSS|csv|CSV|dmg|DMG|doc|DOC|eps|EPS|exe|EXE|gif|GIF|gz|GZ|ico|ICO|ics|ICS|jpeg|JPEG|jpg|JPG|j...
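
Worth remembering when debugging these files: rules are evaluated top to bottom and the first match wins, and a URL that matches no rule at all is dropped, which is why the stock files end with a catch-all. A minimal sketch of a single-domain filter (the domain is a placeholder):

    # skip URLs containing characters used in queries and sessions
    -[?*!@=]
    # accept everything under one host, subdomains included
    +^http://([a-z0-9]*\.)*example.com/
    # reject whatever the rules above did not accept
    -.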

What is Non DFS Used in the cluster summary? How to delete Non DFS Used data

2009-07-06 Thread Pravin Karne
Hi, I am using Nutch 1.0 with a 10-node cluster. I have crawled 1000 sites at depth 10. I got the following cluster summary: 5538 files and directories, 4556 blocks = 10094 total. Heap Size is 50 MB / 888.94 MB (5%). Configured Capacity: 140 TB. DFS Used: ...
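
"Non DFS Used" is derived arithmetic rather than a directory you can delete through HDFS: Non DFS Used = Configured Capacity - DFS Used - DFS Remaining, i.e. datanode disk space consumed by anything outside HDFS block storage, such as logs and MapReduce spill files under hadoop.tmp.dir. The per-node breakdown is visible with:

    # Capacity report for each datanode (Hadoop 0.19/0.20-era command)
    bin/hadoop dfsadmin -report

Reclaiming that space means cleaning the local directories on the nodes themselves (old logs, leftover task temp files); there is nothing to remove from HDFS.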
