would use two Nutch directories and of course two different crawl directories,
because with Hadoop they will end up on the same HDFS (assuming you run in
distributed or pseudo-distributed mode).
2010/3/9, Pravin Karne :
>
> Can we share a Hadoop cluster between two Nutch instances?
> So there will be two Nutch ins
To: nutch-user@lucene.apache.org
Subject: Re: Two Nutch parallel crawl with two conf folder.
How parallel is parallel in your case?
Don't forget that Hadoop in distributed mode will serialize your jobs anyhow.
For the rest, why don't you create two Nutch directories and run things
totally independently?
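For example, a minimal sketch of two fully independent copies on the same cluster (the install paths and seed/output directory names below are only illustrative; the crawl options are the usual Nutch 1.0 ones):

    # first copy, its own conf/ and its own output directory on HDFS
    cd /opt/nutch-A
    bin/nutch crawl urls-A -dir crawl-A -depth 10 -topN 1000

    # second copy, different conf/ and output directory, same cluster
    cd /opt/nutch-B
    bin/nutch crawl urls-B -dir crawl-B -depth 10 -topN 1000

Both submit to the same JobTracker, so the jobs will still run one after the other.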
2010/3/
Hi guys, any pointers on the following?
Your help will be highly appreciated.
Thanks
-Pravin
-----Original Message-----
From: Pravin Karne
Sent: Friday, March 05, 2010 12:57 PM
To: nutch-user@lucene.apache.org
Subject: Two Nutch parallel crawl with two conf folder.
Hi,
I want to do two parallel Nutch crawls with two conf folders.
I am using the crawl command to do this. I have two separate conf folders; all
files in conf are the same except crawl-urlfilter.txt, which contains
different filters (domain filters) in each.
E.g. the 1st conf has -
+.^http
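For illustration only (these domains are made up, not the poster's actual filters), the two files could differ just in the accept rule, following the pattern used in the stock crawl-urlfilter.txt:

    # conf1/crawl-urlfilter.txt
    +^http://([a-z0-9]*\.)*example.com/

    # conf2/crawl-urlfilter.txt
    +^http://([a-z0-9]*\.)*example.org/

Each crawl then has to be started with its own conf directory on the classpath.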
Hi,
Sitemap.xml contains URL info such as "updatefrequency" and "lastmodify".
So, while fetching the URLs, can we update the CrawlDatum with the above values?
Then a long-running crawl would have updated information every time, with no need
to re-crawl for updated links.
By default this value is 30 days (my underst
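The 30-day figure comes from db.fetch.interval.default in nutch-default.xml (the value is in seconds), which can be overridden per instance in nutch-site.xml, e.g.:

    <property>
      <name>db.fetch.interval.default</name>
      <value>2592000</value>
      <description>The default number of seconds between re-fetches of a page (30 days).</description>
    </property>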
Hi,
I am trying cookie-based authentication for Nutch fetching.
I want to fetch one page which requires login credentials.
I have valid cookies for these credentials. If I use this cookie in my stand-alone
application I get an authenticated response (i.e. the required web page).
But when I am add
Hi,
Nutch is not crawling all outlinks even with the following property set:

  db.max.outlinks.per.page = -1
  (The maximum number of outlinks that we'll process for a page.
  If this value is nonnegative (>=0), at most db.max.outlinks.per.page outlinks
  will be processed for a page; otherwise, all outlinks will be processed.)
Hi,
I am using Nutch to crawl a particular site, but I found that Nutch is not
crawling all links from every page.
Is there any tuning parameter for Nutch to crawl all links?
Thanks in advance
-Pravin
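As a hedged starting point (the property names are the standard ones from nutch-default.xml; the values are just an illustration), check the outlink cap and the URL filters in the conf you are crawling with:

    <property>
      <name>db.max.outlinks.per.page</name>
      <value>-1</value>   <!-- default is 100; -1 means process all outlinks -->
    </property>
    <property>
      <name>db.ignore.external.links</name>
      <value>false</value>   <!-- true silently drops links leaving the seed host -->
    </property>

Links can also be dropped by crawl-urlfilter.txt / regex-urlfilter.txt before they ever reach the crawl db.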
Hi, I have the same problem.
I have the following regex-urlfilter.txt:
# skip file: ftp: and mailto: urls
-^(file|ftp|mailto):
# skip image and other suffixes we can't yet parse
-\.(xd|AXD|bmp|BMP|class|CLASS|css|CSS|csv|CSV|dmg|DMG|doc|DOC|eps|EPS|exe|EXE|gif|GIF|gz|GZ|ico|ICO|ics|ICS|jpeg|JPEG|jpg|JPG|j
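The lines that usually matter for "missing" outlinks come after the suffix rule; in the stock regex-urlfilter.txt they look roughly like this (shown here as a reference, not as a copy of the poster's truncated file):

    # skip URLs containing certain characters as probable queries, etc.
    -[?*!@=]

    # skip URLs with slash-delimited segment that repeats 3+ times, to break loops
    -.*(/[^/]+)/[^/]+\1/[^/]+\1/

    # accept anything else
    +.

If the -[?*!@=] rule is present, every outlink with a query string is silently dropped, which is a very common reason for Nutch appearing to skip links.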
Hi,
I am using Nutch 1.0 with a 10-node cluster.
I have crawled 1000 sites to a depth of 10.
I got the following cluster summary:
5538 files and directories, 4556 blocks = 10094 total. Heap Size is 50 MB /
888.94 MB (5%)
Configured Capacity : 140 TB
DFS Used :