Re: Index update and Google Dance

2005-11-09 Thread Andrzej Bialecki
Jack Tang wrote: Hi Andrzej, In the document, Michael said: I'd strongly recommend using the system with a replication rate of 3 copies, 2 minimum. Desired replication can be set in the Nutch config file using the ndfs.replication property, and the MIN_REPLICATION constant is located in ndfs/FSNamesystem.java
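
The replication setting quoted above might look roughly like the following in a Nutch configuration file. This is a minimal sketch, not a verified configuration: only the property name ndfs.replication and the value 3 come from the message. The description text is illustrative, and the file this belongs in (conventionally a site-specific override file) and its enclosing root element are assumptions to check against your Nutch version.

```xml
<!-- Sketch only: the property name and value are taken from the message above.
     The description wording is an assumption, as is the enclosing
     configuration file and root element, which are not shown here. -->
<property>
  <name>ndfs.replication</name>
  <value>3</value>
  <description>Desired number of copies of each NDFS block.</description>
</property>
```

The minimum of 2 copies mentioned in the message is not a config property; according to the message it is the MIN_REPLICATION constant in ndfs/FSNamesystem.java, so changing it would mean editing and rebuilding the source.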

Re: Index update and Google Dance

2005-11-09 Thread Stefan Groschupf
and three copies of chunks are distributed on the slaves. If slave 1 is 90% busy, and 2 is 80% busy, 3 is idle. How does NFS do in this case? Actually you have to do that manually, but there will be an automatic solution later. Or could you tell me where should I start learning? The

Re: Index update and Google Dance

2005-11-09 Thread Jack Tang
Thanks for your explanation, Andrzej. I am going to read some of the NFS source code and ask smarter questions later. Thanks again. Regards /Jack On 11/9/05, Andrzej Bialecki [EMAIL PROTECTED] wrote: Jack Tang wrote: Hi Andrzej, In the document, Michael said: I'd strongly recommend using the system

Re: [Nutch-dev] [jira] Resolved: (NUTCH-124) protocol-httpclient does not follow redirects when fetching robots.txt

2005-11-09 Thread Massimo Miccoli
There's a problem with that solution. For some sites, protocol-httpclient now generates a SEVERE "Narrowly avoided an infinite loop in execute" message, so the fetcher exits and only some pages are fetched before the SEVERE message. I don't know a solution; for now I have switched back to protocol-http.

Re: Lucene or Nutch

2005-11-09 Thread Erik Hatcher
Yes, Lucene is the best fit for what you're after. Nutch is built on Lucene, and adds web crawling on top. You don't need a web crawler, so using Lucene directly is the best fit - of course you'll have to write code to integrate Lucene. Erik On 9 Nov 2005, at 08:48, Klaus wrote:

Re: Lucene or Nutch

2005-11-09 Thread Jérôme Charron
Yes, Lucene is the best fit for what you're after. Nutch is built on Lucene, and adds web crawling on top. You don't need a web crawler, so using Lucene directly is the best fit - of course you'll have to write code to integrate Lucene. Erik, I was thinking about it for a while, but don't

RE: Lucene or Nutch

2005-11-09 Thread Rajan, Renuka
Hello All My question is kind of related to the email below. I was exploring the option to full-text index a fairly large database that's 40G in size (data alone minus indices etc). This data resides in Oracle which has its own full text indexing engine. Does anyone have a recommendation

protocol-http versus protocol-httpclient

2005-11-09 Thread Doug Cutting
I was recently benchmarking fetching at a site with lots of bandwidth, and it seemed to me that protocol-http is capable of faster crawling than protocol-httpclient. So I don't think we should discard protocol-http just yet. But there's a lot of duplicate code between these, which is

Request for info regarding filesystem based index.

2005-11-09 Thread Mike Reynols
Here's the problem: I need to get the Nutch engine running on a collection of xml documents that I have (containing news stories). The files are named in the following manner: example.xml.52908 example.xml.52909 example.xml.52910 example.xml.52911 ... example.xml.53365 example.xml.53366

Re: mapred bug -- bad part calculation?

2005-11-09 Thread Paul Baclace
Rod Taylor wrote: The attached patches for Generator.java and Injector.java allow a specific temporary directory to be specified. This gives Nutch the full path to these temporary directories and seems to fix the No input directories issue when using a local filesystem with multiple task

Re: Lucene or Nutch

2005-11-09 Thread Doug Cutting
Jérôme Charron wrote: In fact, I think it could be a good idea to move the nutch language identifier core code to a standalone library or to lucene code. Does it make sense? What do you think about it? What is the best solution (standalone vs lucene)? One could put it in the lucene contrib

Re: Index update and Google Dance

2005-11-09 Thread Doug Cutting
Jack Tang wrote: Below is google architecture in my brain: DataNode A Master DataNode B GoogleCrawler DataNode C .. GoogleCrawler is kept running all the time. One day, it gets a fetchlist from DataNode A, crawls all pages and

Re: [Nutch-dev] [jira] Resolved: (NUTCH-124) protocol-httpclient does not follow redirects when fetching robots.txt

2005-11-09 Thread Doug Cutting
Massimo Miccoli wrote: There's a problem with that solution. For some sites, protocol-httpclient now generates a SEVERE "Narrowly avoided an infinite loop in execute" message, so the fetcher exits and only some pages are fetched before the SEVERE message. I don't know a solution; for now I switch back

Re: Lucene or Nutch

2005-11-09 Thread Andrzej Bialecki
Doug Cutting wrote: Jérôme Charron wrote: In fact, I think it could be a good idea to move the nutch language identifier core code to a standalone library or to lucene code. Does it make sense? What do you think about it? What is the best solution (standalone vs lucene)? One could put it

Re: protocol-http versus protocol-httpclient

2005-11-09 Thread Matt Kangas
+1 I've been planning to switch my crawler over to use protocol-httpclient, but haven't got there yet. Interesting that there seems to be a performance impact with the new plugin. (In my crawl setup, I override the default HTTP plugin so I can modify HTML content before it is written to

[jira] Closed: (NUTCH-109) Nutch - Fetcher - Performance Test - new Protocol-HTTPClient-Innovation

2005-11-09 Thread Andrzej Bialecki (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-109?page=all ] Andrzej Bialecki closed NUTCH-109. Resolution: Invalid. Proposed improvement is not real, and comes from different config settings. Proposed implementation uses a component with

Re: Distributed nutch

2005-11-09 Thread Paul Baclace
In addition to Stefan Groschupf's detailed references, here are some short, high-level answers to your questions: Rozina Sorathia wrote: 1. What is distributed Nutch? Nutch is a distributed Lucene with large-scale web crawling. 2. How does distributed Nutch work? Modeled after Google's

Re: Index update and Google Dance

2005-11-09 Thread Jack Tang
Hi Doug On 11/10/05, Doug Cutting [EMAIL PROTECTED] wrote: Jack Tang wrote: Below is google architecture in my brain: DataNode A Master DataNode B GoogleCrawler DataNode C .. GoogleCrawler is kept running all

RE: protocol-http versus protocol-httpclient

2005-11-09 Thread Fuad Efendi
Doug Cutting wrote: ... protocol-http is capable of faster crawling than protocol-httpclient. So I don't think we should discard protocol-http just yet. What do others think? I think: the HttpClient-based [protocol-httpclient] uses its own threads. [protocol-http] does not create threads. We should

Re: protocol-http versus protocol-httpclient

2005-11-09 Thread Ken Krugler
I was recently benchmarking fetching at a site with lots of bandwidth, and it seemed to me that protocol-http is capable of faster crawling than protocol-httpclient. So I don't think we should discard protocol-http just yet. But there's a lot of duplicate code between these, which is

RE: Distributed nutch

2005-11-09 Thread Rozina Sorathia
Thanks for the explanation :) -Original Message- From: Paul Baclace [mailto:[EMAIL PROTECTED] Sent: Thursday, November 10, 2005 5:18 AM To: nutch-dev@lucene.apache.org Subject: Re: Distributed nutch In addition to Stefan Groschupf's detailed references, here are some short, high-level

What is suitable environment?

2005-11-09 Thread KAAS INFOTECH
Hi All, I am new to Nutch. I have downloaded the latest nutch-0.7.1. I have Microsoft Windows installed on my PC with Java home set. I learned that Cygwin is required to run Nutch. Why can't I run Nutch directly from Windows? If so, do I need to change to Linux (or any flavor of Unix)? How can we create test