date:20051109

Re: Index update and Google Dance

2005-11-09 Thread Andrzej Bialecki

Jack Tang wrote: Hi Andrzej, in the document, Michael said: I'd strongly recommend using the system with a replication rate of 3 copies, 2 minimum. Desired replication can be set in the nutch config file using the ndfs.replication property, and the MIN_REPLICATION constant is located in ndfs/FSNamesystem.java

Re: Index update and Google Dance

2005-11-09 Thread Stefan Groschupf

and three copies of chunks are distributed on the slaves. If slave 1 is 90% busy, and 2 is 80% busy, 3 is idle. How does NFS do in this case? Actually you have to do that manually, but there will be an automatic solution later. Or could you tell me where should I start learning? The

Re: Index update and Google Dance

2005-11-09 Thread Jack Tang

Thanks for your explanation, Andrzej. I am going to read some NFS source code and ask smarter questions later. Thanks again. Regards, /Jack On 11/9/05, Andrzej Bialecki [EMAIL PROTECTED] wrote: Jack Tang wrote: Hi Andrzej, in the document, Michael said: I'd strongly recommend using the system

Re: [Nutch-dev] [jira] Resolved: (NUTCH-124) protocol-httpclient does not follow redirects when fetching robots.txt

2005-11-09 Thread Massimo Miccoli

There's a problem with that solution. For some sites, protocol-httpclient now generates a SEVERE "Narrowly avoided an infinite loop in execute" message, so the fetcher exits and only some pages are fetched before the SEVERE message. I don't know a solution; for now I have switched back to protocol-http.

Re: Lucene or Nutch

2005-11-09 Thread Erik Hatcher

Yes, Lucene is the best fit for what you're after. Nutch is built on Lucene, and adds web crawling on top. You don't need a web crawler, so using Lucene directly is the best fit - of course you'll have to write code to integrate Lucene. Erik On 9 Nov 2005, at 08:48, Klaus wrote:

Re: Lucene or Nutch

2005-11-09 Thread Jérôme Charron

Yes, Lucene is the best fit for what you're after. Nutch is built on Lucene, and adds web crawling on top. You don't need a web crawler, so using Lucene directly is the best fit - of course you'll have to write code to integrate Lucene. Erik, I was thinking about it for a while, but don't

RE: Lucene or Nutch

2005-11-09 Thread Rajan, Renuka

Hello All My question is kind of related to the email below. I was exploring the option to full-text index a fairly large database that's 40G in size (data alone minus indices etc). This data resides in Oracle which has its own full text indexing engine. Does anyone have a recommendation

protocol-http versus protocol-httpclient

2005-11-09 Thread Doug Cutting

I was recently benchmarking fetching at a site with lots of bandwidth, and it seemed to me that protocol-http is capable of faster crawling than protocol-httpclient. So I don't think we should discard protocol-http just yet. But there's a lot of duplicate code between these, which is

Request for info regarding filesystem based index.

2005-11-09 Thread Mike Reynols

Here's the problem: I need to get the Nutch engine running on a collection of xml documents that I have (containing news stories). The files are named in the following manner: example.xml.52908 example.xml.52909 example.xml.52910 example.xml.52911 ... example.xml.53365 example.xml.53366

Re: mapred bug -- bad part calculation?

2005-11-09 Thread Paul Baclace

Rod Taylor wrote: The attached patches for Generator.java and Injector.java allow a specific temporary directory to be specified. This gives Nutch the full path to these temporary directories and seems to fix the No input directories issue when using a local filesystem with multiple task

Re: Lucene or Nutch

2005-11-09 Thread Doug Cutting

Jérôme Charron wrote: In fact, I think it could be a good idea to move the nutch language identifier core code to a standalone library or to lucene code. Does it make sense? What do you think about it? What is the best solution (standalone vs lucene)? One could put it in the lucene contrib

Re: Index update and Google Dance

2005-11-09 Thread Doug Cutting

Jack Tang wrote: Below is the Google architecture in my brain: DataNode A Master DataNode B GoogleCrawler DataNode C .. GoogleCrawler is kept running all the time. One day, it gets a fetchlist from DataNode A, crawls all pages and

Re: [Nutch-dev] [jira] Resolved: (NUTCH-124) protocol-httpclient does not follow redirects when fetching robots.txt

2005-11-09 Thread Doug Cutting

Massimo Miccoli wrote: There's a problem with that solution. For some sites, protocol-httpclient now generates a SEVERE "Narrowly avoided an infinite loop in execute" message, so the fetcher exits and only some pages are fetched before the SEVERE message. I don't know a solution; for now I have switched back

Re: Lucene or Nutch

2005-11-09 Thread Andrzej Bialecki

Doug Cutting wrote: Jérôme Charron wrote: In fact, I think it could be a good idea to move the nutch language identifier core code to a standalone library or to lucene code. Does it make sense? What do you think about it? What is the best solution (standalone vs lucene)? One could put it

Re: protocol-http versus protocol-httpclient

2005-11-09 Thread Matt Kangas

+1 I've been planning to switch my crawler over to use protocol-httpclient, but haven't got there yet. Interesting that there seems to be a performance impact with the new plugin. (In my crawl setup, I override the default HTTP plugin so I can modify HTML content before it is written to

[jira] Closed: (NUTCH-109) Nutch - Fetcher - Performance Test - new Protocol-HTTPClient-Innovation

2005-11-09 Thread Andrzej Bialecki (JIRA)

[ http://issues.apache.org/jira/browse/NUTCH-109?page=all ] Andrzej Bialecki closed NUTCH-109: --- Resolution: Invalid Proposed improvement is not real, and comes from different config. settings. Proposed implementation uses a component with

Re: Distributed nutch

2005-11-09 Thread Paul Baclace

In addition to Stefan Groschupf's detailed references, here are some short, high-level answers to your questions: Rozina Sorathia wrote: 1. What is Distributed nutch Nutch is a distributed Lucene with large scale web crawling. 2. How nutch distributed works? Modeled after Google's

Re: Index update and Google Dance

2005-11-09 Thread Jack Tang

Hi Doug On 11/10/05, Doug Cutting [EMAIL PROTECTED] wrote: Jack Tang wrote: Below is google architecture in my brain: DataNode A Master DataNode B GoogleCrawler DataNode C .. GoogleCrawler is kept running all

RE: protocol-http versus protocol-httpclient

2005-11-09 Thread Fuad Efendi

Doug Cutting wrote: ... protocol-http is capable of faster crawling than protocol-httpclient. So I don't think we should discard protocol-http just yet. What do others think? I think: HttpClient-based [protocol-httpclient] uses its own threads. [protocol-http] does not create threads. We should

Re: protocol-http versus protocol-httpclient

2005-11-09 Thread Ken Krugler

I was recently benchmarking fetching at a site with lots of bandwidth, and it seemed to me that protocol-http is capable of faster crawling than protocol-httpclient. So I don't think we should discard protocol-http just yet. But there's a lot of duplicate code between these, which is

RE: Distributed nutch

2005-11-09 Thread Rozina Sorathia

Thanks for the explanation :) -Original Message- From: Paul Baclace [mailto:[EMAIL PROTECTED] Sent: Thursday, November 10, 2005 5:18 AM To: nutch-dev@lucene.apache.org Subject: Re: Distributed nutch In addition to Stefan Groschupf's detailed references, here are some short, high-level

What is suitable environment?

2005-11-09 Thread KAAS INFOTECH

Hi All, I am new to Nutch. I have downloaded the latest nutch-0.7.1. I have Microsoft Windows installed on my PC with Java home set. I came to know that Cygwin is required to run Nutch. Why can't I run Nutch from Windows? If so, do I need to change to Linux (or any flavor of Unix)? How can we create test

22 matches
