Jack Tang wrote:
Hi Andrzej
In the document, Michael said:
I'd strongly recommend using the system with a replication rate of 3
copies, 2 minimum. Desired replication can be set in nutch config file
using ndfs.replication property, and MIN_REPLICATION constant is
located in ndfs/FSNamesystem.java
So three copies of each chunk are distributed across the slaves. If slave 1
is 90% busy, slave 2 is 80% busy, and slave 3 is idle, how does NDFS behave in this
case?
Actually you have to do that manually, but there will be an
automatic solution later.
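For reference, the replication setting Michael mentions is an ordinary Nutch configuration property. A sketch of what the entry might look like (the surrounding layout follows the usual nutch-site.xml convention; the value 3 is just the recommendation quoted above, and the description wording is mine):

```xml
<property>
  <name>ndfs.replication</name>
  <value>3</value>
  <description>Number of copies NDFS keeps of each block.</description>
</property>
```

The MIN_REPLICATION floor of 2 mentioned above is a constant in ndfs/FSNamesystem.java, so changing it means editing the source, not the config.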
Or could you tell me where I should start learning?
The
Thanks for your explanation, Andrzej.
I am going to read some of the NDFS source code and ask smarter questions later.
Thanks again.
Regards
/Jack
On 11/9/05, Andrzej Bialecki [EMAIL PROTECTED] wrote:
Jack Tang wrote:
Hi Andrzej
In the document, Michael said:
I'd strongly recommend using the system
There's a problem with that solution. For some sites, protocol-httpclient now
generates a SEVERE "Narrowly avoided an infinite loop in execute" message,
so the fetcher exits and only some pages are fetched before the SEVERE
message appears.
I don't know a solution; for now I have switched back to protocol-http.
Yes, Lucene is the best fit for what you're after. Nutch is built on
Lucene, and adds web crawling on top. You don't need a web crawler,
so using Lucene directly is the best fit - of course you'll have to
write code to integrate Lucene.
Erik
On 9 Nov 2005, at 08:48, Klaus wrote:
Yes, Lucene is the best fit for what you're after. Nutch is built on
Lucene, and adds web crawling on top. You don't need a web crawler,
so using Lucene directly is the best fit - of course you'll have to
write code to integrate Lucene.
Erik,
I was thinking about it for a while, but don't
Hello All
My question is kind of related to the email below. I was exploring the option
of full-text indexing a fairly large database that is 40 GB in size (data alone, minus
indices etc.). This data resides in Oracle, which has its own full-text indexing
engine. Does anyone have a recommendation
I was recently benchmarking fetching at a site with lots of bandwidth,
and it seemed to me that protocol-http is capable of faster crawling
than protocol-httpclient. So I don't think we should discard
protocol-http just yet. But there's a lot of duplicate code between
these, which is
Here's the problem:
I need to get the Nutch engine running on a collection of xml documents that
I have (containing news stories). The files are named in the following
manner:
example.xml.52908
example.xml.52909
example.xml.52910
example.xml.52911
...
example.xml.53365
example.xml.53366
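A collection named this way is easy to enumerate before handing it to an indexer. A minimal sketch, assuming only the naming scheme shown above; the class and method names here are hypothetical, not part of Nutch:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class XmlDocLister {
    // Matches names of the form <base>.xml.<number>, e.g. example.xml.52908.
    private static final Pattern NAME = Pattern.compile("(.+\\.xml)\\.(\\d+)");

    // Return the numeric suffixes of all matching file names, in input order,
    // skipping anything that doesn't follow the scheme.
    public static List<Integer> suffixes(List<String> fileNames) {
        List<Integer> ids = new ArrayList<>();
        for (String name : fileNames) {
            Matcher m = NAME.matcher(name);
            if (m.matches()) {
                ids.add(Integer.parseInt(m.group(2)));
            }
        }
        return ids;
    }
}
```

With the numeric id recovered per file, each document can be fed to the indexer with a stable key rather than relying on directory order.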
Rod Taylor wrote:
The attached patches for Generator.java and Injector.java allow a
specific temporary directory to be specified. This gives Nutch the full
path to these temporary directories and seems to fix the "No input
directories" issue when using a local filesystem with multiple task
Jérôme Charron wrote:
In fact, I think it could be a good idea to move the nutch language
identifier core code
to a standalone library or to lucene code.
Does it make sense? What do you think about it? What is the best solution
(standalone vs lucene)?
One could put it in the lucene contrib
Jack Tang wrote:
Below is the Google architecture as I picture it:
DataNode A
Master DataNode B GoogleCrawler
DataNode C
..
GoogleCrawler is kept running all the time. One day, it gets the fetchlist
from DataNode A, crawls all pages and
Massimo Miccoli wrote:
There's a problem with that solution. For some sites, protocol-httpclient now
generates a SEVERE "Narrowly avoided an infinite loop in execute" message,
so the fetcher exits and only some pages are fetched before the SEVERE
message appears.
I don't know a solution, for now I switch back
Doug Cutting wrote:
Jérôme Charron wrote:
In fact, I think it could be a good idea to move the nutch language
identifier core code
to a standalone library or to lucene code.
Does it make sense? What do you think about it? What is the best
solution
(standalone vs lucene)?
One could put it
+1
I've been planning to switch my crawler over to use protocol-
httpclient, but haven't got there yet. Interesting that there seems
to be a performance impact with the new plugin.
(In my crawl setup, I override the default HTTP plugin so I can
modify HTML content before it is written to
[ http://issues.apache.org/jira/browse/NUTCH-109?page=all ]
Andrzej Bialecki closed NUTCH-109:
---
Resolution: Invalid
The proposed improvement is not real; it comes from different config settings.
Proposed implementation uses a component with
In addition to Stefan Groschupf's detailed references, here are some short,
high-level answers to your questions:
Rozina Sorathia wrote:
1. What is distributed Nutch?
Nutch is, in effect, a distributed Lucene with large-scale web crawling on top.
2. How does distributed Nutch work?
Modeled after Google's
Hi Doug
On 11/10/05, Doug Cutting [EMAIL PROTECTED] wrote:
Jack Tang wrote:
Below is the Google architecture as I picture it:
DataNode A
Master DataNode B GoogleCrawler
DataNode C
..
GoogleCrawler is kept running all
Doug Cutting wrote:
... protocol-http is capable of faster crawling than protocol-httpclient.
So I don't think we should discard protocol-http just yet.
What do others think?
I think:
the HttpClient-based plugin [protocol-httpclient] uses its own threads,
while [protocol-http] does not create threads.
We should
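The threading contrast described above can be illustrated with a plain java.util.concurrent pool. This is only a toy sketch of the "plugin manages its own threads" style: the class name and the stubbed fetch are hypothetical, and no real HTTP is performed:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class FetcherSketch {

    // Stand-in for what a protocol plugin would return; no real HTTP is done.
    static String stubFetch(String url) {
        return "fetched:" + url;
    }

    // Fetch all URLs on a shared fixed-size pool, preserving input order.
    public static List<String> fetchAll(List<String> urls, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<String>> futures = new ArrayList<>();
            for (String url : urls) {
                futures.add(pool.submit(() -> stubFetch(url)));
            }
            List<String> results = new ArrayList<>();
            for (Future<String> f : futures) {
                results.add(f.get());  // blocks until each task completes
            }
            return results;
        } catch (Exception e) {
            throw new RuntimeException(e);
        } finally {
            pool.shutdown();
        }
    }
}
```

A plugin that does *not* create threads would instead run the fetch inline on the caller's thread, leaving concurrency entirely to the fetcher that invokes it; which style performs better is exactly the benchmarking question raised in this thread.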
I was recently benchmarking fetching at a site with lots of
bandwidth, and it seemed to me that protocol-http is capable of
faster crawling than protocol-httpclient. So I don't think we should
discard protocol-http just yet. But there's a lot of duplicate code
between these, which is
Thanx for the explanation :)
-Original Message-
From: Paul Baclace [mailto:[EMAIL PROTECTED]
Sent: Thursday, November 10, 2005 5:18 AM
To: nutch-dev@lucene.apache.org
Subject: Re: Distributed nutch
In addition to Stefan Groschupf's detailed references, here are some
short, high-level
Hi All,
I am new to Nutch. I have downloaded the latest nutch-0.7.1. I have Microsoft
Windows installed on my PC with JAVA_HOME set. I came to understand that Cygwin is
required to run Nutch. Why can't I run Nutch from Windows? If so, do I need
to change to Linux (any flavor of Unix)?
How can we create test