I've been converting to the latest release and will pull CVS tonight, but
I was wondering if there are scoring tweaks recommended for getting better
summaries when doing a non-intranet crawl.
I of course can't find my backup copy of nutch-site.xml from when I had
better summaries, and I'm not sure if its
I use mod_jk as well as Squid. At one point I had 3 web servers, ran
Tomcat standalone (and Resin), and used a Squid caching server to proxy
requests on port 80 (as well as load balance).
It just depends on how you want to go. For a single node, mod_jk works best.
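For reference, a minimal sketch of the Squid reverse-proxy/load-balancer setup described above. The hostnames, IPs, and ports are hypothetical, and this uses the Squid 2.6+ accelerator syntax (the 2.5-era `httpd_accel_*` directives differed):

```conf
# Hypothetical squid.conf fragment: Squid listening on port 80,
# balancing requests round-robin across two backend app servers.
http_port 80 accel vhost
cache_peer 10.0.0.1 parent 8080 0 no-query originserver round-robin name=app1
cache_peer 10.0.0.2 parent 8080 0 no-query originserver round-robin name=app2
```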
-byron
-----Original Message-----
If you have your own servers, I love to use screen.
Log in, type screen, and you get a virtual pty. Start bin/nutch fetch
$s1 (or whatever command you want).
Ctrl-a d detaches the screen; you can then log out.
Log back in, run screen -r, and you resume your screened sessions.
You can also
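As a terminal transcript, the detach/reattach workflow looks roughly like this (session and command names are just examples, not taken from the thread):

```text
$ screen                    # start a virtual terminal session
$ bin/nutch fetch $s1       # kick off the long-running job inside it
  Ctrl-a d                  # detach; it keeps running, safe to log out
$ screen -r                 # later, from a new login: reattach
$ screen -ls                # list sessions if you have more than one
```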
Did you make sure to include the filters in your plugin settings
(conf/nutch-site.xml)?
I must admit, I haven't paid attention to whether the plugin is
used during the fetch or only when you update the DB (or both?).
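For illustration, the property in question in conf/nutch-site.xml usually looks something like this — the plugin list below is an illustrative stock-style default, so adjust it to whatever filters your crawl actually needs:

```xml
<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-regex|parse-(text|html)|index-basic|query-(basic|site|url)</value>
  <description>Regular expression naming plugin directories to include.</description>
</property>
```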
-Original Message-
From: EM [EMAIL PROTECTED]
To:
To add from my experience:
I've preferred Resin (stability, performance).
I always go for more RAM rather than more servers. It's cheaper in the long run
when it comes to man-hours and service, as well as MTBF for your hardware.
Use Squid to proxy/load balance your Java servers. This helped alleviate
<![CDATA[<%=summxml%>]]></description>
</item>
<%
    }
}
%>
</channel>
</rss>
--- Andrzej Bialecki [EMAIL PROTECTED] wrote:
Byron Miller wrote:
Oh yeah, does anyone have any tips on cleaning up the SUMMARIES so any
lingering code, control characters, or non-XML-valid characters don't come
through?
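One simple way to handle this — a sketch, not Nutch's own code — is to scrub out the C0 control characters that XML 1.0 forbids before the summary goes into the RSS feed, keeping only tab, newline, and carriage return:

```shell
# Drop control characters invalid in XML 1.0 from stdin.
# Keeps tab (\011), newline (\012), and carriage return (\015);
# removes the rest of the \000-\037 range.
clean_summary() {
  tr -d '\000-\010\013\014\016-\037'
}

printf 'lingering\007control\033chars\n' | clean_summary
```

This only covers control characters; `&`, `<`, and `>` still need normal XML escaping (or a CDATA section, as in the JSP fragment above).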
Is there an archive of the mailing list anymore that is searchable? The
old lists on SourceForge are gone, and the one on Apache's site is just a
flatfile of recent subjects.
I'm interested in looking up the info on MapReduce and what it does, as
well as stuff I missed while I was out :)
Doug/All,
What is the status of MapReduce? I just got finished reading your paper
and all of the threads, and I'm drooling over the notion of such a system :)
I haven't seen anything on the list, but is there any code available? I
jumped over to the Lucene site, but of course the lists aren't searchable
right now (I get an error).
In the FAQ I did find a reference to a spell checker based on
Lucene/n-grams. Would this be the best way to offer the feature?
From: Doug Cutting [EMAIL PROTECTED]
To: nutch-user@incubator.apache.org
Date: Tue, 19 Apr 2005 12:42:25 -0700
Subject: Re: did you mean feature
Byron Miller wrote:
I haven't seen anything on the list, but is there any code available? I
jumped over to the Lucene site, but of course the lists aren't
of actual queries users have been doing and construct suggestions based
on those.
--
Sami Siren
Byron Miller wrote:
Doug,
Thanks for the quick response! I'll take a look at the code and see if
I can't come up with something that works.
At a quick glance, is this using
If you use the default nutch script, I would set
NUTCH_HEAPSIZE to 2000. That generally works for me,
and I have over 100 million URLs in the db and generally
10 million URLs per segment/index.
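For illustration, this mirrors the usual pattern by which the bin/nutch launcher turns NUTCH_HEAPSIZE (in megabytes) into a JVM flag — a sketch of the convention, not copied from the actual script:

```shell
# NUTCH_HEAPSIZE is an integer number of megabytes; the launcher
# turns it into a -Xmx flag for the JVM it spawns.
NUTCH_HEAPSIZE=2000
JAVA_HEAP_MAX="-Xmx${NUTCH_HEAPSIZE}m"
echo "$JAVA_HEAP_MAX"   # prints -Xmx2000m
```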
-byron
--- smith learner [EMAIL PROTECTED] wrote:
Thanks for your reply. But I guess this solution
and
create an index on it? (Or am I barking up the wrong tree here?)
--- Byron Miller [EMAIL PROTECTED] wrote:
I'm not sure what it is, but it seems I can only index
about 28-32 pages/sec. While not terribly slow on its
own, it did take nearly 30+ hours to index a 4 million
page segment.
I used
Use Jira to look at existing work in progress or to create a todo/feature
request that you can attach your diffs to.
http://issues.apache.org/jira/browse/Nutch
Hint: Create an account and log in, and you will get the "create new issue"
option, and from there you can do bug/feature/todo features and use the
Is it possible to build a bucket or container system that has x
amount of space and scales to the next bucket once that size has been reached?
The issue I have is that a db with 235 million pages takes FOREVER to do
anything, simply because it makes a duplicate of itself for all processes.
Would it
it to the
project? I didn't know about Jira, so I'll take a look there to see if
there is something like it being implemented.
Best []s
Leonardo Barbosa.
On 5/2/05, Byron Miller [EMAIL PROTECTED] wrote:
Use Jira to look at existing work in progress or to create a
todo/feature request
The FAT file system may not like some of the names of the index files.
Make sure you use FAT32 or a native POSIX/Unix file system.
FAT may not like the THOUSANDS of files that are created; in my segments
of a few million documents, during the index process I end up with a few
thousand files in the index.
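To get a feel for that file count on your own segments, a quick sanity check along these lines works (the directory layout here is hypothetical, created just so the example is self-contained):

```shell
# Count the files under an index/segment directory -- the kind of
# file count plain FAT struggles with at scale. We fabricate a tiny
# directory so the example runs anywhere.
dir=$(mktemp -d)
touch "$dir/part-00000" "$dir/part-00001" "$dir/index.done"
count=$(find "$dir" -type f | wc -l | tr -d ' ')
echo "$count"
rm -rf "$dir"
```

Against a real crawl, you would point `find` at your segments directory instead.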
Can't wait to try out the mapred stuff. Good luck in getting that branch
up and running :)
-Original Message-
From: [EMAIL PROTECTED]
To: nutch-user@incubator.apache.org
Date: Wed, 18 May 2005 09:31:03 -0700 (PDT)
Subject: Re: [Nutch-general] Re: Pre MapReduce Nutch release?
It all
like this.
thanks,
-byron
-Original Message-
From: Philippe LE NAOUR [EMAIL PROTECTED]
To: nutch-user@incubator.apache.org
Date: Sat, 21 May 2005 12:47:38 +0200
Subject: Re: Hardware requirements and some other questions about Nutch
Thanks for responding.
Byron Miller wrote:
If you don't run the DB analysis... ;-) Analysis can eat up a terabyte
for breakfast.
Indeed! We stopped doing db analysis and turned on the scoring per Doug's
recommendations; that saved tons of time and resources :)
That leaves you enough room for your segments, db, and the space
needed to
To: nutch-user@incubator.apache.org
Date: Sun, 22 May 2005 00:37:55 +0200
Subject: Re: Hardware requirements and some other questions about Nutch
Byron Miller wrote:
Here is what the great Doug said:
Are you using link analysis? Perhaps it is doing you a disservice by
prioritizing one site
Not that this fixes your Tomcat issues, but I have nothing but good things
to say about Resin.
It handles the load really well, is easy to manage, and is pretty
lightweight for what it does.
I have never had much luck with Tomcat, and believe me, I've tried many
times to go back.
Just my 2 cents.
The famous quote is "Your mileage may vary." There is an open source
version of Resin that you can run; see caucho.com.
Like I said, I've been running Nutch under Resin for a LONG time. Under
Tomcat I had issue after issue.
-byron
-Original Message-
From: [EMAIL PROTECTED]