[Nutch-dev] Re: Fw: Re: near-term plan

Piotr Kosiorowski Fri, 05 Aug 2005 05:22:10 -0700

I think Doug already answered this earlier in this thread.

"... Yes.  It is alpha-quality, not yet release-worthy, but it works.  If
you're an experienced Java developer, I'd encourage you to give it a
try.  If you're a user who doesn't want to look beyond the config files,
then I'd wait a bit."


P.

On 8/5/05, Jay Pound <[EMAIL PROTECTED]> wrote:
> is the mapreduce working yet?
> I would also like to test it.
> -J
> ----- Original Message -----
> From: "Piotr Kosiorowski" <[EMAIL PROTECTED]>
> To: <[email protected]>
> Sent: Friday, August 05, 2005 8:06 AM
> Subject: Re: Fw: Re: near-term plan
> 
> 
> > > I am not sure exactly what you did in this test, but I understand you
> > > were using the jar file I prepared (it was nutch from trunk + ndfs
> > > patches). As Andrzej applied these patches some time ago, we can
> > > assume you were using the NDFS code from trunk.
> > > Because a lot of work went into the mapreduce branch, it would be good
> > > to test it with the mapreduce branch code.
> > Regards
> > Piotr
> >
> > On 8/5/05, webmaster <[EMAIL PROTECTED]> wrote:
> > >
> > > ---------- Forwarded Message -----------
> > > From: "webmaster" <[EMAIL PROTECTED]>
> > > To: [email protected]
> > > Sent: Thu, 4 Aug 2005 19:42:53 -0500
> > > Subject: Re: near-term plan
> > >
> > > > I was using a nightly build that Piotr had given me, the
> > > > nutch-nightly.jar (actually it was nutch-dev0.7.jar or something of
> > > > that nature). I tested it on the Windows platform. I had 5 machines
> > > > running it: 2 quad P3 Xeons at 100 Mbit, 1 Pentium 4 3 GHz with
> > > > hyperthreading, 1 AMD Athlon XP 2600+ and 1 Athlon 64 3500+. All
> > > > have 1 GB or more of RAM. Now I have my big server, and if you have
> > > > worked on NDFS since the beginning of July I'll test it again. My
> > > > big server's HD array is very fast, 200+ MB/s, so it should be
> > > > better able to fully saturate gigabit.
> > > >
> > > > Anyway, the P4 and the 2 AMD machines are hooked into the switch at
> > > > gigabit, and the 2 Xeons are hooked into my other switch at
> > > > 100 Mbit, but it has a gigabit uplink to my gigabit switch, so both
> > > > Xeons would constantly be saturated at 11 MB/s, while the P4 was
> > > > able to reach higher speeds of 50-60 MB/s with its internal RAID 0
> > > > array (dual 120 GB drives). My main PC (Athlon 64 3500+) was the
> > > > namenode, a datanode and also the NDFS client.
> > > >
> > > > I could not get Nutch to work properly with NDFS. It was set up
> > > > correctly and it "kinda" worked, but it would crash the namenode
> > > > when I was trying to fetch segments in the NDFS filesystem, or
> > > > index them, or do much of anything. So I copied all my segment
> > > > directories, indexes, content, whatever (it was 1.8 GB), and some
> > > > DVD images onto NDFS. My primary machine and Nutch run off
> > > > 10000 rpm disks in RAID 0 (2x36 GB Raptors); they can output about
> > > > 120 MB/s sustained.
> > > >
> > > > So here is what I found out (in Windows): if I don't start a
> > > > datanode on the namenode with the conf pointing to 127.0.0.1
> > > > instead of its outside IP, the namenode will not copy data to the
> > > > other machines. If instead I'm running a datanode on the namenode,
> > > > data will replicate from that datanode to the other 3 datanodes. I
> > > > tried this a hundred ways to make it work with an independent
> > > > namenode, without luck.
> > > >
> > > > The way I saw data go across my network was this: I would put data
> > > > into NDFS, the namenode would request a datanode and find the
> > > > internal datanode, and copy data to it. Only after that, while the
> > > > datanode was still copying data from my other HDs into chunks on
> > > > the RAID array, would it replicate to the P4 via gigabit at
> > > > 50-60 MB/s. Then it would replicate from the P4 to the Xeons, kind
> > > > of alternating between them, as I only had replication at the
> > > > default of 2. I had about 100 GB to copy in, so the copy would
> > > > finish onto the internal RAID array fairly quickly, then it
> > > > finished replication to the P4, and the Xeons got a little bit of
> > > > data, but not nearly as much as the P4. My guess is that it only
> > > > needs 2 copies: the first copy was the datanode on the internal
> > > > machine, the second was the P4 datanode. The Xeons only had a
> > > > smaller connection, so they didn't receive as many chunks as fast
> > > > as the P4 could, and the P4 had enough space for all the data, so
> > > > it worked out. I should have set replication to 4.
> > > >
> > > > The AMD Athlon XP 1900+ was running Linux (SUSE 9.3) and it would
> > > > crash the namenode on Windows if I connected it as a datanode, so
> > > > that one didn't get tested. I was able to put out 50-60 MB/s to 1
> > > > machine, but it seemed it would not replicate data to multiple
> > > > machines at the same time. I would have thought it would output to
> > > > the Xeons at the same time as the P4, giving the Xeons 20% of the
> > > > data and the P4 80%, or something of that nature. But could it be
> > > > that they just aren't fast enough to request data before the P4,
> > > > which was receiving its 32 MB chunks every 1/2 second?
> > > >
> > > > The good news: CPU usage was only at 50% on my AMD 3500+. That was
> > > > while it was copying data to the internal datanode from the NDFS
> > > > client from another internal HD, running the namenode and running
> > > > the datanode internally. Does it now work with a separate namenode?
> > > >
> > > > I'm getting ready to run Nutch in Linux full time, if I can ever
> > > > get the damn driver for my HighPoint 2220 RAID card to work with
> > > > SUSE, any SUSE. The drivers don't work with dual-core CPUs or
> > > > something??? They are working on it; for now I'm stuck with
> > > > Fedora 4 until they fix it. So it's not ready for testing yet. I'll
> > > > let you know when I can test it in a full Linux environment.
> > > wow that was a long one!!!
> > > -Jay
> > > ------- End of Forwarded Message -------
> > >
> > >
> > > --
> > > Pound Web Hosting www.poundwebhosting.com
> > > (607)-435-3048
> > >
> >
> >
> 
> 
>


_______________________________________________
Nutch-developers mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-developers
