Jay Pound wrote:
the bad news: when doing a get, it uses 100% of the CPU to pull down data at
100 Mbit on a gigabit machine. Perhaps there is some code that could be
cleaned up in org.apache.nutch.fs.TestClient to make this faster, or it could
open multiple threads for receiving data to distribute the load across all
CPUs in the system.

I am much more interested in optimizing aggregate performance (rate at which all nodes can read files) than optimizing single-node performance.

Now, I was able to see a performance increase per machine while running
multiple datanodes on each box; by this I mean more network throughput per
box. So Doug, if you run 4 datanodes per box and your 400 GB drives aren't in
a RAID setup, you will see higher datanode throughput per box. Doug, I know
you're already looking at the namenode to see how to speed things up, but may
I request 2 things for NDFS that are going to be needed.

Multiple datanodes are automatically started if you have a comma-separated list of data directories in your config file. One datanode thread is launched per data directory. These are assumed to be on separate devices.
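A minimal sketch of such a config entry, assuming the data-directory property is named ndfs.data.dir (the property name and paths here are assumptions and may differ in your Nutch version):

```xml
<property>
  <name>ndfs.data.dir</name>
  <!-- one datanode thread is launched per comma-separated directory;
       they are assumed to sit on separate physical devices -->
  <value>/disk1/ndfs/data,/disk2/ndfs/data,/disk3/ndfs/data</value>
</property>
```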

1.) namenode - please thread out the different sections of the code: make
replication a single thread, with put and get as separate threads as well.
This should speed things up when working in a large cluster, and maybe also
lower the time it takes to respond when putting chunks on the machines. It
seems like it queues the put requests for each datanode; maybe run get and
put requests in parallel instead of waiting for a response from the datanode
being requested? If I'm wrong on any of this, sorry - I'm not a programmer
and don't know how to read the Nutch code to see whether this is true or
not; otherwise I would know the answer to these.

No file data ever flows through the namenode. All replication and file access is already in separate threads.
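For illustration, the parallel dispatch Jay describes can be sketched with a thread pool that sends a block to several datanodes at once and collects the acknowledgements, instead of waiting on each node in turn. Everything here - the class, the send() stub, the ack format - is invented for the sketch and is not the actual NDFS wiring:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical sketch: push one block to many datanodes in parallel
// rather than serially waiting for each response.
public class ParallelPut {

    public static List<String> putToAll(List<String> datanodes, byte[] block)
            throws InterruptedException, ExecutionException {
        ExecutorService pool = Executors.newFixedThreadPool(datanodes.size());
        try {
            // Submit one transfer per datanode; all run concurrently.
            List<Future<String>> acks = new ArrayList<>();
            for (String node : datanodes) {
                acks.add(pool.submit(() -> send(node, block)));
            }
            // Gather acknowledgements in submission order.
            List<String> results = new ArrayList<>();
            for (Future<String> f : acks) {
                results.add(f.get());
            }
            return results;
        } finally {
            pool.shutdown();
        }
    }

    // Stand-in for a real network transfer to a datanode.
    private static String send(String node, byte[] block) {
        return node + ":ok";
    }

    public static void main(String[] args) throws Exception {
        System.out.println(putToAll(List.of("dn1", "dn2", "dn3"), new byte[0]));
    }
}
```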

2.) datanode - please, please, please put the data into sub-directories the
way Squid does. I really do not want a single directory with a million
files/chunks in it. ReiserFS will do OK with it, but I'm running multiple
terabytes per datanode in a single logical drive configuration, and I don't
want to run the filesystem to its limit, crash, and lose all my data because
the machine won't boot (I have experience in this area, unfortunately).

This is a known problem. Please file a bug report. Better would be to submit a patch that fixes it, or hire someone to do so.
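A Squid-style fan-out can be sketched by deriving a nested path from each block id, so files spread evenly over many leaf directories. The blockPath() name, the two-level layout, and the blk_ naming are assumptions for illustration, not the actual NDFS on-disk format:

```java
// Hypothetical sketch: spread block files over 64 * 64 = 4096 leaf
// directories, so a million blocks averages only ~244 files per directory.
public class BlockDirs {

    // Map a block id to a two-level subdirectory path using two
    // 6-bit slices of the id (values 0..63 each).
    public static String blockPath(long blockId) {
        int d1 = (int) ((blockId >>> 6) & 0x3F); // bits 6-11
        int d2 = (int) (blockId & 0x3F);         // bits 0-5
        return "subdir" + d1 + "/subdir" + d2 + "/blk_" + blockId;
    }

    public static void main(String[] args) {
        // 123456789 -> bits 6-11 are 52, bits 0-5 are 21
        System.out.println(blockPath(123456789L)); // subdir52/subdir21/blk_123456789
    }
}
```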

3.) Excellent job on making it much more stable - it looks very close to
usable now!!

Thanks!

PS: Doug, I would like to talk with you sometime about this if you have an

Please use the mailing lists. I am fully booked and unavailable for hire at present.

Doug
