Jay Pound wrote:
the bad news: when doing a get, it will use 100% of the CPU to pull down data at 100 Mbit on a gigabit machine. Perhaps some code in org.apache.nutch.fs.TestClient could be cleaned up to make this faster, or it could open multiple threads for receiving data to distribute the load across all CPUs in the system.
I am much more interested in optimizing aggregate performance (rate at which all nodes can read files) than optimizing single-node performance.
Now, I was able to see a performance increase per machine while running multiple datanodes on each box; by this I mean more network throughput per box. So Doug, if your 400 GB drives aren't in a RAID setup, running 4 datanodes per box will give you higher datanode throughput per box. Doug, I know you're already looking at the namenode to see how to speed things up, but may I request two things that NDFS is going to need?
Multiple datanodes are automatically started if you have a comma-separated list of data directories in your config file. One datanode thread is launched per data directory. These are assumed to be on separate devices.
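For reference, the comma-separated list Doug describes is a sketch like the following in the Nutch configuration file; the property name shown here is an assumption based on the NDFS naming convention, so check nutch-default.xml for the exact key in your version:

```xml
<!-- Hypothetical nutch-site.xml fragment: one datanode thread is
     launched per listed directory; each should be a separate device. -->
<property>
  <name>ndfs.data.dir</name>
  <value>/disk1/ndfs/data,/disk2/ndfs/data,/disk3/ndfs/data,/disk4/ndfs/data</value>
</property>
```

With four directories on four physical drives, this gets the same per-box throughput benefit Jay saw without running four separate datanode processes.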
1.) namenode: please thread out the different sections of the code. Make replication a single thread, with put and get in separate threads as well; this should speed things up in a large cluster and maybe also lower the time it takes to respond when putting chunks on the machines. It seems like it queues the put requests for each datanode; maybe run get and put requests in parallel instead of waiting for a response from the datanode being requested? If I'm wrong on any of this, sorry; I'm not a programmer and can't read the Nutch code to check whether it's true.
No file data ever flows through the namenode. All replication and file access is already in separate threads.
2.) datanode: please, please, please put the data into subdirectories the way Squid does. I really do not want a single directory with a million files/chunks in it. ReiserFS will do OK with it, but I'm running multiple terabytes per datanode on a single logical drive; I don't want to push the filesystem to its limit, crash, and lose all my data because the machine won't boot (I unfortunately have experience in this area).
This is a known problem. Please file a bug report. Better would be to submit a patch that fixes it, or hire someone to do so.
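The Squid-style layout Jay is asking for amounts to hashing each block id into a small fixed tree of subdirectories, so no single directory ever holds more than a few thousand entries. A minimal sketch of the idea follows; the class and method names are illustrative, not actual NDFS code:

```java
// Sketch of two-level directory bucketing for datanode block files,
// similar to Squid's cache layout. Purely illustrative; not NDFS API.
public class BlockDirs {

    // Map a block id to a path like data/<b1>/<b2>/blk_<id>,
    // where b1 and b2 are the low two bytes of the id in hex.
    // This caps each leaf directory at roughly (total blocks / 65536) files.
    static String subdirFor(long blockId) {
        int b1 = (int) (blockId & 0xFF);         // first-level bucket, 00-ff
        int b2 = (int) ((blockId >> 8) & 0xFF);  // second-level bucket, 00-ff
        return String.format("data/%02x/%02x/blk_%d", b1, b2, blockId);
    }

    public static void main(String[] args) {
        // Example: where block 123456789 would live on disk.
        System.out.println(subdirFor(123456789L));
    }
}
```

Because the buckets come from the block id itself, the datanode can locate a block's file without any index lookup, and the 256x256 fan-out keeps directory scans fast on any filesystem.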
3.) Excellent job on making it much more stable; it looks very close to usable now!
Thanks!
PS: Doug, I would like to talk with you sometime about this if you have an
Please use the mailing lists. I am fully booked and unavailable for hire at present.
Doug
