On Mar 8, 12:06 am, "Liu Kejia (Donald)" <[email protected]> wrote:
> read(byte[] buf, int off, int len) has much better performance for large
> scans (e.g. while doing a merging compaction); with 128KB buffers it can
> reach 80MB/s of throughput, while read(long pos, byte[] buf, int off, int
> len) can only achieve 40MB/s. To make the best use of HDFS, we should use
> positioned reads for random reads/small scans to get shorter response
> times. OTOH, use plain reads for large scans to get better throughput, and
> close and reopen the file after the scanner is destroyed to avoid socket
> congestion and fd leaks.
>
> I then skimmed the current CellCacheScanner code and found it already
> deals with this issue! I guess Doug was already aware of this long ago?
> Now I am confused: we still have hundreds of tcp connections in the
> FIN_WAIT1 state on every DataNode, and I really wonder what the real
> cause is.
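For reference, the two access patterns Donald is comparing look roughly like
this against the Hadoop FileSystem API. This is only a minimal sketch of the
trade-off; the path, offsets and buffer sizes are made-up values, not anything
taken from the Hypertable code:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ReadPatterns {
      public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path file = new Path("/hypertable/tables/example/cs0"); // made-up path

        // Large scan: plain sequential read with a big buffer, then close the
        // stream when the scanner is destroyed so no DataNode connection or
        // file descriptor lingers.
        byte[] scanBuf = new byte[128 * 1024];
        FSDataInputStream scanIn = fs.open(file);
        try {
          int n;
          while ((n = scanIn.read(scanBuf, 0, scanBuf.length)) != -1) {
            // feed n bytes to the merging compaction / scanner ...
          }
        } finally {
          scanIn.close(); // close-and-reopen per scan avoids fd/socket buildup
        }

        // Random read / small scan: positioned read (pread). It does not move
        // the stream position and, per this thread, does not hold on to a
        // connection to the DataNode between calls.
        byte[] keyBuf = new byte[65536];
        FSDataInputStream randIn = fs.open(file);
        int got = randIn.read(1234567L, keyBuf, 0, keyBuf.length);
        // ... use the 'got' bytes; the stream can stay open for more lookups ...
        randIn.close();
      }
    }
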
Yes, Doug's already using the right Hypertable::Filesystem API. The problem is
the implementation of PositionRead in HdfsBroker: it is actually using seek and
a plain read. Doug made the change on Friday to use positioned reads; he didn't
see any performance difference in brief tests. I wonder if you can do us a
favor and test the change on the cluster.

__Luke

> Donald
>
> On Sat, Mar 7, 2009 at 4:01 AM, Luke <[email protected]> wrote:
>
> > Sorry, I didn't read your posts fully (was in a hurry). DfsBroker does
> > have a pread interface, which is used by the range server for random
> > reads. We just need to fix our HdfsBroker (PositionRead) to use the HDFS
> > positioned-read interface. Can you check if that improves things for
> > you?
>
> > On Mar 6, 11:37 am, Luke <[email protected]> wrote:
>
> > > It appears that HDFS does have a pread-like interface: readFully(pos,
> > > buf, len). Can you run the tests again using this API and see if
> > > things improve?
>
> > > On Mar 6, 11:18 am, Luke Lu <[email protected]> wrote:
>
> > > > Great analysis Donald! Thanks for the numbers. It seems to me the
> > > > right fix would be to enhance the HDFS client library to add a
> > > > pread-like interface that does the right thing for random reads.
> > > > Maybe you want to file a Hadoop jira ticket for that?
> > > >
> > > > __Luke
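To make the PositionRead fix Luke mentions concrete, here is a rough sketch of
the difference between the old seek-plus-plain-read approach and a positioned
read against FSDataInputStream. The method names are invented for illustration;
this is not the actual HdfsBroker code:

    import java.io.IOException;
    import org.apache.hadoop.fs.FSDataInputStream;

    public class PositionReadSketch {

      // Old style: seek() followed by a plain read(). This moves the stream
      // position and goes through the sequential-read path, which (per this
      // thread) keeps a connection open to the DataNode serving the block.
      static int preadViaSeek(FSDataInputStream in, long offset, byte[] buf,
                              int len) throws IOException {
        in.seek(offset);
        return in.read(buf, 0, len);
      }

      // New style: positioned read (pread). The stream position is untouched
      // and no DataNode connection is kept between calls.
      static int preadPositioned(FSDataInputStream in, long offset, byte[] buf,
                                 int len) throws IOException {
        return in.read(offset, buf, 0, len);
        // or, to keep reading until exactly len bytes have arrived:
        // in.readFully(offset, buf, 0, len);
      }
    }
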
> > > > On Mar 6, 2009, at 3:15 AM, Liu Kejia (Donald) wrote:
>
> > > > > On Thu, Mar 5, 2009 at 11:50 PM, donald <[email protected]> wrote:
>
> > > > > > So I have done more digging on this subject...
> > > > > >
> > > > > > There is another problem if many files are kept open at the same
> > > > > > time: once you read some data from an HDFS file by calling
> > > > > > FSInputStream.read(byte[] buf, int off, int len), a tcp connection
> > > > > > is set up between HdfsBroker and the DataNode that holds the file
> > > > > > block. This connection is kept until you read into another block of
> > > > > > the file (blocks are 64MB by default) or close the file entirely.
> > > > > > There is a timeout on the server side, but I see no sign of one on
> > > > > > the client side. So you quickly end up with a lot of idle
> > > > > > connections between the HdfsBroker and many DataNodes.
> > > > > >
> > > > > > What's even worse, no matter how many bytes the application wants
> > > > > > to read, the HDFS client library always asks the chosen DataNode to
> > > > > > send all the remaining bytes of the block. That means if you read
> > > > > > 1 byte from the beginning of a block, the DataNode actually gets a
> > > > > > request to send the whole block, of which only the first few bytes
> > > > > > are read. The consequences, if the client then reads nothing for a
> > > > > > long while, are: 1) the kernel tcp send queue on the DataNode side
> > > > > > and the tcp receive queue on the client side quickly fill up;
> > > > > > 2) the DataNode Xceiver thread (there is a maximum of 256 by
> > > > > > default) is blocked. Eventually the Xceiver times out and closes
> > > > > > the connection. However, the FIN packet cannot get through because
> > > > > > the send and receive queues are still full. Here is what I observe
> > > > > > on one node of our test cluster:
> > > > > >
> > > > > > $ netstat -ntp
> > > > > > (Not all processes could be identified, non-owned process info
> > > > > > will not be shown, you would have to be root to see it all.)
> > > > > > Active Internet connections (w/o servers)
> > > > > > Proto Recv-Q Send-Q Local Address       Foreign Address     State       PID/Program name
> > > > > > tcp        0 121937 10.65.25.150:50010  10.65.25.150:38595  FIN_WAIT1   -
> > > > > > [...]
> > > > > > tcp    74672      0 10.65.25.150:38595  10.65.25.150:50010  ESTABLISHED 32667/java
> > > > > > [...]
> > > > > > (and hundreds of other connections in the same states)
> > > > > >
> > > > > > Possible solutions without modifying the hadoop client library are:
> > > > > > 1) open-read-close the file stream every time the cell store is
> > > > > > accessed; 2) always use positioned read, read(long position, byte[]
> > > > > > buf, int off, int len), instead, because pread doesn't keep the tcp
> > > > > > connection to the DataNodes. Solution 1 is not scalable because
> > > > > > every open() operation involves the HDFS NameNode, which
> > > > > > immediately becomes a bottleneck: in our test cluster the NameNode
> > > > > > can only handle hundreds of parallel open() requests per second,
> > > > > > with an average delay of 2-3ms. I haven't tested the performance of
> > > > > > solution 2 yet; I will put up some numbers tomorrow.
> > > > > >
> > > > > > Donald
>
> > > > > I've created 1000 files in an 11-node hadoop cluster, each file 20MB
> > > > > in size. Then I wrote simple java programs to do the following tests:
> > > > >
> > > > > Opening all 1000 files, one process: about 2.5s (2.5ms latency)
> > > > > Closing all 1000 files, one process: 50ms
> > > > > Opening all 1000 files, 10 processes (running distributed across the
> > > > >   10 datanodes): 15s (15ms latency, or 700 opens/s)
> > > > > Reading the first 1KB of each file (plain read), one process: 6s
> > > > >   (6ms latency)
> > > > > Reading the first 1KB of each file (positioned read), one process:
> > > > >   2.5s (2.5ms latency)
> > > > > Reading the first 100KB of each file, 1KB at a time (positioned
> > > > >   read), one process: 130s (1.3ms latency, or 0.77MB/s)
> > > > > Reading the first 100KB of each file, 1KB at a time (plain read),
> > > > >   one process: 8.8s (11MB/s)
> > > > >
> > > > > The tests were run multiple times to make sure all blocks were
> > > > > effectively cached in the Linux page cache. The hadoop version was
> > > > > 0.19.0 with a few patches; io.file.buffer.size = 4096.
> > > > >
> > > > > Donald
>
> > > > > On Feb 25, 8:59 pm, "Liu Kejia (Donald)" <[email protected]> wrote:
>
> > > > > > It turns out the hadoop-default.xml packaged in my custom
> > > > > > hadoop-0.19.0-core.jar sets "io.file.buffer.size" to 131072
> > > > > > (128KB), which means DfsBroker has to allocate a 128KB buffer for
> > > > > > every open file. The official hadoop-0.19.0-core.jar sets this
> > > > > > value to 4096, which is more reasonable for applications like
> > > > > > Hypertable.
> > > > > >
> > > > > > Donald
> > > > > >
> > > > > > On Fri, Feb 20, 2009 at 11:55 AM, Liu Kejia (Donald)
> > > > > > <[email protected]> wrote:
> > > > > >
> > > > > > > Caching might not work very well because keys are randomly
> > > > > > > generated, resulting in bad locality... Even though it's Java,
> > > > > > > hundreds of kilobytes per file object is still very big. I'll
> > > > > > > profile HdfsBroker to see what exactly is using so much memory,
> > > > > > > and post the results later.
> > > > > > >
> > > > > > > Donald
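The per-open-file buffer cost discussed above (io.file.buffer.size) shows up
directly in how streams are opened. A small illustrative sketch against the
Hadoop FileSystem API; the 4096 and 131072 values are the ones from this
thread, and the path and class are made up:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class BufferSizeSketch {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Streams opened without an explicit buffer size use io.file.buffer.size
        // bytes each; with 131072 that is ~128KB per open file, which adds up
        // quickly when thousands of CellStore files are held open.
        conf.setInt("io.file.buffer.size", 4096);
        FileSystem fs = FileSystem.get(conf);

        Path file = new Path("/hypertable/tables/example/cs0"); // made-up path

        FSDataInputStream small = fs.open(file);         // 4096-byte buffer
        FSDataInputStream big   = fs.open(file, 131072); // asks for a 128KB buffer
        small.close();
        big.close();
      }
    }
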
> > > > > > > On Fri, Feb 20, 2009 at 11:20 AM, Doug Judd <[email protected]> wrote:
> > > > > > >
> > > > > > > > Hi Donald,
> > > > > > > >
> > > > > > > > Interesting. One possibility would be to have an open CellStore
> > > > > > > > cache. Frequently accessed CellStores would remain open, while
> > > > > > > > seldom used ones get closed. The effectiveness of this solution
> > > > > > > > would depend on the workload. Do you think this might work for
> > > > > > > > your use case?
> > > > > > > >
> > > > > > > > - Doug
> > > > > > > >
> > > > > > > > On Thu, Feb 19, 2009 at 7:09 PM, donald <[email protected]> wrote:
> > > > > > > >
> > > > > > > > > Hi all,
> > > > > > > > >
> > > > > > > > > I recently ran into the problem that HdfsBroker throws an
> > > > > > > > > out-of-memory exception because too many CellStore files in
> > > > > > > > > HDFS are kept open - I have over 600 ranges per range server,
> > > > > > > > > with a maximum of 10 cell stores per range, so that's up to
> > > > > > > > > 6,000 open files at the same time, making HdfsBroker take
> > > > > > > > > gigabytes of memory.
> > > > > > > > >
> > > > > > > > > If we open the CellStore file on demand, i.e. when a scanner
> > > > > > > > > is created on it, this problem is gone. However, random-read
> > > > > > > > > performance may drop due to the overhead of opening a file in
> > > > > > > > > HDFS. Any better solution?
> > > > > > > > >
> > > > > > > > > Donald
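One way to picture the open-CellStore-cache idea Doug suggests is an
access-ordered map of open streams that closes the least recently used one when
a cap is exceeded. This is only a sketch of the concept, with a made-up
capacity and class name, not code from Hypertable:

    import java.io.IOException;
    import java.util.LinkedHashMap;
    import java.util.Map;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class OpenCellStoreCache {
      private final FileSystem fs;
      private final LinkedHashMap<Path, FSDataInputStream> cache;

      public OpenCellStoreCache(FileSystem fs, final int capacity) {
        this.fs = fs;
        // Access-ordered map: once more than 'capacity' files are open, the
        // least recently used entry is evicted and its stream closed.
        this.cache = new LinkedHashMap<Path, FSDataInputStream>(16, 0.75f, true) {
          @Override
          protected boolean removeEldestEntry(Map.Entry<Path, FSDataInputStream> e) {
            if (size() > capacity) {
              try { e.getValue().close(); } catch (IOException ignored) {}
              return true;
            }
            return false;
          }
        };
      }

      // Returns an open stream for the CellStore, opening it on first use.
      public synchronized FSDataInputStream get(Path cellStore) throws IOException {
        FSDataInputStream in = cache.get(cellStore);
        if (in == null) {
          in = fs.open(cellStore);
          cache.put(cellStore, in);
        }
        return in;
      }
    }

As Doug and Donald both note above, how well this works depends on the
workload: with randomly generated keys most lookups would miss the cache and
still pay the NameNode open() cost.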
