On Thu, Mar 5, 2009 at 11:50 PM, donald <[email protected]> wrote:

>
> So I have done more digging on this subject...
>
> There is another problem if many files are kept open at the same time:
> once you read some data from an HDFS file by calling FSInputStream.read
> (byte[] buf, int off, int len), a TCP connection is set up between
> HdfsBroker and the DataNode that holds the file's current block. This
> connection is kept open until you read into another block of the file
> (by default 64MB in size), or close the file entirely. There is a
> timeout on the server side, but I see no corresponding timeout on the
> client side. So you quickly end up with a lot of idle connections
> between the HdfsBroker and many DataNodes.
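>
> To make the lifecycle concrete, here is a minimal sketch (the path is
> hypothetical):
>
> import org.apache.hadoop.conf.Configuration;
> import org.apache.hadoop.fs.FSDataInputStream;
> import org.apache.hadoop.fs.FileSystem;
> import org.apache.hadoop.fs.Path;
>
> public class PlainReadDemo {
>   public static void main(String[] args) throws Exception {
>     FileSystem fs = FileSystem.get(new Configuration());
>     FSDataInputStream in = fs.open(new Path("/demo/cs0")); // hypothetical file
>     byte[] buf = new byte[1024];
>     in.read(buf, 0, buf.length); // sets up a TCP connection to a DataNode
>     // The connection now sits idle until we read into the next 64MB
>     // block or call close().
>     in.close();
>   }
> }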
>
> What's even worse, no matter how many bytes the application wants to
> read, the HDFS client library always asks the chosen DataNode to send
> all the remaining bytes of the block. This means if you read 1 byte
> from the beginning of a block, the DataNode actually receives a request
> to send the whole block, of which only the first few bytes are ever
> read. The consequences are: if the client reads nothing for quite a
> long while, 1) the kernel TCP send queue on the DataNode side and the
> TCP receive queue on the client side quickly fill up; 2) the DataNode
> Xceiver thread (with a maximum count of 256 by default) is blocked.
> Eventually the Xceiver times out and closes the connection. However,
> this FIN packet cannot reach the client side, as the send and receive
> queues are still blocked. Here is what I observe on one node of our
> test cluster:
> $ netstat -ntp
> (Not all processes could be identified, non-owned process info
>  will not be shown, you would have to be root to see it all.)
> Active Internet connections (w/o servers)
> Proto Recv-Q Send-Q Local Address          Foreign Address        State       PID/Program name
> tcp        0 121937 10.65.25.150:50010     10.65.25.150:38595     FIN_WAIT1   -
> [...]
> tcp    74672      0 10.65.25.150:38595     10.65.25.150:50010     ESTABLISHED 32667/java
> [...]
> (and hundreds of other connections in the same states)
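>
> The state above can be reproduced with a sketch like the following (file
> names and counts are made up); once the loop goes idle, netstat fills up
> with connection pairs like those shown:
>
> import org.apache.hadoop.conf.Configuration;
> import org.apache.hadoop.fs.FSDataInputStream;
> import org.apache.hadoop.fs.FileSystem;
> import org.apache.hadoop.fs.Path;
> import java.util.ArrayList;
> import java.util.List;
>
> public class StuckConnections {
>   public static void main(String[] args) throws Exception {
>     FileSystem fs = FileSystem.get(new Configuration());
>     List<FSDataInputStream> streams = new ArrayList<FSDataInputStream>();
>     for (int i = 0; i < 500; i++) {
>       FSDataInputStream in = fs.open(new Path("/test/file-" + i));
>       in.read(); // read 1 byte; the DataNode is asked for the whole block
>       streams.add(in); // keep the stream (and its connection) open
>     }
>     Thread.sleep(600 * 1000); // go idle; run `netstat -ntp` meanwhile
>   }
> }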
>
> Possible solutions without modifying the Hadoop client library are: 1)
> open-read-close the file stream every time the cell store is accessed;
> 2) always use positioned read, read(long position, byte[] buf, int off,
> int len), instead, because pread doesn't keep the TCP connection to the
> DataNodes. Solution 1 is not scalable because every open() operation
> involves an interaction with the HDFS NameNode, which immediately
> becomes a bottleneck: in our test cluster the NameNode can only handle
> hundreds of parallel open() requests per second, with an average delay
> of 2-3ms. I haven't tested the performance of solution 2 yet; I will
> put up some numbers tomorrow.
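>
> Roughly, the two read styles look like this (an untested sketch; the
> path is hypothetical):
>
> import org.apache.hadoop.conf.Configuration;
> import org.apache.hadoop.fs.FSDataInputStream;
> import org.apache.hadoop.fs.FileSystem;
> import org.apache.hadoop.fs.Path;
>
> public class ReadStyles {
>   public static void main(String[] args) throws Exception {
>     FileSystem fs = FileSystem.get(new Configuration());
>     FSDataInputStream in = fs.open(new Path("/demo/cs0"));
>     byte[] buf = new byte[1024];
>     // 1) Plain read: advances the stream position and holds a TCP
>     //    connection to the DataNode between calls.
>     in.read(buf, 0, buf.length);
>     // 2) Positioned read (pread): stateless; the DataNode connection
>     //    is set up and torn down inside the call.
>     in.read(0L, buf, 0, buf.length);
>     in.close();
>   }
> }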
>
> Donald


I've created 1000 files in an 11-node Hadoop cluster, each file 20MB in
size. Then I wrote simple Java programs to do the following tests:

Opening all 1000 files, one process: about 2.5s (2.5ms latency)
Closing all 1000 files, one process: 50ms
Opening all 1000 files, 10 processes (running in parallel on the 10
DataNodes): 15s (15ms latency, or ~700 opens/s)
Reading the first 1KB of each file (plain read), one process: 6s (6ms
latency)
Reading the first 1KB of each file (positioned read), one process: 2.5s
(2.5ms latency)
Reading the first 100KB of each file, 1KB at a time (positioned read), one
process: 130s (1.3ms latency, or 0.77MB/s)
Reading the first 100KB of each file, 1KB at a time (plain read), one
process: 8.8s (11MB/s)

The tests were run multiple times to make sure all blocks were effectively
cached in the Linux page cache. The Hadoop version was 0.19.0 with a few
patches, and io.file.buffer.size = 4096.
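
A sketch of what the 1KB-at-a-time positioned-read loop might look like
(simplified; the /bench path is made up, not the actual test code):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class PreadBench {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    byte[] buf = new byte[1024];
    long start = System.currentTimeMillis();
    for (int f = 0; f < 1000; f++) {
      FSDataInputStream in = fs.open(new Path("/bench/file-" + f));
      for (long pos = 0; pos < 100 * 1024; pos += 1024) {
        in.read(pos, buf, 0, buf.length); // pread: no lingering connection
      }
      in.close();
    }
    System.out.println("elapsed ms: " + (System.currentTimeMillis() - start));
  }
}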

Donald


>
>
> On Feb 25, 8:59 pm, "Liu Kejia (Donald)" <[email protected]> wrote:
> > It turns out the hadoop-default.xml packaged in my custom
> > hadoop-0.19.0-core.jar sets "io.file.buffer.size" to 131072 (128KB),
> > which means DfsBroker has to allocate a 128KB buffer for every open
> > file. The official hadoop-0.19.0-core.jar sets this value to 4096,
> > which is more reasonable for applications like Hypertable.
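> >
> > A quick way to check which value a client actually resolves, and to
> > override it per client (sketch; the class name is made up):
> >
> > import org.apache.hadoop.conf.Configuration;
> >
> > public class BufferSizeCheck {
> >   public static void main(String[] args) {
> >     Configuration conf = new Configuration();
> >     // prints whatever hadoop-default.xml/hadoop-site.xml resolve to
> >     System.out.println(conf.getInt("io.file.buffer.size", 4096));
> >     conf.setInt("io.file.buffer.size", 4096); // per-client override
> >   }
> > }
> >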
> > Donald
> >
> > On Fri, Feb 20, 2009 at 11:55 AM, Liu Kejia (Donald)
> > <[email protected]> wrote:
> >
> > > Caching might not work very well because keys are randomly generated,
> > > resulting in bad locality...
> > > Even if it's Java, hundreds of kilobytes per file object is still
> > > very big. I'll profile HdfsBroker to see what exactly is using so
> > > much memory, and post the results later.
> >
> > > Donald
> >
> > > On Fri, Feb 20, 2009 at 11:20 AM, Doug Judd <[email protected]> wrote:
> >
> > >> Hi Donald,
> >
> > >> Interesting.  One possibility would be to have an open CellStore
> > >> cache.  Frequently accessed CellStores would remain open, while
> > >> seldom used ones get closed.  The effectiveness of this solution
> > >> would depend on the workload.  Do you think this might work for
> > >> your use case?
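> >
> > >> A hypothetical sketch of such a cache, using LinkedHashMap's
> > >> access-order eviction (CellStore is just a placeholder here):
> >
> > >> import java.util.LinkedHashMap;
> > >> import java.util.Map;
> > >>
> > >> interface CellStore { void close(); }
> > >>
> > >> public class OpenCellStoreCache extends LinkedHashMap<String, CellStore> {
> > >>   private final int maxOpen;
> > >>   public OpenCellStoreCache(int maxOpen) {
> > >>     super(16, 0.75f, true); // access order, for LRU behavior
> > >>     this.maxOpen = maxOpen;
> > >>   }
> > >>   protected boolean removeEldestEntry(Map.Entry<String, CellStore> e) {
> > >>     if (size() > maxOpen) { e.getValue().close(); return true; }
> > >>     return false;
> > >>   }
> > >> }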
> >
> > >> - Doug
> >
> > >> On Thu, Feb 19, 2009 at 7:09 PM, donald <[email protected]> wrote:
> >
> > >>> Hi all,
> >
> > >>> I recently ran into the problem that HdfsBroker throws an
> > >>> out-of-memory exception because too many CellStore files in HDFS
> > >>> are kept open - I have over 600 ranges per range server, with a
> > >>> maximum of 10 cell stores per range; that's 6,000 open files at
> > >>> the same time, making HdfsBroker take gigabytes of memory.
> >
> > >>> If we open the CellStore file on demand, i.e. when a scanner is
> > >>> created on it, this problem is gone. However, random-read
> > >>> performance may drop due to the overhead of opening a file in
> > >>> HDFS. Any better solution?
> >
> > >>> Donald
> >
>
