Hi Doug,

I've made some other changes to HdfsBroker.java and other files to make it
output more diagnostic information, e.g. read/write operations that take
more than 1 second, the read/write speed during the last minute, etc. I
am sending you my whole src/java directory in a tarball in case you
are interested.
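The slow-operation logging is roughly like this (a simplified sketch, not
the actual patch; the class and field names here are made up):

    import java.io.IOException;
    import java.util.concurrent.atomic.AtomicLong;
    import java.util.logging.Logger;
    import org.apache.hadoop.fs.FSDataInputStream;

    class ReadStats {
      static final Logger log = Logger.getLogger("HdfsBroker");
      static final AtomicLong bytesReadThisMinute = new AtomicLong(); // dumped/reset by a timer thread

      // time each read, log anything slower than one second, feed the per-minute counter
      static int timedRead(FSDataInputStream in, byte[] buf) throws IOException {
        long t0 = System.currentTimeMillis();
        int nread = in.read(buf, 0, buf.length);
        long elapsedMs = System.currentTimeMillis() - t0;
        if (elapsedMs > 1000)
          log.warning("slow read: " + elapsedMs + " ms for " + nread + " bytes");
        if (nread > 0)
          bytesReadThisMinute.addAndGet(nread);
        return nread;
      }
    }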

Donald

On Wed, Mar 18, 2009 at 11:13 AM, Doug Judd <[email protected]> wrote:
> Donald,
>
> Thanks for doing this.  I didn't notice any performance difference with my
> tests either.  Send me your HdfsBroker patch and I'll be sure it gets into
> the next release.
>
> - Doug
>
> On Tue, Mar 17, 2009 at 8:00 PM, donald <[email protected]> wrote:
>>
>> Hi Luke,
>>
>> I changed the seek-read logic in HdfsBroker to use HDFS's built-in
>> pread API last week on an 11-node cluster. As expected, the number of
>> socket connections per node dropped drastically, from hundreds to about
>> 20. Meanwhile there was no significant change in performance.
>>
>> I have been reading the HDFS 0.19 code and doing some tests these days;
>> the results show that HDFS create/write/close fails with high probability
>> when some nodes are under heavy load, especially when the number of nodes
>> is small. I'll post my detailed analysis later in a new thread.
>>
>> Donald
>>
>> On Mar 8, 4:25 pm, "Liu Kejia (Donald)" <[email protected]> wrote:
>> > Got it! I was careless not to spot the problem in HdfsBroker...
>> > I'll give it a try and post the results soon. Thanks very much.
>> >
>> > Donald
>> >
>> > On Sun, Mar 8, 2009 at 4:15 PM, Luke <[email protected]> wrote:
>> >
>> > > On Mar 8, 12:06 am, "Liu Kejia (Donald)" <[email protected]> wrote:
>> > > > read(byte[] buf, int off, int len) has much better performance for large
>> > > > scans (e.g. while doing a merging compaction): with 128KB buffers it can
>> > > > reach 80MB/s throughput, while read(long pos, byte[] buf, int off, int len)
>> > > > can only achieve 40MB/s. To make the best use of HDFS, we should use
>> > > > positioned read for random reads/small scans to get shorter response
>> > > > times; OTOH, use plain read for large scans to get better throughput, and
>> > > > close and reopen the file after the scanner is destroyed to avoid socket
>> > > > congestion and fd leaks.
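>> > > > For reference, the two FSDataInputStream calls I'm comparing look roughly
>> > > > like this (just a sketch; fs and path stand for whatever the caller
>> > > > already has):
>> > > >
>> > > >   FSDataInputStream in = fs.open(path);
>> > > >   byte[] buf = new byte[128 * 1024];
>> > > >   long pos = 0;
>> > > >
>> > > >   // plain (stateful) read: advances the stream position and keeps the
>> > > >   // DataNode connection open, so it is fast for large sequential scans
>> > > >   int n1 = in.read(buf, 0, buf.length);
>> > > >
>> > > >   // positioned read (pread): independent of the stream position and does
>> > > >   // not hold a long-lived connection, better for random reads/small scans
>> > > >   int n2 = in.read(pos, buf, 0, buf.length);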
>> > > > I then skimmed the current CellCacheScanner code and found it already
>> > > > deals with this issue! I guess Doug was already aware of this long before?
>> > > > Now I am confused that we still have hundreds of TCP connections in
>> > > > FIN_WAIT1 state on every DataNode; I really wonder what the real cause is.
>> >
>> > > Yes, Doug's already using the right Hypertable::Filesystem API. The
>> > > problem is the implementation of PositionRead in HdfsBroker: it is
>> > > actually using seek and plain read. Doug just made the change on
>> > > Friday to use positioned read; he didn't see any performance
>> > > difference in brief tests.
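>> > > If I understand the change correctly, it amounts to something like this
>> > > inside PositionRead (paraphrasing, not the actual diff):
>> > >
>> > >   // before (roughly): stateful seek + plain read, which leaves a
>> > >   // connection to the DataNode open after the call returns
>> > >   //   in.seek(position);
>> > >   //   nread = in.read(buf, 0, amount);
>> > >
>> > >   // after: HDFS positioned read, no long-lived connection per reader
>> > >   int nread = in.read(position, buf, 0, amount);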
>> >
>> > > I wonder if you can do us a favor and test the change on the cluster.
>> >
>> > > __Luke
>> >
>> > > > Donald
>> >
>> > > > On Sat, Mar 7, 2009 at 4:01 AM, Luke <[email protected]> wrote:
>> >
>> > > > > Sorry, I didn't read your posts fully (was in a hurry). DfsBroker does
>> > > > > have a pread interface, which is used by the range server for random
>> > > > > reads. We just need to fix our HdfsBroker (PositionRead) to use the HDFS
>> > > > > positioned read interface. Can you check if that improves things for you?
>> >
>> > > > > On Mar 6, 11:37 am, Luke <[email protected]> wrote:
>> > > > > > It appears that HDFS does have a pread-like interface: readFully(pos,
>> > > > > > buf, len). Can you run the tests again using this API and see if
>> > > > > > things improve?
>> >
>> > > > > > On Mar 6, 11:18 am, Luke Lu <[email protected]> wrote:
>> >
>> > > > > > > Great analysis, Donald! Thanks for the numbers. It seems to me the
>> > > > > > > right fix would be to enhance the HDFS client library with a
>> > > > > > > pread-like interface that does the right thing for random reads.
>> > > > > > > Maybe you want to file a Hadoop jira ticket for that?
>> >
>> > > > > > > __Luke
>> >
>> > > > > > > On Mar 6, 2009, at 3:15 AM, Liu Kejia (Donald) wrote:
>> >
>> > > > > > > > On Thu, Mar 5, 2009 at 11:50 PM, donald <[email protected]> wrote:
>> >
>> > > > > > > > So I have done more digging on this subject...
>> >
>> > > > > > > > There is another problem if many files are kept open at the same time:
>> > > > > > > > once you read some data from an HDFS file by calling
>> > > > > > > > FSInputStream.read(byte[] buf, int off, int len), a TCP connection
>> > > > > > > > between HdfsBroker and the DataNode that contains the file block is set
>> > > > > > > > up, and this connection is kept until you read another block (by default
>> > > > > > > > 64MB in size) of the file, or close the file entirely. There is a timeout
>> > > > > > > > on the server side, but I see no sign of one on the client side. So you
>> > > > > > > > quickly end up with a lot of idle connections between the HdfsBroker and
>> > > > > > > > many DataNodes.
>> >
>> > > > > > > > What's even worse, no matter how many bytes the application wants to
>> > > > > > > > read, the HDFS client library always asks the chosen DataNode to send
>> > > > > > > > all the remaining bytes of the block. This means that if you read 1 byte
>> > > > > > > > from the beginning of a block, the DataNode actually gets a request to
>> > > > > > > > send the whole block, of which only the first few bytes are read. The
>> > > > > > > > consequences are, if the client reads nothing for quite a long while:
>> > > > > > > > 1) the kernel TCP send queue on the DataNode side and the TCP receive
>> > > > > > > > queue on the client side quickly fill up; 2) the DataNode Xceiver thread
>> > > > > > > > (there is a max count of 256 by default) is blocked. Eventually the
>> > > > > > > > Xceiver times out and closes the connection. However, this FIN packet
>> > > > > > > > cannot reach the client side because the send and receive queues are
>> > > > > > > > still blocked. Here is what I observed from one node of our test cluster:
>> > > > > > > > $ netstat -ntp
>> > > > > > > > (Not all processes could be identified, non-owned process info
>> > > > > > > >  will not be shown, you would have to be root to see it all.)
>> > > > > > > > Active Internet connections (w/o servers)
>> > > > > > > > Proto Recv-Q Send-Q Local Address       Foreign Address     State       PID/Program name
>> > > > > > > > tcp        0 121937 10.65.25.150:50010  10.65.25.150:38595  FIN_WAIT1   -
>> > > > > > > > [...]
>> > > > > > > > tcp    74672      0 10.65.25.150:38595  10.65.25.150:50010  ESTABLISHED 32667/java
>> > > > > > > > [...]
>> > > > > > > > (and hundreds of other connections in the same states)
>> >
>> > > > > > > > Possible solutions without modifying the Hadoop client library are:
>> > > > > > > > 1) open-read-close the file stream every time the cell store is accessed;
>> > > > > > > > 2) always use positioned read, read(long position, byte[] buf, int off,
>> > > > > > > > int len), instead, because pread doesn't keep the TCP connection to the
>> > > > > > > > DataNodes. Solution 1 is not scalable because every open() operation
>> > > > > > > > involves an interaction with the HDFS NameNode, which immediately becomes
>> > > > > > > > a bottleneck: in our test cluster the NameNode can only handle hundreds
>> > > > > > > > of parallel open() requests per second, with an average delay of 2-3ms.
>> > > > > > > > I haven't tested the performance of solution 2 yet; I will put up some
>> > > > > > > > numbers tomorrow.
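>> > > > > > > > To make solution 1 concrete, each access would look roughly like this
>> > > > > > > > (sketch only; fs, path, offset and buf stand for whatever the broker
>> > > > > > > > already has at hand):
>> > > > > > > >
>> > > > > > > >   FSDataInputStream in = fs.open(path);   // a NameNode round trip every time
>> > > > > > > >   try {
>> > > > > > > >     in.readFully(offset, buf, 0, buf.length);  // pread of just the wanted range
>> > > > > > > >   } finally {
>> > > > > > > >     in.close();                           // no idle DataNode connection left behind
>> > > > > > > >   }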
>> >
>> > > > > > > > Donald
>> >
>> > > > > > > > I've created 1000 files in an 11-node Hadoop cluster, each file 20MB.
>> > > > > > > > Then I wrote simple Java programs to run the following tests:
>> >
>> > > > > > > > Opening all 1000 files, one process: about 2.5s (2.5ms latency)
>> > > > > > > > Closing all 1000 files, one process: 50ms
>> > > > > > > > Opening all 1000 files, 10 processes (running distributed across the
>> > > > > > > > 10 DataNodes): 15s (15ms latency, or 700 opens/s)
>> > > > > > > > Reading the first 1KB of data from each file (plain read), one process:
>> > > > > > > > 6s (6ms latency)
>> > > > > > > > Reading the first 1KB of data from each file (positioned read), one
>> > > > > > > > process: 2.5s (2.5ms latency)
>> > > > > > > > Reading the first 100KB of data from each file, 1KB at a time
>> > > > > > > > (positioned read), one process: 130s (1.3ms latency, or 0.77MB/s)
>> > > > > > > > Reading the first 100KB of data from each file, 1KB at a time (plain
>> > > > > > > > read), one process: 8.8s (11MB/s)
>> >
>> > > > > > > > The tests were run multiple times to make sure all blocks were
>> > > > > > > > effectively cached in the Linux page cache. The Hadoop version was
>> > > > > > > > 0.19.0 with a few patches. io.file.buffer.size = 4096
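>> > > > > > > > The read tests are basically loops like the following (a simplified
>> > > > > > > > sketch of my test program, not the exact code; file names are made up):
>> >
>> > > > > > > >   // needs org.apache.hadoop.conf.Configuration and org.apache.hadoop.fs.*
>> > > > > > > >   Configuration conf = new Configuration();
>> > > > > > > >   FileSystem fs = FileSystem.get(conf);
>> > > > > > > >   byte[] buf = new byte[1024];
>> > > > > > > >   long start = System.currentTimeMillis();
>> > > > > > > >   for (int i = 0; i < 1000; i++) {
>> > > > > > > >     FSDataInputStream in = fs.open(new Path("/test/file-" + i));
>> > > > > > > >     for (long pos = 0; pos < 100 * 1024; pos += buf.length)
>> > > > > > > >       in.readFully(pos, buf, 0, buf.length);   // positioned-read variant
>> > > > > > > >     in.close();
>> > > > > > > >   }
>> > > > > > > >   System.out.println("elapsed ms: " + (System.currentTimeMillis() - start));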
>> >
>> > > > > > > > Donald
>> >
>> > > > > > > > On Feb 25, 8:59 pm, "Liu Kejia (Donald)" <[email protected]> wrote:
>> > > > > > > > > It turns out the hadoop-default.xml packaged in my custom
>> > > > > > > > > hadoop-0.19.0-core.jar has "io.file.buffer.size" set to 131072 (128KB),
>> > > > > > > > > which means DfsBroker has to allocate a 128KB buffer for every open
>> > > > > > > > > file. The official hadoop-0.19.0-core.jar sets this value to 4096, which
>> > > > > > > > > is more reasonable for applications like Hypertable.
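>> > > > > > > > > For anyone hitting the same thing, the override can also be made
>> > > > > > > > > programmatically where the broker creates its FileSystem (a sketch, not
>> > > > > > > > > what HdfsBroker actually does):
>> >
>> > > > > > > > >   Configuration conf = new Configuration();
>> > > > > > > > >   conf.setInt("io.file.buffer.size", 4096); // don't inherit 131072 from the jar
>> > > > > > > > >   FileSystem fs = FileSystem.get(conf);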
>> > > > > > > > > Donald
>> >
>> > > > > > > > > On Fri, Feb 20, 2009 at 11:55 AM, Liu Kejia (Donald) <[email protected]> wrote:
>> >
>> > > > > > > > > > Caching might not work very well because keys are randomly generated,
>> > > > > > > > > > resulting in bad locality...
>> > > > > > > > > > Even though it's Java, hundreds of kilobytes per file object is still
>> > > > > > > > > > very big. I'll profile HdfsBroker to see what exactly is using so much
>> > > > > > > > > > memory, and post the results later.
>> >
>> > > > > > > > > > Donald
>> >
>> > > > > > > > > > On Fri, Feb 20, 2009 at 11:20 AM, Doug Judd <[email protected]> wrote:
>> >
>> > > > > > > > > >> Hi Donald,
>> >
>> > > > > > > > > >> Interesting.  One possibility would be to have an open CellStore cache.
>> > > > > > > > > >> Frequently accessed CellStores would remain open, while seldom-used ones
>> > > > > > > > > >> get closed.  The effectiveness of this solution would depend on the
>> > > > > > > > > >> workload.  Do you think this might work for your use case?
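>> > > > > > > > > >> Something like an access-ordered LinkedHashMap would probably do as a
>> > > > > > > > > >> first cut (just a sketch of the idea; the capacity and types are made
>> > > > > > > > > >> up, not actual Hypertable code):
>> >
>> > > > > > > > > >>   // needs java.util.*, java.io.IOException, org.apache.hadoop.fs.FSDataInputStream
>> > > > > > > > > >>   Map<String, FSDataInputStream> openCache =
>> > > > > > > > > >>     new LinkedHashMap<String, FSDataInputStream>(1024, 0.75f, true) {
>> > > > > > > > > >>       protected boolean removeEldestEntry(Map.Entry<String, FSDataInputStream> e) {
>> > > > > > > > > >>         if (size() <= 1000)                 // keep at most ~1000 CellStores open
>> > > > > > > > > >>           return false;
>> > > > > > > > > >>         try { e.getValue().close(); } catch (IOException ignored) {}
>> > > > > > > > > >>         return true;                        // evict (and thereby close) the LRU entry
>> > > > > > > > > >>       }
>> > > > > > > > > >>     };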
>> >
>> > > > > > > > > >> - Doug
>> >
>> > > > > > > > > >> On Thu, Feb 19, 2009 at 7:09 PM, donald <[email protected]> wrote:
>> >
>> > > > > > > > > >>> Hi all,
>> >
>> > > > > > > > > >>> I recently ran into a problem where HdfsBroker throws an out-of-memory
>> > > > > > > > > >>> exception because too many CellStore files in HDFS are kept open - I
>> > > > > > > > > >>> have over 600 ranges per range server, with a maximum of 10 cell stores
>> > > > > > > > > >>> per range; that's 6,000 open files at the same time, making HdfsBroker
>> > > > > > > > > >>> take gigabytes of memory.
>> >
>> > > > > > > > > >>> If we open the CellStore file on demand, i.e. when a scanner is created
>> > > > > > > > > >>> on it, this problem is gone. However, random-read performance may drop
>> > > > > > > > > >>> due to the overhead of opening a file in HDFS. Any better
>> >
>> > ...
>> >
>>
>
>
> >
>


Attachment: java.tgz
Description: GNU Zip compressed data
