[jira] Updated: (HBASE-2180) read performance from synchronizing hfile.fddatainputstream

stack (JIRA) Thu, 04 Feb 2010 23:44:52 -0800

     [ 
https://issues.apache.org/jira/browse/HBASE-2180?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


stack updated HBASE-2180:
-------------------------

    Attachment: 2180-v2.patch

This patch fixes the tests so that they use the new getScanner method, and 
includes a small PE fix for when --rows is small (we would NPE).  I might need 
a v3.  A test is failing (TestGetDeleteTracker).  Need to investigate.

In testing on something that tries to resemble the Yahoo paper's testing -- ~20M 
rows per server, 116 regions on a RS and only one replica -- this patch seems 
to double the throughput with ~20 concurrent clients on a RS.  I also tested 
scans: with this patch in place, scan speeds are what they were before.  They 
have not deteriorated.

One thing I noticed was that when scanning data that is not local -- i.e. the 
data is in a DN on another machine -- there is definitely added latency: the 
test takes maybe 25% longer to complete.  I need to see if the same is true of 
random reads.  Cosmin suggested that the Yahoo test, with its single replica, 
might be doing lots of remote access and so could be incurring the extra 
latency.

> read performance from synchronizing hfile.fddatainputstream
> -----------------------------------------------------------
>
>                 Key: HBASE-2180
>                 URL: https://issues.apache.org/jira/browse/HBASE-2180
>             Project: Hadoop HBase
>          Issue Type: Bug
>            Reporter: ryan rawson
>            Assignee: ryan rawson
>             Fix For: 0.21.0
>
>         Attachments: 2180-v2.patch, 2180.patch
>
>
> deep in the HFile read path, there is this code:
>     synchronized (in) {
>       in.seek(pos);
>       ret = in.read(b, off, n);
>     }
> this makes it so that only 1 read per file per thread is active. this 
> prevents the OS and hardware from being able to do IO scheduling by 
> optimizing lots of concurrent reads. 
> We need to either use a reentrant API (pread may be partially reentrant 
> according to Todd) or use multiple stream objects, 1 per scanner/thread.
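
To make the two options in the description concrete, here is a minimal sketch 
(not the code in the attached patch) that contrasts the synchronized seek+read 
pattern with a positional read.  It uses plain java.io/java.nio as a stand-in 
for the HDFS stream, which is an assumption made so the example runs without 
Hadoop; the class and method names are hypothetical.  Hadoop's 
FSDataInputStream offers a comparable positional call, 
read(long position, byte[] buffer, int offset, int length).

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical illustration class; not part of HBase or of the patch.
class PositionalReadSketch {

  // Pattern quoted in the issue: seek and read share the stream's single
  // file position, so the pair must hold a lock and readers are serialized.
  static int seekRead(RandomAccessFile in, long pos, byte[] b, int off, int n)
      throws IOException {
    synchronized (in) {
      in.seek(pos);
      return in.read(b, off, n);
    }
  }

  // Positional read: the offset travels with the call and the channel's own
  // position is left untouched, so no lock around a seek/read pair is needed.
  static int positionalRead(FileChannel ch, long pos, byte[] b, int off, int n)
      throws IOException {
    return ch.read(ByteBuffer.wrap(b, off, n), pos);
  }

  public static void main(String[] args) throws Exception {
    Path tmp = Files.createTempFile("pread-sketch", ".bin");
    try {
      Files.write(tmp, "0123456789abcdef".getBytes("US-ASCII"));
      try (RandomAccessFile raf = new RandomAccessFile(tmp.toFile(), "r");
           FileChannel ch = FileChannel.open(tmp, StandardOpenOption.READ)) {
        byte[] viaSeek = new byte[4];
        byte[] viaPread = new byte[4];
        seekRead(raf, 10, viaSeek, 0, 4);
        positionalRead(ch, 10, viaPread, 0, 4);
        // Both calls read the same four bytes starting at offset 10.
        System.out.println(new String(viaSeek, "US-ASCII") + " "
            + new String(viaPread, "US-ASCII"));
      }
    } finally {
      Files.deleteIfExists(tmp);
    }
  }
}
```

Either call may return fewer bytes than requested, so real code would loop 
until the buffer is full.  Whether the positional path is actually faster 
depends on the underlying stream implementation, which is what the testing 
above is trying to establish.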

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
