[jira] [Commented] (HBASE-5979) Non-pread DFSInputStreams should be associated with scanners, not HFile.Readers

Todd Lipcon (JIRA) Tue, 15 May 2012 10:30:32 -0700

    [ https://issues.apache.org/jira/browse/HBASE-5979?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13276018#comment-13276018 ]


Todd Lipcon commented on HBASE-5979:
------------------------------------

Kannan: I tried to work on this a bit in my spare time, but didn't get very 
far. So if FB folks have cycles to work on it, that would be awesome!

I think one route is to do as I suggested above and have each 
StoreFileScanner hold its own DFSInputStream. Another option would be to make 
a wrapper FileSystem (or FSReader) which pools a few streams. Then change the 
scanners to always issue positional reads, and have the wrapper code look for 
a pooled stream which is already positioned at the requested offset (or just 
before it). The advantage of this technique is that we'd get the same 
sequential read benefit even if the user was issuing normal get() calls in 
ascending row order.
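
To make the pooling idea concrete, here is a rough sketch of the selection 
logic only. It is not HBase or HDFS code: the class name, the Cursor type, and 
the maxSkip parameter are all made up for illustration, a byte array stands in 
for the file, and each "stream" is just a cursor. One big lock is used to keep 
the sketch short; a real wrapper would presumably lock per stream, otherwise 
it would reintroduce the contention this issue is about.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical sketch; none of these names exist in HBase or HDFS. */
public class PooledPositionalReader {
    /** Stand-in for one open stream: only tracks where it is positioned. */
    private static final class Cursor {
        long pos;       // offset at which the next sequential read would start
        long lastUsed;  // logical clock of the last use, for LRU reuse
    }

    private final byte[] data;  // stand-in for the file contents
    private final List<Cursor> cursors = new ArrayList<>();
    private final long maxSkip; // how far "just before" the offset is allowed
    private long clock = 0;
    private int seeks = 0;

    public PooledPositionalReader(byte[] data, int poolSize, long maxSkip) {
        if (poolSize < 1) {
            throw new IllegalArgumentException("poolSize must be >= 1");
        }
        this.data = data;
        this.maxSkip = maxSkip;
        for (int i = 0; i < poolSize; i++) {
            cursors.add(new Cursor());
        }
    }

    /** Callers always pass an explicit position, as with a positional read. */
    public synchronized byte[] read(long position, int length) {
        if (position < 0 || length < 0 || position + length > data.length) {
            throw new IllegalArgumentException("read out of range");
        }
        // Prefer the cursor at, or closest before, the requested position.
        Cursor chosen = null;
        for (Cursor c : cursors) {
            long gap = position - c.pos;
            if (gap >= 0 && gap <= maxSkip
                    && (chosen == null || gap < position - chosen.pos)) {
                chosen = c;
            }
        }
        if (chosen == null) {
            // Nothing is close enough: reposition the least recently used one.
            chosen = cursors.get(0);
            for (Cursor c : cursors) {
                if (c.lastUsed < chosen.lastUsed) {
                    chosen = c;
                }
            }
            seeks++;
        }
        byte[] out = Arrays.copyOfRange(data, (int) position,
                (int) position + length);
        chosen.pos = position + length;
        chosen.lastUsed = ++clock;
        return out;
    }

    /** Number of reads that could not reuse an already-positioned cursor. */
    public synchronized int seekCount() {
        return seeks;
    }
}
```

With a pool of two cursors, two interleaved ascending read sequences each keep 
hitting "their" cursor, so seekCount() only grows when a read jumps backwards 
or far ahead.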
                
> Non-pread DFSInputStreams should be associated with scanners, not 
> HFile.Readers
> -------------------------------------------------------------------------------
>
>                 Key: HBASE-5979
>                 URL: https://issues.apache.org/jira/browse/HBASE-5979
>             Project: HBase
>          Issue Type: Improvement
>          Components: performance, regionserver
>            Reporter: Todd Lipcon
>
> Currently, every HFile.Reader has a single DFSInputStream, which it uses to 
> service all gets and scans. For gets, we use the positional read API (aka 
> "pread") and for scans we use a synchronized block to seek, then read. The 
> advantage of pread is that it doesn't hold any locks, so multiple gets can 
> proceed at the same time. The advantage of seek+read for scans is that the 
> datanode starts to send the entire rest of the HDFS block, rather than just 
> the single hfile block necessary. So, in a single thread, pread is faster for 
> gets, and seek+read is faster for scans since you get a strong pipelining 
> effect.
> However, in a multi-threaded case where there are multiple scans (including 
> scans which are actually part of compactions), the seek+read strategy falls 
> apart, since only one scanner may be reading at a time. Additionally, a large 
> amount of wasted IO is generated on the datanode side, and we get none of the 
> earlier-mentioned advantages.
> In one test, I switched scans to always use pread, and saw a 5x improvement 
> in throughput of the YCSB scan-only workload, since it previously was 
> completely blocked by contention on the DFSIS lock.
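
For readers unfamiliar with the two read modes in the quoted description, the 
following is a local-file analogy using java.nio.FileChannel rather than 
DFSInputStream. The class and method names are made up for illustration, and 
it says nothing about datanode behaviour; it only shows that a positional read 
leaves the shared position alone (so no lock is needed), while seek+read 
mutates it and therefore has to be serialized.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.Arrays;

/** Hypothetical illustration; a local-file analogy, not HDFS code. */
public class ReadModes {
    /** Positional read: the channel's own position is never changed. */
    static byte[] pread(FileChannel ch, long position, int length)
            throws IOException {
        ByteBuffer buf = ByteBuffer.allocate(length);
        long at = position;
        while (buf.hasRemaining()) {
            int n = ch.read(buf, at);
            if (n < 0) {
                break; // end of file
            }
            at += n;
        }
        return Arrays.copyOf(buf.array(), buf.position());
    }

    /** Seek, then read: moves the shared position, so callers must take turns. */
    static byte[] seekAndRead(FileChannel ch, long position, int length)
            throws IOException {
        ByteBuffer buf = ByteBuffer.allocate(length);
        synchronized (ch) {
            ch.position(position);
            while (buf.hasRemaining()) {
                if (ch.read(buf) < 0) {
                    break; // end of file
                }
            }
        }
        return Arrays.copyOf(buf.array(), buf.position());
    }

    /** Returns {first pread byte, position after pread,
     *           first seek+read byte, position after seek+read}. */
    static long[] demo() {
        try {
            Path tmp = Files.createTempFile("readmodes", ".bin");
            try {
                byte[] data = new byte[100];
                for (int i = 0; i < data.length; i++) {
                    data[i] = (byte) i;
                }
                Files.write(tmp, data);
                try (FileChannel ch =
                        FileChannel.open(tmp, StandardOpenOption.READ)) {
                    byte[] a = pread(ch, 10, 5);
                    long afterPread = ch.position();
                    byte[] b = seekAndRead(ch, 20, 5);
                    long afterSeekRead = ch.position();
                    return new long[] {a[0], afterPread, b[0], afterSeekRead};
                }
            } finally {
                Files.deleteIfExists(tmp);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```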

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira
