[ 
https://issues.apache.org/jira/browse/HBASE-8691?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13708855#comment-13708855
 ] 

Sandy Pratt commented on HBASE-8691:
------------------------------------

Since my last comment, I've worked out some issues around integrating the 
streaming scan API with Hive, and I've pushed it out on an experimental basis 
to a production cluster for testing.  End-to-end, in a table scan situation, 
the streaming scan API turns out to be about 45% faster than the RPC scan API 
(a full table scan of my dataset took about 31 minutes with the streaming API 
versus about 45 minutes with the RPC API).

Some of the tweaking I had to do to get to that point:

- Refactored the streaming client scanner to conform to the standard 
AbstractClientScanner API (I had used an event-driven approach previously, 
where clients registered an "onMessage" interface and hit go)
- Created a TableInputFormat/RecordReader/etc. that leverages the new API
- Profiled my custom SerDe to iron out some surprising hotspots
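
To illustrate the first item, here is a minimal sketch of how an event-driven 
("onMessage") source can be wrapped behind a pull-style iterator by placing a 
bounded queue between the two. It is not the attached code and does not use 
the HBase AbstractClientScanner class; PushToPullAdapter, onComplete, and the 
END marker are hypothetical names chosen for the example.

```java
import java.util.Iterator;
import java.util.NoSuchElementException;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

/** Hypothetical adapter: turns a push-style source into a pull-style iterator. */
final class PushToPullAdapter<T> implements Iterator<T> {
    private static final Object END = new Object(); // end-of-stream marker
    private final BlockingQueue<Object> queue;
    private Object next; // look-ahead slot; null means "not fetched yet"

    PushToPullAdapter(int capacity) {
        this.queue = new ArrayBlockingQueue<>(capacity);
    }

    /** Called by the producer thread for each message (the old callback). */
    void onMessage(T item) throws InterruptedException {
        queue.put(item); // blocks while the buffer is full
    }

    /** Called by the producer thread once the stream is exhausted. */
    void onComplete() throws InterruptedException {
        queue.put(END);
    }

    @Override
    public boolean hasNext() {
        if (next == null) {
            try {
                next = queue.take(); // blocks until the producer delivers
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                next = END; // treat interruption as end of stream
            }
        }
        return next != END;
    }

    @Override
    @SuppressWarnings("unchecked")
    public T next() {
        if (!hasNext()) {
            throw new NoSuchElementException();
        }
        T item = (T) next;
        next = null;
        return item;
    }
}
```

The queue capacity is the tuning knob in this sketch: a larger buffer gives 
the producer more room to run ahead of a slow consumer, at the cost of memory.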

As noted earlier, it looks like performance when streaming from the RS is highly 
dependent on keeping the pipe saturated.  Too much latency in any particular 
spot will cause bubbles, which kill performance.  As written, my SerDe was 
wasting too many cycles doing date formatting and initializing HashMaps (they 
seem to make a system call to srand).
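
One common way to remove that kind of per-row overhead is to build the 
formatter once and reuse a single map across rows. The sketch below shows the 
pattern only; it is not my actual SerDe, and ReusingRowDecoder, its field 
names, and the timestamp pattern are all made up for the example.

```java
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical row decoder that reuses its expensive objects across rows. */
final class ReusingRowDecoder {
    // Built once; DateTimeFormatter is immutable and thread-safe.
    private static final DateTimeFormatter TS_FORMAT =
        DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss").withZone(ZoneOffset.UTC);

    // Allocated once per decoder and cleared per row, not re-created per row.
    private final Map<String, String> fields = new HashMap<>();

    /** Returns a map that is only valid until the next call to decode(). */
    Map<String, String> decode(String rowKey, long epochMillis) {
        fields.clear();
        fields.put("key", rowKey);
        fields.put("ts", TS_FORMAT.format(Instant.ofEpochMilli(epochMillis)));
        return fields;
    }
}
```

The trade-off is that the returned map is overwritten by the next call, so a 
caller that needs to keep a row has to copy it first.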

Once I ironed those issues out, I did a comparison between the streaming scan 
API and the RPC scan API, both using the newly optimized SerDe, which is where 
I found the 45% performance improvement.  If it's true that latency is key to 
performance here, that delta might go up with more modern CPUs (I have Xeon 
5450s currently) as the overhead of Hive and the SerDe decreases relative to 
network speed.


                
> High-Throughput Streaming Scan API
> ----------------------------------
>
>                 Key: HBASE-8691
>                 URL: https://issues.apache.org/jira/browse/HBASE-8691
>             Project: HBase
>          Issue Type: Improvement
>          Components: Scanners
>    Affects Versions: 0.95.0
>            Reporter: Sandy Pratt
>              Labels: perfomance, scan
>         Attachments: HRegionServlet.java, README.txt, RecordReceiver.java, 
> ScannerTest.java, StreamHRegionServer.java, StreamReceiverDirect.java, 
> StreamServletDirect.java
>
>
> I've done some work testing various ways to refactor and optimize Scans in 
> HBase, and have found that performance can be dramatically increased by the 
> addition of a streaming scan API.  The attached code constitutes a proof of 
> concept that shows performance increases of almost 4x in some workloads.
> I'd appreciate testing, replication, and comments.  If the approach seems 
> viable, I think such an API should be built into some future version of HBase.

