[ https://issues.apache.org/jira/browse/HBASE-8691?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13708855#comment-13708855 ]
Sandy Pratt commented on HBASE-8691:
------------------------------------
Since my last comment, I've worked out some issues around integrating the
streaming scan API with Hive, and I've pushed it out on an experimental basis
to a production cluster for testing. End-to-end, in a table scan situation,
the streaming scan API turns out to be about 45% faster than the RPC scan API
(a full table scan of my dataset took about 31 minutes with the streaming API
versus about 45 minutes with the RPC API).
Some of the tweaking I had to do to get to that point:
- Refactored the streaming client scanner to conform to the standard
AbstractClientScanner API (I had used an event-driven approach previously,
where clients registered an "onMessage" interface and hit go)
- Created a TableInputFormat/RecordReader/etc. that leverages the new API
- Profiled my custom SerDe to iron out some surprising hotspots
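To illustrate the first item, here is a minimal sketch of bridging a push-style
"onMessage" callback to a pull-style next() call of the kind a client scanner
exposes. This is not the code from the attached patch: MessageListener,
PullAdapter, and the queue capacity are hypothetical names and values chosen
for the example, and a bounded java.util.concurrent queue is just one way to
do the hand-off.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Hypothetical callback interface standing in for the event-driven style.
interface MessageListener<T> {
    void onMessage(T message);
    void onComplete();
}

// Adapts pushed messages to a blocking, pull-style next().
// Messages must be non-null; null is reserved to signal end of stream.
final class PullAdapter<T> implements MessageListener<T> {
    private static final Object END = new Object(); // end-of-stream marker
    private final BlockingQueue<Object> queue;
    private boolean done = false;

    PullAdapter(int capacity) {
        // A bounded queue applies back-pressure to the producer when full.
        this.queue = new ArrayBlockingQueue<>(capacity);
    }

    @Override
    public void onMessage(T message) {
        put(message);
    }

    @Override
    public void onComplete() {
        put(END);
    }

    private void put(Object o) {
        try {
            queue.put(o); // blocks while the queue is full
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }

    // Returns the next message, or null once the stream has completed.
    @SuppressWarnings("unchecked")
    T next() {
        if (done) {
            return null;
        }
        try {
            Object o = queue.take(); // blocks while the queue is empty
            if (o == END) {
                done = true;
                return null;
            }
            return (T) o;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }
}

class PullAdapterDemo {
    // Pushes three rows from a producer thread and pulls them on the caller's thread.
    static List<String> drain() {
        PullAdapter<String> adapter = new PullAdapter<>(2);
        Thread producer = new Thread(() -> {
            for (String row : new String[] {"row1", "row2", "row3"}) {
                adapter.onMessage(row);
            }
            adapter.onComplete();
        });
        producer.start();
        List<String> rows = new ArrayList<>();
        for (String r = adapter.next(); r != null; r = adapter.next()) {
            rows.add(r);
        }
        return rows;
    }

    public static void main(String[] args) {
        System.out.println(drain());
    }
}
```

The consumer blocks only when the queue is empty and the producer only when it
is full, so a larger capacity trades memory for fewer stalls on either side.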
As noted earlier, it looks like performance when streaming from the RS is highly
dependent on keeping the pipe saturated. Too much latency in any particular
spot will cause bubbles, which kill performance. As written, my SerDe was
wasting too many cycles doing date formatting and initializing HashMaps (they
seem to make a system call to srand).
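As a rough illustration of the kind of fix involved (not my actual SerDe), the
sketch below hoists the date formatter into a shared constant and reuses one
HashMap across rows instead of allocating both per record. The class name,
field names, and date pattern are made up for the example, and it uses
java.time purely for brevity.

```java
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;
import java.util.HashMap;
import java.util.Map;

// Hypothetical per-row deserializer that avoids per-record allocation.
final class ReusingDeserializer {
    // DateTimeFormatter is immutable and thread-safe, so one instance can be shared.
    private static final DateTimeFormatter FMT =
        DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss").withZone(ZoneOffset.UTC);

    // Reused for every row; callers must copy it if they need to keep a row.
    private final Map<String, String> row = new HashMap<>();

    Map<String, String> deserialize(String key, long epochMillis) {
        row.clear(); // empties the map without allocating a new one
        row.put("key", key);
        row.put("ts", FMT.format(Instant.ofEpochMilli(epochMillis)));
        return row;
    }
}
```

The caveat with this pattern is that the returned map is overwritten by the
next call, so it only suits consumers that finish with each row before asking
for another.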
Once I ironed those issues out, I did a comparison between the streaming scan
API and the RPC scan API, both using the newly optimized SerDe, which is where
I found the 45% performance improvement. If it's true that latency is key to
performance here, that delta might go up with more modern CPUs (I have Xeon
5450s currently) as the overhead of Hive and the SerDe decreases relative to
network speed.
> High-Throughput Streaming Scan API
> ----------------------------------
>
> Key: HBASE-8691
> URL: https://issues.apache.org/jira/browse/HBASE-8691
> Project: HBase
> Issue Type: Improvement
> Components: Scanners
> Affects Versions: 0.95.0
> Reporter: Sandy Pratt
> Labels: perfomance, scan
> Attachments: HRegionServlet.java, README.txt, RecordReceiver.java,
> ScannerTest.java, StreamHRegionServer.java, StreamReceiverDirect.java,
> StreamServletDirect.java
>
>
> I've done some work testing various ways to refactor and optimize Scans in
> HBase, and have found that performance can be dramatically increased by the
> addition of a streaming scan API. The attached code constitutes a proof of
> concept that shows performance increases of almost 4x in some workloads.
> I'd appreciate testing, replication, and comments. If the approach seems
> viable, I think such an API should be built into some future version of HBase.