[ 
https://issues.apache.org/jira/browse/HBASE-13071?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14353865#comment-14353865
 ] 

Eshcar Hillel commented on HBASE-13071:
---------------------------------------

A new patch is attached following the comments by [~jonathan.lawlor] and 
[~stack].

Some notes on implementation and design:
  * The default is now the async scanner. (Note that this means the async
scanner is now exercised by multiple tests that used to run a sync scan.)
  * The responsibility for invoking super.close() is now shifted to the
pending prefetch thread, so the call is not missed.
  * In the case of the sync scanner, the caching parameter indicates both the
size of the buffer and the chunk size (#rows fetched per RPC). In the case of
the async scanner, the parameter only indicates the latter, while the buffer
size is doubled. This should now be clear from the documentation, as well as
from the new methods getCacheCapacity() and getThresholdSize().
  * cache and caching were members of ClientScanner even before this patch. I
only added the abstract initCache() method. I agree that having two abstract
classes is not the cleanest solution, but neither is having initCache() in a
class where not all subclasses have a cache. As I said before, this hierarchy
could benefit from some refactoring (the right design might use composition,
as in the strategy pattern, instead of inheritance), but these decisions
should be outside the scope of the current Jira.
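
As a rough illustration of the caching semantics described above, the
following sketch computes the chunk size and the buffer capacity for both
scanner modes. The class ScanCacheSizing and its methods are hypothetical
names chosen for this example and are not the code in the patch; only the
rule itself (same chunk size, doubled buffer for the async scanner) is taken
from the notes above.

```java
// Hypothetical helper for illustration only; not part of the HBase client.
final class ScanCacheSizing {
    private ScanCacheSizing() {}

    /** Rows requested per RPC: equal to the caching parameter in both modes. */
    static int chunkSize(int caching) {
        requirePositive(caching);
        return caching;
    }

    /** Rows the client-side buffer can hold. */
    static int cacheCapacity(int caching, boolean async) {
        requirePositive(caching);
        // Sync scanner: the buffer holds exactly one chunk.
        // Async scanner: the buffer is doubled, leaving room for a
        // prefetched chunk while the application consumes the current one.
        return async ? Math.multiplyExact(2, caching) : caching;
    }

    private static void requirePositive(int caching) {
        if (caching <= 0) {
            throw new IllegalArgumentException("caching must be positive");
        }
    }
}
```

With caching set to 100, both modes fetch 100 rows per RPC, while the buffer
holds 100 rows for the sync scanner and 200 rows for the async scanner.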

Some notes on performance:
  * This is a client-side feature and should therefore be tested in terms of
client-side latency.
  * This feature should reduce latency, and in the worst-case scenario should
not increase it (at least not significantly).
  * On the server side I would expect the same behavior as with the sync
scanner, since the same RPC calls are invoked; they are only shifted earlier
in time so that the data is ready at the client side before the user needs it.
  * I cannot explain the behavior of the low humps in your test. Do you see 
this consistently? What is the exact setting? Is it a fixed number of scans or 
a fixed time?  
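
To make the point about RPCs shifting earlier in time concrete, here is a
minimal producer-consumer sketch. It is not the implementation in the patch:
PrefetchingScanSketch, RowSource and scanAll are hypothetical names, RowSource
merely stands in for the RPC to the regionserver, and a plain bounded queue is
assumed in place of the real cache and its threshold logic. Error handling in
the prefetch thread is omitted.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Hypothetical sketch; names and structure are assumptions, not HBase code.
final class PrefetchingScanSketch {
    private PrefetchingScanSketch() {}

    /** Stand-in for one scan RPC: returns up to `limit` rows from `start`. */
    interface RowSource {
        List<String> fetch(int start, int limit);
    }

    /** Consumes every row while a background thread fetches chunks ahead. */
    static List<String> scanAll(RowSource source, int caching) {
        if (caching <= 0) {
            throw new IllegalArgumentException("caching must be positive");
        }
        // Bounded queue with twice the chunk size: the producer blocks only
        // when the buffer is full, the consumer only when it is empty.
        BlockingQueue<Optional<String>> cache =
                new LinkedBlockingQueue<>(2 * caching);

        Thread prefetcher = new Thread(() -> {
            try {
                int next = 0;
                while (true) {
                    List<String> chunk = source.fetch(next, caching);
                    for (String row : chunk) {
                        cache.put(Optional.of(row));
                    }
                    next += chunk.size();
                    if (chunk.size() < caching) {
                        break; // a short chunk means the scan is exhausted
                    }
                }
                cache.put(Optional.empty()); // end-of-scan marker
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        prefetcher.setDaemon(true);
        prefetcher.start();

        List<String> consumed = new ArrayList<>();
        try {
            while (true) {
                Optional<String> row = cache.take();
                if (!row.isPresent()) {
                    break;
                }
                consumed.add(row.get()); // application processing goes here
            }
            prefetcher.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("scan interrupted", e);
        }
        return consumed;
    }
}
```

The same fetch calls are issued in the same order as in a blocking scan; the
only difference is that a fetch can run while the application is still
processing rows from the previous chunk.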

> Hbase Streaming Scan Feature
> ----------------------------
>
>                 Key: HBASE-13071
>                 URL: https://issues.apache.org/jira/browse/HBASE-13071
>             Project: HBase
>          Issue Type: New Feature
>    Affects Versions: 0.98.11
>            Reporter: Eshcar Hillel
>         Attachments: 99.eshcar.png, HBASE-13071_98_1.patch, 
> HBASE-13071_trunk_1.patch, HBASE-13071_trunk_2.patch, 
> HBASE-13071_trunk_3.patch, HBASE-13071_trunk_4.patch, 
> HBASE-13071_trunk_5.patch, HBaseStreamingScanDesign.pdf, 
> HbaseStreamingScanEvaluation.pdf, gc.eshcar.png, hits.eshcar.png, network.png
>
>
> A scan operation iterates over all rows of a table or a subrange of the 
> table. The synchronous nature in which the data is served at the client side 
> limits the speed at which the application traverses the data: it increases 
> the overall processing time, and may cause great variance in the time the 
> application waits for the next piece of data.
> The scanner next() method at the client side invokes an RPC to the 
> regionserver and then stores the results in a cache. The application can 
> specify how many rows will be transmitted per RPC; by default this is set to 
> 100 rows. 
> The cache can be considered as a producer-consumer queue, where the hbase 
> client pushes the data to the queue and the application consumes it. 
> Currently this queue is synchronous, i.e., blocking. More specifically, when 
> the application has consumed all the data from the cache --- so the cache is 
> empty --- the hbase client retrieves additional data from the server and 
> refills the cache with new data. During this time the application is blocked.
> Under the assumption that the application processing time can be balanced by 
> the time it takes to retrieve the data, an asynchronous approach can reduce 
> the time the application is waiting for data.
> We attach a design document.
> We also have a patch that is based on a private branch, and some evaluation 
> results of this code.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
