[ 
https://issues.apache.org/jira/browse/HBASE-13071?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eshcar Hillel updated HBASE-13071:
----------------------------------
    Release Note: 
MOTIVATION

A pipelined scan API is introduced to speed up applications that combine 
massive data traversal with compute-intensive processing. Traditional HBase 
scans save network round trips by prefetching data into the client-side cache. 
However, they prefetch synchronously: the fetch request to the regionserver is 
issued only when the entire cache has been consumed. This leads to a 
stop-and-wait access pattern, in which the client stalls until the next chunk 
of data is fetched. Applications that do significant processing can benefit 
from background data prefetching, which eliminates this bottleneck. The 
pipelined scan implementation overlaps cache population at the client side 
with application processing. Specifically, it issues a new scan RPC once the 
iteration has consumed 50% of the cache. If the application processing time 
(that is, the time between invocations of next()) is substantial, the new 
chunk of data will be available before the previous one is exhausted, and the 
client will not experience any delay. Ideally, the prefetch and processing 
times should be balanced. 
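
The overlap described above can be sketched in plain Java. This is only an 
illustration of the "start the next fetch at the 50% mark" idea, not the HBase 
implementation: the class name HalfCachePrefetcher and the Supplier that 
stands in for a scan RPC are hypothetical, and an empty list is assumed to 
signal the end of the data.

```java
import java.util.ArrayDeque;
import java.util.List;
import java.util.Queue;
import java.util.concurrent.CompletableFuture;
import java.util.function.Supplier;

/** Hypothetical illustration only; this is not HBase's ClientAsyncPrefetchScanner. */
class HalfCachePrefetcher<T> {
    // Stands in for one scan RPC; assumed to return an empty list at end of data.
    private final Supplier<List<T>> fetchChunk;
    private final Queue<T> cache = new ArrayDeque<>();
    private CompletableFuture<List<T>> pending; // in-flight background fetch, if any
    private int chunkSize;                      // size of the chunk being consumed
    private int consumed;                       // items already taken from that chunk
    private boolean exhausted;

    HalfCachePrefetcher(Supplier<List<T>> fetchChunk) {
        this.fetchChunk = fetchChunk;
    }

    /** Returns the next item, or null once the source is exhausted. */
    T next() {
        if (cache.isEmpty()) {
            if (exhausted) {
                return null;
            }
            // Cache drained: wait for the background fetch if one was started,
            // otherwise fetch synchronously (this happens on the very first call).
            List<T> chunk = (pending != null) ? pending.join() : fetchChunk.get();
            pending = null;
            if (chunk.isEmpty()) {
                exhausted = true;
                return null;
            }
            cache.addAll(chunk);
            chunkSize = chunk.size();
            consumed = 0;
        }
        T item = cache.poll();
        consumed++;
        // Once half of the current chunk is consumed, fetch the next one in the
        // background so it can arrive while the caller processes the rest.
        if (pending == null && consumed * 2 >= chunkSize) {
            pending = CompletableFuture.supplyAsync(fetchChunk);
        }
        return item;
    }
}
```

At most one fetch is in flight at a time, so the supplier is never called 
concurrently. If the caller spends enough time between next() calls, the join 
returns immediately and no stall is observed.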

API AND CONFIGURATION

Asynchronous scanning can be configured either globally for all tables and 
scans, or on a per-scan basis via a new Scan class API. 

Configuration in hbase-site.xml: hbase.client.scanner.async.prefetch, default 
false:

 <property>
   <name>hbase.client.scanner.async.prefetch</name>
   <value>true</value>
 </property>

API - Scan#setAsyncPrefetch(boolean)

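      // BIG_SIZE is a placeholder for an application-chosen byte limit.
      Scan scan = new Scan();
      scan.setCaching(1000);            // rows fetched per scan RPC
      scan.setMaxResultSize(BIG_SIZE);  // upper bound, in bytes, per fetch
      scan.setAsyncPrefetch(true);      // enable pipelined prefetch
        ...
      ResultScanner scanner = table.getScanner(scan);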

IMPLEMENTATION NOTES

Pipelined scan is implemented by a new ClientAsyncPrefetchScanner class, which 
is fully API-compatible with the synchronous ClientSimpleScanner. 
ClientAsyncPrefetchScanner is not instantiated for small (Scan#setSmall) or 
reversed (Scan#setReversed) scanners. The application is responsible for 
setting the prefetch size so that the prefetch and processing times are 
balanced. Note that, due to double buffering, the client-side cache can use 
twice as much memory as that of the synchronous scanner.
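
As a rough aid for the sizing advice above, the two constraints can be written 
as back-of-the-envelope arithmetic. The helper below is a sketch under stated 
assumptions, not part of HBase: the class and method names are hypothetical, 
and it assumes a fetch takes a fixed time regardless of chunk size. Since the 
next fetch starts at the 50% mark, it avoids a stall when processing the 
remaining half takes at least as long as the fetch.

```java
/** Hypothetical sizing helper; the fixed-latency model is an assumption. */
final class PrefetchSizing {
    private PrefetchSizing() {}

    /**
     * Smallest chunk size (rows) for which a fetch started at the 50% mark
     * completes before the remaining half is consumed:
     * (caching / 2) * perRowMillis >= fetchMillis.
     */
    static int minCachingWithoutStall(double fetchMillis, double perRowMillis) {
        if (fetchMillis < 0 || perRowMillis <= 0) {
            throw new IllegalArgumentException("times must be positive");
        }
        return (int) Math.max(1, Math.ceil(2.0 * fetchMillis / perRowMillis));
    }

    /** Worst-case client cache size: two chunks resident at once. */
    static long worstCaseCacheBytes(long maxResultSizeBytes) {
        return Math.multiplyExact(2L, maxResultSizeBytes);
    }
}
```

For example, with an assumed 50 ms fetch and 1 ms of processing per row, the 
estimate is 100 rows per chunk. Real fetch times grow with chunk size, so 
treat the result as a starting point for measurement.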

In general, this feature puts more load on the server: a higher fetch rate is 
the whole point. Actual gains depend on the workload.

  was:
MOTIVATION

A pipelined scan API is introduced for speeding up applications that combine 
massive data traversal with compute-intensive processing. Traditional HBase 
scans save network trips through prefetching the data to the client side cache. 
However, they prefetch synchronously: the fetch request to regionserver is 
invoked only when the entire cache is consumed. This leads to a stop-and-wait 
access pattern, in which the client stalls until the next chunk of data is 
fetched. Applications that do significant processing can benefit from 
background data prefetching, which eliminates this bottleneck. The pipelined 
scan implementation overlaps the cache population at the client side with 
application processing. Namely, it issues a new scan RPC when the iteration 
retrieves 50% of the cache. If the application processing (that is, the time 
between invocations of next()) is substantial, the new chunk of data will be 
available before the previous one is exhausted, and the client will not 
experience any delay. Ideally, the prefetch and the processing times should be 
balanced. 

API AND CONFIGURATION

Asynchronous scanning can be configured either globally for all tables and 
scans, or on per-scan basis via a new Scan class API. 

Configuration in hbase-site.xml: hbase.client.scanner.async.prefetch, default 
false:

 <property>
   <name>hbase.client.scanner.async.prefetch</name>
   <value>true</value>
 </property>

API - Scan#setAsyncPrefetch(boolean)

      Scan scan = new Scan();
      scan.setCaching(1000);
      scan.getMaxResultSize(BIG_SIZE);
      scan.setAsyncPrefetch(true);
        ...
      ResultScanner scanner = table.getScanner(scan);

IMPLEMENTATION NOTES

Pipelined scan is implemented by a new ClientAsyncPrefetchScanner class, which 
is fully API-compatible with the synchronous ClientSimpleScanner. 
ClientAsyncPrefetchScanner is not instantiated in case of small (Scan#setSmall) 
and reversed (Scan#setReversed) scanners. The application is responsible for 
setting the prefetch size in a way that the prefetch time and the processing 
times are balanced. Note that due to double buffering, the client side cache 
can use twice as much memory as the synchronous scanner.

Generally, this feature will put more load on the server (higher fetch rate -- 
which is the whole point).  Also, YMMV.


> Hbase Streaming Scan Feature
> ----------------------------
>
>                 Key: HBASE-13071
>                 URL: https://issues.apache.org/jira/browse/HBASE-13071
>             Project: HBase
>          Issue Type: New Feature
>            Reporter: Eshcar Hillel
>            Assignee: Eshcar Hillel
>             Fix For: 2.0.0
>
>         Attachments: 99.eshcar.png, HBASE-13071-0_98.patch, 
> HBASE-13071-BRANCH-1.patch, HBASE-13071-trunk-bug-fix.patch, 
> HBASE-13071_trunk_rebase_1.0.patch, HBASE-13071_trunk_rebase_2.0.patch, 
> HBaseStreamingScanDesign.pdf, HbaseStreamingScanEvaluation.pdf, 
> HbaseStreamingScanEvaluationwithMultipleClients.pdf, Releasenote-13071.txt, 
> gc.delay.png, gc.eshcar.png, gc.png, hits.delay.png, hits.eshcar.png, 
> hits.png, latency.delay.png, latency.png, network.png
>
>
> A scan operation iterates over all rows of a table or a subrange of the 
> table. The synchronous way in which data is served at the client side 
> limits the speed at which the application traverses the data: it increases 
> the overall processing time, and may cause great variance in the time the 
> application waits for the next piece of data.
> The scanner next() method at the client side invokes an RPC to the 
> regionserver and then stores the results in a cache. The application can 
> specify how many rows will be transmitted per RPC; by default this is set to 
> 100 rows. 
> The cache can be viewed as a producer-consumer queue, where the hbase 
> client pushes data to the queue and the application consumes it. 
> Currently this queue is synchronous, i.e., blocking. More specifically, when 
> the application has consumed all the data in the cache (so the cache is 
> empty), the hbase client retrieves additional data from the server and 
> refills the cache. During this time the application is blocked.
> Under the assumption that the application processing time can be balanced 
> against the time it takes to retrieve the data, an asynchronous approach can 
> reduce the time the application spends waiting for data.
> We attach a design document.
> We also have a patch that is based on a private branch, and some evaluation 
> results for this code.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
