[ 
https://issues.apache.org/jira/browse/HBASE-13071?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

stack updated HBASE-13071:
--------------------------
    Attachment: network.png
                gc.eshcar.png
                hits.eshcar.png
                99.eshcar.png

Here are the pictures.

The way to read them: the four humps on the left are from previous runs. 
The first hump is current branch-1.0. The second and third are branch-1.0 
with HBASE-13082. The fourth is branch-1.0 unadorned again. The two humps 
on the right-hand side are your patch on branch-1.0.

The tall humps are a test that has 5 clients on 5 different machines each 
running 10 clients against a single regionserver (the regionserver cannot go 
faster; it is at a ceiling whose nature is TBD).

The low humps are a single client with two threads.

The dataset is 100M rows of 10 columns whose sizes are zipfian-distributed 
between 0 and 8k. On average, a row is about 200k.

So, interesting observations:

+ We start out well. Network traffic is up and so are requests, but after a 
while both fall back to what they were before the patch. Do we hit a cadence 
where the rhythm of fetching is effectively what it was without the patch? 
What do you think? Should we prefetch more aggressively?
+ For the light test of two clients only, we run much slower than the 
unpatched version. What do you think is going on there?

This is patch v4. Scan gets rows in batches of 30.
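For reference, the mechanism under test (an asynchronously refilled 
producer-consumer result cache) can be sketched as a toy model. This is 
not the actual patch code; the class name, the integer "rows", and the 
queue depth are all made up for illustration:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Iterator;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

/**
 * Toy model (NOT the HBASE-13071 patch): a scanner whose background thread
 * prefetches batches of rows into a bounded queue, so the consumer only
 * blocks when prefetch falls behind, instead of refilling its cache
 * synchronously each time it runs empty.
 */
public class PrefetchScanner implements Iterator<Integer>, AutoCloseable {
    private static final List<Integer> POISON = new ArrayList<>(); // end-of-scan marker

    private final BlockingQueue<List<Integer>> cache;
    private final Thread fetcher;
    private Iterator<Integer> current = Collections.emptyIterator();
    private boolean done = false;

    public PrefetchScanner(int totalRows, int batchSize, int queueDepth) {
        this.cache = new ArrayBlockingQueue<>(queueDepth);
        // Producer: stands in for the RPCs a real client would issue.
        this.fetcher = new Thread(() -> {
            try {
                for (int row = 0; row < totalRows; row += batchSize) {
                    List<Integer> batch = new ArrayList<>();
                    for (int r = row; r < Math.min(row + batchSize, totalRows); r++) {
                        batch.add(r); // a real client would fetch this row over RPC
                    }
                    cache.put(batch); // blocks when queueDepth batches are waiting
                }
                cache.put(POISON);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        fetcher.start();
    }

    @Override public boolean hasNext() {
        if (current.hasNext()) return true;
        if (done) return false;
        try {
            List<Integer> next = cache.take(); // blocks only if prefetch fell behind
            if (next == POISON) { done = true; return false; }
            current = next.iterator();
            return hasNext();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    @Override public Integer next() { return current.next(); }

    @Override public void close() {
        try { fetcher.join(); } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    public static void main(String[] args) {
        int count = 0;
        try (PrefetchScanner scan = new PrefetchScanner(100, 30, 2)) {
            while (scan.hasNext()) { scan.next(); count++; }
        }
        System.out.println(count);
    }
}
```

The cadence question above maps onto the queue depth here: with a shallow 
queue, the consumer and producer can fall into lockstep and the behavior 
degenerates to the synchronous case.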

If you want me to try an instrumented version, no problem. I'll have this 
little rig for another few days.  Thanks.

> Hbase Streaming Scan Feature
> ----------------------------
>
>                 Key: HBASE-13071
>                 URL: https://issues.apache.org/jira/browse/HBASE-13071
>             Project: HBase
>          Issue Type: New Feature
>    Affects Versions: 0.98.11
>            Reporter: Eshcar Hillel
>         Attachments: 99.eshcar.png, HBASE-13071_98_1.patch, 
> HBASE-13071_trunk_1.patch, HBASE-13071_trunk_2.patch, 
> HBASE-13071_trunk_3.patch, HBASE-13071_trunk_4.patch, 
> HBaseStreamingScanDesign.pdf, HbaseStreamingScanEvaluation.pdf, 
> gc.eshcar.png, hits.eshcar.png, network.png
>
>
> A scan operation iterates over all rows of a table or a subrange of the 
> table. The synchronous manner in which data is served at the client side 
> limits the speed at which the application traverses the data: it increases 
> the overall processing time, and may cause great variance in the times the 
> application waits for the next piece of data.
> The scanner next() method at the client side invokes an RPC to the 
> regionserver and then stores the results in a cache. The application can 
> specify how many rows will be transmitted per RPC; by default this is set to 
> 100 rows. 
> The cache can be viewed as a producer-consumer queue, where the HBase 
> client pushes data into the queue and the application consumes it. 
> Currently this queue is synchronous, i.e., blocking. More specifically, 
> when the application has consumed all the data in the cache --- so the 
> cache is empty --- the HBase client retrieves additional data from the 
> server and refills the cache. During this time the application is blocked.
> Under the assumption that the application's processing time can be 
> balanced by the time it takes to retrieve the data, an asynchronous 
> approach can reduce the time the application spends waiting for data.
> We attach a design document.
> We also have a patch that is based on a private branch, and some evaluation 
> results of this code.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
