[ 
https://issues.apache.org/jira/browse/HBASE-14789?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14999566#comment-14999566
 ] 

Zhan Zhang commented on HBASE-14789:
------------------------------------

[~malaskat] Thanks for reviewing this. I agree that table write, JSON format, 
and customized serdes are all able to fit into the current implementation with 
some changes, and I would love to contribute to this part.

That said, there is no straightforward way to support multi-scan and BulkGet 
with TableInputFormat in the scenario mentioned in the document. The argument 
regarding BulkGet may be reasonable in most cases, but there is still a chance 
that the driver becomes a bottleneck. More importantly, the driver is actually 
processing data with the HBase client library; if any exception happens there, 
the whole job will crash.

Regarding the multiple scans, here is my understanding; correct me if I am 
wrong. The current implementation constructs an RDD for each non-overlapping 
scan, and all these RDDs are then unioned together. With the current 
TableInputFormat limitation, there is no easy way to work around this unless 
TableInputFormat is changed to handle multiple scans in one shot.

Now suppose we have 10 regions, but the user query consists of 100 
non-overlapping scans. Then 100 RDDs will be constructed and unioned together. 
The union RDD returned by buildScan will consist of at least 100 partitions, 
assuming each RDD has only one partition (in practice there is a high chance 
that each RDD consists of multiple partitions).
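A minimal sketch of this arithmetic (plain Python; the scan and region counts 
are the illustrative numbers above, and one-partition-per-scan is the best 
case, not a property of the actual connector):

```python
# A union RDD's partition count is the sum of its children's partition counts,
# so unioning one RDD per scan can never yield fewer partitions than scans.

def union_partitions(partitions_per_scan):
    """Partition count of the union RDD, given each scan RDD's partition count."""
    return sum(partitions_per_scan)

num_scans = 100  # non-overlapping scans in the user query (10 regions total)

# Best case: every scan RDD has exactly one partition.
best_case = union_partitions([1] * num_scans)
print(best_case)  # 100 partitions -> 100 tasks for only 10 region servers

# If some scans cross a region boundary, their RDDs have more partitions,
# so the union is even larger.
with_splits = union_partitions([1] * 90 + [2] * 10)
print(with_splits)  # 110
```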

100 partitions means that Spark has to launch 100 tasks to process the scan, 
while we only have 10 regions (to simplify the discussion, say 10 servers with 
10 executors co-located with the region servers, 1 core each). The scheduler 
cannot allocate 100 tasks to 10 servers at once. In this scenario, the 
executors that finish their assigned tasks earlier will get more tasks, which 
may retrieve data from other region servers (hurting data locality). In 
addition, the scheduler has to schedule and serialize 100 tasks, which 
increases the overhead.
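A rough cost model of that scheduling effect (plain Python; the fixed 
per-task overhead is an assumed illustrative number, not a measurement):

```python
import math

def scheduling_cost(num_tasks, num_slots, per_task_overhead_ms=5):
    """Illustrative model: tasks run in 'waves' of num_slots concurrent tasks,
    and every task pays a fixed schedule-and-serialize cost on the driver.
    The 5 ms per-task overhead is an assumption for illustration only."""
    waves = math.ceil(num_tasks / num_slots)
    overhead_ms = num_tasks * per_task_overhead_ms
    return waves, overhead_ms

print(scheduling_cost(100, 10))  # (10, 500): 10 waves, 500 ms of task overhead
print(scheduling_cost(10, 10))   # (1, 50):   1 wave,   50 ms of task overhead
```

With 100 tasks on 10 slots, every slot runs 10 waves of tasks, and only the 
first wave is guaranteed a choice of node-local placement.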

In the architecture proposed in this JIRA, the driver constructs 10 tasks (or 
fewer), each consisting of multiple scans. These 10 tasks can be scheduled to 
the 10 executors concurrently, which achieves better data locality and lower 
serde and scheduling overhead.
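To illustrate the grouping (a simplified sketch in plain Python; the region 
boundaries, key layout, and `group_scans_by_region` helper are assumptions for 
illustration, not the actual connector code, and each scan is assumed to fall 
entirely inside one region):

```python
import bisect
from collections import defaultdict

def group_scans_by_region(scan_start_keys, region_start_keys):
    """Bundle all scans that fall into the same region into one task, so the
    task count is bounded by the region count rather than the scan count.

    region_start_keys must be sorted. Returns {region_index: [scan keys]}.
    """
    tasks = defaultdict(list)
    for key in scan_start_keys:
        # The owning region is the one with the greatest start key <= the scan key.
        region = bisect.bisect_right(region_start_keys, key) - 1
        tasks[region].append(key)
    return tasks

regions = list(range(0, 1000, 100))  # 10 regions: [0, 100), [100, 200), ...
scans = list(range(5, 1000, 10))     # 100 non-overlapping single-region scans
tasks = group_scans_by_region(scans, regions)
print(len(tasks))      # 10 tasks, one per region that holds data
print(len(tasks[0]))   # 10 scans bundled into the first region's task
```

Each task then runs its bundle of scans against the co-located region server, 
so the 100-scan query needs only 10 node-local tasks instead of 100.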

In addition, in a multi-tenant environment the current approach may suffer 
more, because it has to construct many more tasks while the overall number of 
executor slots is limited.

Also, in this architecture Scan and Get are treated in a unified way, which 
seems more natural. I think that in real deployments the proposed architecture 
does have its advantages in many scenarios.



> Provide an alternative spark-hbase connector
> --------------------------------------------
>
>                 Key: HBASE-14789
>                 URL: https://issues.apache.org/jira/browse/HBASE-14789
>             Project: HBase
>          Issue Type: Improvement
>            Reporter: Zhan Zhang
>            Assignee: Zhan Zhang
>         Attachments: shc.pdf
>
>
> This JIRA is to provide user an option to choose different Spark-HBase 
> implementation based on requirements.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
