[
https://issues.apache.org/jira/browse/HBASE-14789?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14999566#comment-14999566
]
Zhan Zhang commented on HBASE-14789:
------------------------------------
[~malaskat] Thanks for reviewing this. I agree that table writes, JSON format,
and customized serdes can all fit into the current implementation with some
changes, and I would love to contribute to that part.
That said, there is no straightforward way to support multi-scan and BulkGet
with TableInputFormat in the scenario mentioned in the document. The argument
regarding BulkGet may be reasonable in most cases, but there are cases where
the driver becomes a bottleneck. More importantly, the driver is actually
processing data with the HBase client library, so if any exception happens,
the whole job will crash.
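To make that concrete, here is a rough sketch of the driver-side BulkGet
pattern I am referring to (the table and column names are just placeholders,
not part of any patch): every Get is issued from the driver JVM before the
data ever reaches an executor.
{code:scala}
import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.client.{ConnectionFactory, Get}
import org.apache.hadoop.hbase.util.Bytes
import org.apache.spark.SparkContext
import scala.collection.JavaConverters._

def driverSideBulkGet(sc: SparkContext, rowKeys: Seq[Array[Byte]]) = {
  // The HBase connection lives in the driver, so the driver does all the work.
  val connection = ConnectionFactory.createConnection(HBaseConfiguration.create())
  val table = connection.getTable(TableName.valueOf("my_table"))
  try {
    // All Gets are issued from the driver; any exception here fails the job.
    val results = table.get(rowKeys.map(new Get(_)).asJava)
    // Extract plain byte arrays before shipping the data out to executors.
    val rows = results.map(r =>
      (r.getRow, r.getValue(Bytes.toBytes("cf"), Bytes.toBytes("col"))))
    sc.parallelize(rows)
  } finally {
    table.close()
    connection.close()
  }
}
{code}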
Regarding multiple scans, here is my understanding; correct me if I am wrong.
The current implementation constructs an RDD for each non-overlapping scan,
and all these RDDs are then unioned together. With the current TableInputFormat
limitation, there is no easy way to work around this unless TableInputFormat is
changed to handle multiple scans in one shot.
Now suppose we have 10 regions, but the user query consists of 100
non-overlapping scans. Then 100 RDDs will be constructed and unioned together.
The union RDD returned by buildScan will consist of at least 100 partitions,
assuming each RDD has only one partition (in practice there is a high chance
that each RDD consists of multiple partitions).
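Schematically, the TableInputFormat-based path looks something like the
following sketch (the helper name and variables are illustrative, not the
actual buildScan code):
{code:scala}
import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.{Result, Scan}
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.{TableInputFormat, TableMapReduceUtil}
import org.apache.spark.SparkContext

def buildUnionScanRDD(sc: SparkContext, tableName: String, scans: Seq[Scan]) = {
  // One Hadoop RDD per non-overlapping Scan, because TableInputFormat only
  // accepts a single Scan serialized into the configuration.
  val perScanRDDs = scans.map { scan =>
    val conf = HBaseConfiguration.create()
    conf.set(TableInputFormat.INPUT_TABLE, tableName)
    conf.set(TableInputFormat.SCAN, TableMapReduceUtil.convertScanToString(scan))
    sc.newAPIHadoopRDD(conf, classOf[TableInputFormat],
      classOf[ImmutableBytesWritable], classOf[Result])
  }
  // The union of N single-scan RDDs has at least N partitions, e.g. >= 100
  // partitions for 100 scans even though the table only has 10 regions.
  sc.union(perScanRDDs)
}
{code}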
100 partitions means that Spark has to launch 100 tasks to process the scan.
Given that we only have 10 regions (10 region servers, to simplify the
discussion), the scheduler cannot allocate 100 tasks to 10 servers at once
(assume 10 executors co-located with the region servers, with 1 core each, to
keep the discussion simple). In this scenario, executors that finish their
assigned tasks earlier will pick up more tasks, which may retrieve data from
other region servers and hurt data locality. In addition, the scheduler has to
schedule and serialize 100 tasks, which increases the overhead.
In the architecture proposed in this JIRA, the driver constructs 10 tasks (or
fewer), each consisting of multiple scans. These 10 tasks can be scheduled to
the 10 executors concurrently, which gives better data locality and lower
serialization and scheduling overhead.
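A minimal sketch of that idea, with illustrative class and helper names rather
than the actual patch: one Spark partition per region, carrying all the scan
ranges that fall inside that region, so there is at most one task per region
server and the task is preferred to run on that server.
{code:scala}
import org.apache.spark.{Partition, SparkContext, TaskContext}
import org.apache.spark.rdd.RDD

case class ScanRange(startRow: Array[Byte], stopRow: Array[Byte])

case class RegionPartition(
    index: Int,
    regionHost: String,        // region server hosting this region
    ranges: Seq[ScanRange])    // non-overlapping ranges inside the region
  extends Partition

class MultiRangeScanRDD(
    sc: SparkContext,
    regionPartitions: Seq[RegionPartition])
  extends RDD[(Array[Byte], Array[Byte])](sc, Nil) {

  // One partition per region, regardless of how many ranges the query has.
  override protected def getPartitions: Array[Partition] =
    regionPartitions.toArray[Partition]

  // Co-locate each task with the region server that owns the region.
  override protected def getPreferredLocations(split: Partition): Seq[String] =
    Seq(split.asInstanceOf[RegionPartition].regionHost)

  override def compute(split: Partition, context: TaskContext)
      : Iterator[(Array[Byte], Array[Byte])] = {
    val part = split.asInstanceOf[RegionPartition]
    // Run all the ranges for this region inside a single task.
    part.ranges.iterator.flatMap(scanRange)
  }

  // Placeholder for the per-range scan; a real implementation would open an
  // HBase connection in the executor and use the client Scan API here.
  private def scanRange(range: ScanRange): Iterator[(Array[Byte], Array[Byte])] =
    Iterator.empty
}
{code}
In such a representation a Get is just a degenerate range whose start and stop
rows are the same key, which is how Scan and Get end up being handled uniformly.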
In addition, in a multi-tenant environment the current approach may suffer
more, because it has to construct many more tasks while the overall number of
executor slots is limited.
Also, in this architecture Scan and Get are treated in a unified way, which
seems more natural. I think that in real deployments the proposed architecture
does have advantages in many scenarios.
> Provide an alternative spark-hbase connector
> --------------------------------------------
>
> Key: HBASE-14789
> URL: https://issues.apache.org/jira/browse/HBASE-14789
> Project: HBase
> Issue Type: Improvement
> Reporter: Zhan Zhang
> Assignee: Zhan Zhang
> Attachments: shc.pdf
>
>
> This JIRA is to provide users an option to choose a different Spark-HBase
> implementation based on their requirements.