[ 
https://issues.apache.org/jira/browse/SPARK-15874?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Weichen Xu updated SPARK-15874:
-------------------------------
    Summary: HBase rowkey optimization support for Hbase-Storage-handler  (was: 
HBase rowkey optimization support for Hbase-handler)

> HBase rowkey optimization support for Hbase-Storage-handler
> -----------------------------------------------------------
>
>                 Key: SPARK-15874
>                 URL: https://issues.apache.org/jira/browse/SPARK-15874
>             Project: Spark
>          Issue Type: New Feature
>          Components: SQL
>            Reporter: Weichen Xu
>   Original Estimate: 720h
>  Remaining Estimate: 720h
>
> Currently, Spark SQL uses `org.apache.hadoop.hive.hbase.HBaseStorageHandler` 
> for HBase table support, and this handler is poorly optimized. For example, 
> a query such as
> `select * from hbase_tab1 where rowkey_col = 'abc';`
> causes a full table scan: each table region becomes a scan split, and every 
> split scans its whole region.
> In fact, the following optimizations are easy to implement:
> 1.
> SQL such as
> `select * from hbase_tab1 where rowkey_col = 'abc';`
> or
> `select * from hbase_tab1 where rowkey_col = 'abc' or rowkey_col = 'abd' or 
> ...;`
> can be executed efficiently with the HBase rowkey `Get`/`multiGet` API.
> 2.
> SQL such as
> `select * from hbase_tab1 where rowkey_col like 'abc%';`
> can be executed efficiently with the HBase rowkey `Scan` API.
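
The two rewrites above can be sketched in plain Python. This is only an illustration under stated assumptions: `SortedTable`, `prefix_stop_key`, and `prefix_scan` are hypothetical names, and a sorted in-memory structure stands in for an HBase table (no HBase client is involved). The point is that equality predicates on the rowkey map to point lookups, while a prefix predicate maps to a bounded range scan over the sorted keys.

```python
from bisect import bisect_left


class SortedTable:
    """Hypothetical stand-in for an HBase table: rows sorted by rowkey (bytes)."""

    def __init__(self, rows):
        self._keys = sorted(rows)   # sorted rowkeys, like HBase's key ordering
        self._rows = dict(rows)     # rowkey -> value

    def get(self, key):
        """Point lookup, analogous to a single Get."""
        return self._rows.get(key)

    def multi_get(self, keys):
        """Batch of point lookups, analogous to a multi-Get; missing keys are skipped."""
        return [(k, self._rows[k]) for k in keys if k in self._rows]

    def scan(self, start, stop):
        """Range scan over [start, stop); stop=None means scan to the end."""
        lo = bisect_left(self._keys, start)
        hi = len(self._keys) if stop is None else bisect_left(self._keys, stop)
        return [(k, self._rows[k]) for k in self._keys[lo:hi]]


def prefix_stop_key(prefix):
    """Smallest key greater than every key starting with prefix, or None if unbounded."""
    raw = bytearray(prefix)
    # Trailing 0xFF bytes cannot be incremented, so drop them first.
    while raw and raw[-1] == 0xFF:
        raw.pop()
    if not raw:
        return None
    raw[-1] += 1
    return bytes(raw)


def prefix_scan(table, prefix):
    """Serve a prefix predicate as one bounded range scan."""
    return table.scan(prefix, prefix_stop_key(prefix))
```

With this model, `rowkey_col = 'abc'` becomes `table.get(b"abc")`, an `or` chain of equalities becomes one `multi_get` call, and the prefix predicate becomes `prefix_scan(table, b"abc")`, which touches only the keys between `abc` and `abd` instead of the whole table.
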
> Higher-level SQL optimizations will benefit from this as well. For example, 
> given a very small table `small_tab1` (such as incremental data),
> SQL such as
> `select * from small_tab1 join hbase_tab1 on small_tab1.key1 = 
> hbase_tab1.rowkey_col`
> can use the classic small-table-driven join optimization:
> loop over each record of small_tab1, extract its small_tab1.key1 as the 
> rowkey of hbase_tab1, and fetch the matching row with the HBase `Get` API, 
> so the join executes efficiently.
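
A minimal sketch of the small-table-driven join described above, again in plain Python with no HBase client. The function name, the `multi_get` callable, the batch size, and the sample rows are all assumptions made for illustration; a dict stands in for `hbase_tab1`. Lookups are batched so that a real implementation could issue one multi-Get per batch rather than one request per row.

```python
def small_table_driven_join(small_rows, key_field, multi_get, batch_size=2):
    """Inner-join small_rows against a keyed store using batched point lookups.

    multi_get is a hypothetical callable: list of rowkeys -> dict of the
    rowkeys that were found, mapped to their rows.
    """
    joined = []
    for start in range(0, len(small_rows), batch_size):
        batch = small_rows[start:start + batch_size]
        found = multi_get([row[key_field] for row in batch])
        for row in batch:
            match = found.get(row[key_field])
            if match is not None:  # inner join: unmatched small rows are dropped
                joined.append((row, match))
    return joined


# Illustrative data: a dict stands in for hbase_tab1, keyed by rowkey.
hbase_tab1 = {"u1": {"name": "ann"}, "u2": {"name": "bob"}, "u9": {"name": "zed"}}
lookups = []  # records which keys each batch requested


def fake_multi_get(keys):
    lookups.append(list(keys))
    return {k: hbase_tab1[k] for k in keys if k in hbase_tab1}


small_tab1 = [
    {"key1": "u2", "delta": 5},
    {"key1": "u7", "delta": 1},
    {"key1": "u1", "delta": 3},
]
result = small_table_driven_join(small_tab1, "key1", fake_multi_get)
```

Only the keys present in `small_tab1` are ever requested, so the cost grows with the size of the small table rather than with the size of `hbase_tab1`.
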
> The scenario described above is very common: many business systems have 
> several tables with a primary key such as userID, and they often store 
> these tables in HBase. People frequently need to analyze this data with 
> SQL, and such queries can be optimized well if the SQL execution plan has 
> good support for HBase rowkeys.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
