[jira] [Commented] (HIVE-20720) Add partition column option to JDBC handler

Jesus Camacho Rodriguez (JIRA) Thu, 11 Oct 2018 16:33:15 -0700


    [ 
https://issues.apache.org/jira/browse/HIVE-20720?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16647179#comment-16647179
 ]


Jesus Camacho Rodriguez commented on HIVE-20720:
------------------------------------------------

[~daijy], I believe the current approach may cause problems. Assume a table
'tab' with columns 'a', 'b', and 'c', where 'c' is the partition column.
Suppose the user (or Calcite) defines the query:
{code:sql}
SELECT a, b FROM tab;
{code}
Unless I am mistaken, the current approach will fail when we add the partition
column predicate, because the outer query references 'c', which the inner query
does not project. In effect we are doing:
{code:sql}
SELECT * FROM (SELECT a, b FROM tab) temp WHERE temp.c < z and temp.c > y;
{code}
My proposal was to wrap the table instead, as this is more general and works
with all Project/Filter queries:
{code:sql}
SELECT a, b FROM (SELECT * FROM tab WHERE c < z AND c > y) tab;
{code}
To do that, though, we may need to generate an AST from the SQL. Another
option would be to let the user specify the table name; then we only need to
find the {{...from tabName}} pattern. What do you think?
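
A rough sketch of the second option, assuming a single-table query and a
user-supplied table name. The helper wrap_table and the sample bounds are
hypothetical, and Python with an in-memory SQLite table only stands in for the
JDBC source to keep the example runnable; none of this is the handler's code:

```python
import re
import sqlite3


def wrap_table(query, table, column, lower, upper):
    """Replace the first `FROM <table>` with a filtered subquery of the same name.

    Hypothetical helper: plain pattern matching, no SQL parsing.
    """
    pattern = re.compile(r"\bfrom\s+" + re.escape(table) + r"\b", re.IGNORECASE)
    wrapped = (
        f"FROM (SELECT * FROM {table} "
        f"WHERE {column} < {upper} AND {column} > {lower}) {table}"
    )
    # A function replacement avoids backslash processing in the new text.
    rewritten, count = pattern.subn(lambda _m: wrapped, query, count=1)
    if count == 0:
        raise ValueError(f"no 'FROM {table}' pattern found in query")
    return rewritten


# Stand-in data source: 'c' plays the role of the partition column.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tab (a INTEGER, b INTEGER, c INTEGER)")
conn.executemany("INSERT INTO tab VALUES (?, ?, ?)",
                 [(1, 10, 5), (2, 20, 15), (3, 30, 25)])

# Wrapping the whole query fails: the subquery does not project 'c'.
try:
    conn.execute("SELECT * FROM (SELECT a, b FROM tab) temp "
                 "WHERE temp.c < 20 AND temp.c > 10")
    outer_wrap_failed = False
except sqlite3.OperationalError:
    outer_wrap_failed = True

# Wrapping the table keeps 'c' visible to the predicate.
rewritten = wrap_table("SELECT a, b FROM tab", "tab", "c", 10, 20)
rows = conn.execute(rewritten).fetchall()
print(rewritten)
print(rows)
```

Pattern matching like this would likely break on joins, table aliases, or a
matching string inside a literal, which is the argument for building an AST.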

> Add partition column option to JDBC handler
> -------------------------------------------
>
>                 Key: HIVE-20720
>                 URL: https://issues.apache.org/jira/browse/HIVE-20720
>             Project: Hive
>          Issue Type: New Feature
>          Components: StorageHandler
>            Reporter: Daniel Dai
>            Assignee: Daniel Dai
>            Priority: Major
>         Attachments: HIVE-20720.1.patch, HIVE-20720.2.patch, 
> HIVE-20720.3.patch, HIVE-20720.4.patch
>
>
> Currently JdbcStorageHandler does not split input in Tez. The reason is that
> the numSplit argument of JdbcInputFormat.getSplits can only be passed via
> "mapreduce.job.maps" in Tez, and "mapreduce.job.maps" is not a valid param if
> an authorizer (e.g. SQLStdAuth) is in use. The user always ends up with 1 split.
> We need to rely on this new feature if we want to support multi-splits. Here 
> is my proposal:
> 1. Specify partitionColumn/numPartitions, and optionally lowerBound/upperBound,
> in tblproperties if the user wants to split the JDBC data source. If
> lowerBound/upperBound are not specified, JdbcStorageHandler will run a max/min
> query to obtain them in the planner. For now we can limit partitionColumn to
> numeric/date/timestamp columns for simplicity
> 2. If partitionColumn/numPartitions are not specified, don't split input
> 3. Splits are equal intervals without respect to data distribution
> 4. There is also a "hive.sql.query.split" flag that vetoes the split (can be
> set manually or automatically by Calcite)
> 5. If partitionColumn is not defined, but numPartitions is defined, use the
> original limit/offset logic (however, don't rely on numSplit).
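
The equal-interval splitting in items 1 and 3 could look roughly like the
following sketch. It assumes an integer partition column and half-open
intervals, and the function names are hypothetical; the description does not
say how the boundary values are treated, so this is one possible reading
rather than the patch's behavior:

```python
def equal_intervals(lower, upper, num_partitions):
    """Cut [lower, upper) into num_partitions half-open intervals of near-equal width.

    Data distribution is ignored; only the bounds matter.
    """
    if num_partitions < 1:
        raise ValueError("num_partitions must be at least 1")
    if upper <= lower:
        raise ValueError("upper must be greater than lower")
    span = upper - lower
    # Integer arithmetic keeps the boundaries exact; the last one equals upper.
    bounds = [lower + span * i // num_partitions
              for i in range(num_partitions + 1)]
    return list(zip(bounds[:-1], bounds[1:]))


def split_predicates(column, lower, upper, num_partitions):
    """Render one WHERE predicate per split (assumed predicate shape)."""
    return [f"{column} >= {start} AND {column} < {end}"
            for start, end in equal_intervals(lower, upper, num_partitions)]


for predicate in split_predicates("c", 0, 100, 4):
    print(predicate)
```

With half-open intervals a row whose value equals upperBound falls outside
every split, so a real implementation would need to decide how to cover the
boundary values and NULLs.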



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
