[ 
https://issues.apache.org/jira/browse/GRIFFIN-334?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Azhar updated GRIFFIN-334:
--------------------------
    Description: 
*Background:*
 See https://issues.apache.org/jira/browse/GRIFFIN-332 ; we would like the
same feature for Hive as well.
 Currently the Hive connector pulls all columns with `"SELECT * FROM $fullTableName"`.
 This causes problems for larger Hive tables:
 - memory overhead for the Spark DataFrame
 - longer execution time

*Proposed Feature:*
 Allow the Hive connector to select only the required columns instead of
all columns.

*Example:*
 Suppose we have the rule `"rule":"src.id = tgt.id and src.country = tgt.country "`.
Then we only need two columns, `id` and `country`.
 So, in the connector config we can add an additional keyword, `columns`, to
select only the required columns, like below:
{code:java}
    {
       "name": "src",
       "connector": {
          "type": "hive",
          "config": {
             "database": "mydatabase",
             "table.name": "mytable",
             "columns": "id, country",
             "where": ""
          }
       }
    }
{code}
We can implement it like this: if the `columns` key is present in the config,
use it; otherwise use `*` as the default.
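
The fallback described above could be sketched roughly as follows. This is only an illustration and not Griffin's actual connector code: the class `HiveSelectSketch` and the method `buildQuery` are hypothetical names, `fullTableName` is assumed to be `database.table.name`, and the `where` key is ignored here.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical helper for illustration; not part of Griffin.
class HiveSelectSketch {

    // Returns the trimmed config value, or "" when the key is missing or null.
    private static String value(Map<String, String> config, String key) {
        String v = config.get(key);
        return v == null ? "" : v.trim();
    }

    // Builds the SELECT statement from a connector "config" map.
    static String buildQuery(Map<String, String> config) {
        // Assumption: the full table name is "<database>.<table.name>".
        String fullTableName = value(config, "database") + "." + value(config, "table.name");

        // Split the comma-separated column list, dropping blanks and extra spaces.
        String selectList = Arrays.stream(value(config, "columns").split(","))
                .map(String::trim)
                .filter(c -> !c.isEmpty())
                .collect(Collectors.joining(", "));

        // No usable columns configured: keep the current behaviour and select everything.
        if (selectList.isEmpty()) {
            selectList = "*";
        }
        return "SELECT " + selectList + " FROM " + fullTableName;
    }

    public static void main(String[] args) {
        Map<String, String> config = new LinkedHashMap<>();
        config.put("database", "mydatabase");
        config.put("table.name", "mytable");
        config.put("columns", "id, country");
        System.out.println(buildQuery(config)); // SELECT id, country FROM mydatabase.mytable

        config.remove("columns");
        System.out.println(buildQuery(config)); // SELECT * FROM mydatabase.mytable
    }
}
```

Treating an empty or blank `columns` value the same as a missing key keeps existing configs working unchanged.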

  was:
*Background:*
 Refer to https://issues.apache.org/jira/browse/GRIFFIN-332 , we would like 
same feature for Hive as well.
Currently it is pulling all the columns `"SELECT * FROM $fullTableName"`.
It will cause some issues for larger Hive tables --
 - memory overhead for spark dataframe
 - longer execution time


*Proposed Feature:*
So, I propose the feature to allow Hive connector to able to select only 
required columns.

*Example:*
We have a rule `"rule":"src.id = tgt.id and src.country = tgt.country "`. Then 
we only need two columns `id` and 'country'.
So, in connector we can add additional key word `columns` to select only 
required columns, like below: 
{code:java}
    {
         "name":"src",         
         "connector":{
            "type":"hive",
            "config":{
               "database":"mydatabase",
               "table.name":"mytable",
               "columns": "id, country",
               "where":""
            }
         }
    }
{code}
We can implement it like this, if there is `columns` clause then use it 
otherwise use `*` as default.

        Summary: Hive Connector: Ability to Select Specific Columns Instead of 
All the Columns  (was: HIve Connector: Ability to Select Specific Columns 
Instead of All the Columns)

> Hive Connector: Ability to Select Specific Columns Instead of All the Columns
> -----------------------------------------------------------------------------
>
>                 Key: GRIFFIN-334
>                 URL: https://issues.apache.org/jira/browse/GRIFFIN-334
>             Project: Griffin
>          Issue Type: Improvement
>          Components: accuracy-batch
>    Affects Versions: 0.6.0
>            Reporter: Azhar
>            Priority: Major
>              Labels: columns, hive



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
