[jira] [Commented] (GRIFFIN-335) Hive Connector: Ability to Use "group by" clause

Obaidul Karim (Jira) Sat, 18 Jul 2020 11:23:02 -0700


    [ 
https://issues.apache.org/jira/browse/GRIFFIN-335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17160518#comment-17160518
 ]


Obaidul Karim commented on GRIFFIN-335:
---------------------------------------

Hi [~tushar.patil]

Thanks for the comment and suggestion.

If I want to measure the correctness of data by matching source against
target, profiling is not a good option. It would need additional steps outside
of Griffin to do the matching, which reduces Griffin's usability.

Adding aggregation options will improve Griffin's usability, as it will reduce
data transfer, which in turn reduces cost.

So why don't we add this option, especially when it could be done with minimal
changes to the code? :)

(Same comment as in GRIFFIN-333.)

> Hive Connector: Ability to Use "group by" clause
> ------------------------------------------------
>
>                 Key: GRIFFIN-335
>                 URL: https://issues.apache.org/jira/browse/GRIFFIN-335
>             Project: Griffin
>          Issue Type: Improvement
>          Components: accuracy-batch
>    Affects Versions: 0.6.0
>            Reporter: Azhar
>            Priority: Major
>              Labels: columns, groupby, hive
>
> *Background:*
> Refer to [https://issues.apache.org/jira/projects/GRIFFIN/issues/GRIFFIN-334 
> |https://issues.apache.org/jira/projects/GRIFFIN/issues/GRIFFIN-332] and 
> https://issues.apache.org/jira/browse/GRIFFIN-333 .
>  If we have the ability to select specific columns, it opens the door to 
> SQL-based aggregation, further reducing the volume of data transferred from 
> Hive sources.
> *Proposed Improvement:*
>  So, I propose a feature that allows the Hive connector to use SQL-based 
> aggregations.
>  
> Let's say we have source and target tables that have data like below.
> src:
> {code:java}
> ------------------------
> |employee_id   |country|
> ------------------------
> |1             | NZ    |
> |2             | DE    |
> |3             | DE    |
> |4             | NZ    |
> |5             | DE    |
> ....
> ....
> ------------------------
> {code}
> tgt:
> {code:java}
> ------------------------
> |total_employee|country|
> ------------------------
> |10            | NZ    |
> |11            | DE    |
> ------------------------
> {code}
> Then we can perform an `accuracy` check [ `"rule":"src.total_employee = 
> tgt.total_employee and src.country = tgt.country"` ] directly, as below, 
> using the `columns` and `groupby` clauses for the source table:
> {code:java}
>       {
>          "name":"src",
>          "connector":{
>             "type":"hive",
>             "config":{
>                "database":"mydatabase",
>                "table.name":"mytable",
>                "columns": "count(*) total_employee, country",
>                "groupby": "country",
>                "where":""
>             }
>          }
>       }
> {code}
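
For illustration only, here is a minimal sketch of how the proposed `columns`, `groupby` and `where` settings from the quoted config could map onto a single Hive query. It is not Griffin's actual connector code: the helper name `build_hive_query` and the defaulting rules (empty `columns` means `*`, empty `where` or `groupby` is skipped) are assumptions made for this example.

```python
def build_hive_query(config):
    """Build a SELECT statement from a connector config dict.

    Hypothetical helper, not part of Griffin; it only shows how the
    proposed keys could be combined into one query string.
    """
    # Assumption: an empty or missing "columns" value selects everything.
    columns = config.get("columns", "").strip() or "*"
    table = f'{config["database"]}.{config["table.name"]}'
    query = f"SELECT {columns} FROM {table}"

    # Assumption: empty "where" / "groupby" values are simply omitted.
    where = config.get("where", "").strip()
    if where:
        query += f" WHERE {where}"
    groupby = config.get("groupby", "").strip()
    if groupby:
        query += f" GROUP BY {groupby}"
    return query


# The "config" object from the quoted source connector definition.
src_config = {
    "database": "mydatabase",
    "table.name": "mytable",
    "columns": "count(*) total_employee, country",
    "groupby": "country",
    "where": "",
}

print(build_hive_query(src_config))
# SELECT count(*) total_employee, country FROM mydatabase.mytable GROUP BY country
```

With a query of this shape, only one row per country would leave Hive, and the `accuracy` rule would then compare those aggregated rows against `tgt`.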



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
