[
https://issues.apache.org/jira/browse/GRIFFIN-335?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Azhar updated GRIFFIN-335:
--------------------------
Description:
*Background:*
Refer to [https://issues.apache.org/jira/projects/GRIFFIN/issues/GRIFFIN-334
|https://issues.apache.org/jira/projects/GRIFFIN/issues/GRIFFIN-332] and
https://issues.apache.org/jira/browse/GRIFFIN-333 .
If we have the ability to select specific columns, it opens the door to
SQL-based aggregation, further reducing the volume of data read from Hive sources.
*Proposed Improvement:*
I propose a feature that allows the Hive connector to use SQL-based
aggregations.
Suppose the source and target tables contain data like the following.
src:
{code:java}
-------------------------
| employee_id | country |
-------------------------
| 1           | NZ      |
| 2           | DE      |
| 3           | DE      |
| 4           | NZ      |
| 5           | DE      |
....
....
-------------------------
{code}
tgt:
{code:java}
----------------------------
| total_employee | country |
----------------------------
| 10             | NZ      |
| 11             | DE      |
----------------------------
{code}
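As a rough illustration (plain Python, not Griffin code), the aggregation that brings src into the same shape as tgt, followed by an equality check on both columns, could look like the sketch below. The function names `aggregate_by_country` and `matched` are hypothetical, and only the five src rows visible above are used, since the remaining rows are elided.

```python
from collections import Counter

# Only the five src rows visible in the ticket; the elided rows ("....") are unknown.
src_rows = [(1, "NZ"), (2, "DE"), (3, "DE"), (4, "NZ"), (5, "DE")]


def aggregate_by_country(rows):
    """Equivalent of: SELECT count(*) total_employee, country ... GROUP BY country."""
    counts = Counter(country for _employee_id, country in rows)
    return [{"total_employee": n, "country": c} for c, n in sorted(counts.items())]


def matched(src, tgt):
    """Return src records that have a tgt record equal on total_employee and country."""
    tgt_keys = {(r["total_employee"], r["country"]) for r in tgt}
    return [r for r in src if (r["total_employee"], r["country"]) in tgt_keys]


src_agg = aggregate_by_country(src_rows)
print(src_agg)
```

With just these five rows the counts are 3 for DE and 2 for NZ; against the full source table the same aggregation would be compared with the totals stored in tgt.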
Then we can perform an `accuracy` check [ `"rule":"src.total_employee =
tgt.total_employee and src.country = tgt.country"` ] directly, using the
`columns` and `groupby` settings on the source table, as shown below:
{code:java}
{
  "name": "src",
  "connector": {
    "type": "hive",
    "config": {
      "database": "mydatabase",
      "table.name": "mytable",
      "columns": "count(*) total_employee, country",
      "groupby": "country",
      "where": ""
    }
  }
}
{code}
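To make the proposal concrete, here is a minimal Python sketch of how such a config might be turned into a Hive query. It is an assumption about one possible translation, not Griffin's actual implementation: the helper `build_hive_query` is hypothetical, and treating an empty `columns` as `*` and skipping empty `where`/`groupby` values are illustrative choices.

```python
def build_hive_query(config: dict) -> str:
    """Assemble a SELECT statement from a Hive connector config (hypothetical helper)."""
    # Fall back to all columns when no projection is given (assumed default).
    columns = config.get("columns", "").strip() or "*"
    table = f'{config["database"]}.{config["table.name"]}'
    query = f"SELECT {columns} FROM {table}"

    # Append optional clauses only when they are non-empty.
    where = config.get("where", "").strip()
    if where:
        query += f" WHERE {where}"
    groupby = config.get("groupby", "").strip()
    if groupby:
        query += f" GROUP BY {groupby}"
    return query


config = {
    "database": "mydatabase",
    "table.name": "mytable",
    "columns": "count(*) total_employee, country",
    "groupby": "country",
    "where": "",
}
print(build_hive_query(config))
```

For the config above this yields `SELECT count(*) total_employee, country FROM mydatabase.mytable GROUP BY country`, so only one row per country leaves Hive instead of one row per employee.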
was:
*Background:*
Refer to [https://issues.apache.org/jira/projects/GRIFFIN/issues/GRIFFIN-334
|https://issues.apache.org/jira/projects/GRIFFIN/issues/GRIFFIN-332] and
https://issues.apache.org/jira/browse/GRIFFIN-333 .
If we have the ability to select specific columns, it opens the door to
SQL-based aggregation, further reducing the volume of data read from Hive sources.
*Proposed Improvement:*
I propose a feature that allows the Hive connector to use SQL-based
aggregations.
Suppose the source and target tables contain data like the following.
src:
{code:java}
-------------------------
| employee_id | country |
-------------------------
| 1           | NZ      |
| 2           | DE      |
| 3           | DE      |
| 4           | NZ      |
| 5           | DE      |
....
....
-------------------------
{code}
tgt:
{code:java}
----------------------------
| total_employee | country |
----------------------------
| 10             | NZ      |
| 11             | DE      |
----------------------------
{code}
Then we can perform an `accuracy` check directly, using the `columns` and
`groupby` settings on the source table, as shown below:
{code:java}
{
  "name": "src",
  "connector": {
    "type": "hive",
    "config": {
      "database": "mydatabase",
      "table.name": "mytable",
      "columns": "count(*) total_employee, country",
      "groupby": "country",
      "where": ""
    }
  }
}
{code}
> Hive Connector: Ability to Use "group by" clause
> ------------------------------------------------
>
> Key: GRIFFIN-335
> URL: https://issues.apache.org/jira/browse/GRIFFIN-335
> Project: Griffin
> Issue Type: Improvement
> Components: accuracy-batch
> Affects Versions: 0.6.0
> Reporter: Azhar
> Priority: Major
> Labels: columns, groupby, hive
>
--
This message was sent by Atlassian Jira
(v8.3.4#803005)