[
https://issues.apache.org/jira/browse/GRIFFIN-333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17160653#comment-17160653
]
Obaidul Karim commented on GRIFFIN-333:
---------------------------------------
Hi [~tushar.patil]
I understand your standpoint.
Regardless of specific use cases, let's consider the example in my description
above. If the src table is 1 TB in size and lives in an RDBMS, then with the
current design the entire 1 TB of data will be transferred into the Spark data
frame [please correct me if my understanding is wrong here], whatever the
measure is, profiling or correctness.
But all I need is to compare the country-wise totals between my source and
target tables. If we could aggregate the data before loading it into a Spark
data frame, it would save a lot of data transfer.
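
To illustrate the idea, here is a minimal sketch (not Griffin's actual code) of
how a connector could turn the proposed `columns`, `groupby` and `where`
settings into a single subquery. The function name `build_pushdown_query` and
the alias `src` are hypothetical. Spark's JDBC reader accepts a parenthesised
subquery with an alias as its `dbtable` option, so the database would run the
aggregation and only the grouped rows would be transferred.

```python
def build_pushdown_query(config):
    """Build a SQL subquery string from a connector config dict.

    `columns` and `groupby` are the keys proposed in this issue; they are
    assumptions here, not options of the existing JDBC connector.
    """
    columns = (config.get("columns") or "").strip() or "*"
    sql = f"SELECT {columns} FROM {config['tablename']}"

    where = (config.get("where") or "").strip()
    if where:
        sql += f" WHERE {where}"

    groupby = (config.get("groupby") or "").strip()
    if groupby:
        sql += f" GROUP BY {groupby}"

    # Wrap as an aliased subquery so it can be used where a table name is expected.
    return f"({sql}) src"


config = {
    "tablename": "mytable",
    "columns": "count(*) total_employee, country",
    "groupby": "country",
    "where": "",
}
dbtable = build_pushdown_query(config)
print(dbtable)
```

The config values are trusted input in this sketch; a real implementation
would need to validate them before building SQL from them.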
Just to tell you, yesterday I ran a test. All I wanted to do was check the
record count of a table between our operational DB and our data lake. As the
JDBC connector does not support aggregation, I used correctness and joined on
the id columns in the rule. I waited 6 hours for the load from the source to
complete and ended up with an EMR spot failure :(. If I used profiling it would
be the same [if my understanding is correct that it loads all the data into a
Spark data frame first and then does the other operations].
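
As a small illustration of the kind of check described above, the sketch below
counts rows per key inside each database and compares only the aggregated
results. It is an assumption-laden stand-in: two in-memory SQLite databases
play the roles of the operational DB and the data lake, the table `employee`
and its rows are made up, and the helper `grouped_counts` is hypothetical.

```python
import sqlite3


def grouped_counts(conn, table, key):
    """Return {key_value: row_count}, with the GROUP BY executed by the database."""
    # `table` and `key` are trusted identifiers in this sketch, not user input.
    cursor = conn.execute(f"SELECT {key}, COUNT(*) FROM {table} GROUP BY {key}")
    return dict(cursor.fetchall())


# Made-up sample data: the target is missing one DE row.
src = sqlite3.connect(":memory:")
tgt = sqlite3.connect(":memory:")
samples = (
    (src, [(1, "NZ"), (2, "DE"), (3, "DE")]),
    (tgt, [(1, "NZ"), (2, "DE")]),
)
for conn, rows in samples:
    conn.execute("CREATE TABLE employee (employee_id INTEGER, country TEXT)")
    conn.executemany("INSERT INTO employee VALUES (?, ?)", rows)

src_counts = grouped_counts(src, "employee", "country")
tgt_counts = grouped_counts(tgt, "employee", "country")

# Keep only the keys whose counts differ between source and target.
mismatches = {
    k: (src_counts.get(k), tgt_counts.get(k))
    for k in src_counts.keys() | tgt_counts.keys()
    if src_counts.get(k) != tgt_counts.get(k)
}
print(mismatches)  # {'DE': (2, 1)}
```

Only one small row per country leaves each database, regardless of how large
the underlying tables are.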
Please feel free to ask me more. I am happy to answer and am eagerly waiting to
use Griffin in our production data lake.
I would also love to contribute.
> JDBC Connector: Ability to Use "group by" clause
> ------------------------------------------------
>
> Key: GRIFFIN-333
> URL: https://issues.apache.org/jira/browse/GRIFFIN-333
> Project: Griffin
> Issue Type: Improvement
> Components: accuracy-batch
> Affects Versions: 0.6.0
> Reporter: Obaidul Karim
> Priority: Major
> Labels: column, groupby, jdbc
>
> *Background:*
> Refer to [https://issues.apache.org/jira/projects/GRIFFIN/issues/GRIFFIN-332].
> If we have the ability to select specific columns, it will open the door to
> SQL-based aggregation, further reducing the volume of data read from JDBC
> sources.
>
> *Proposed Improvement:*
> So, I propose a feature that allows the JDBC connector to use SQL-based
> aggregations through a `groupby` clause.
> *Example:*
> Let's say we have source and target tables that have data like below.
> src:
> {code:java}
> ------------------------
> |employee_id |country|
> ------------------------
> |1 | NZ |
> |2 | DE |
> |3 | DE |
> |4 | NZ |
> |5 | DE |
> ....
> ....
> ------------------------
> {code}
> tgt:
> {code:java}
> ------------------------
> |total_employee|country|
> ------------------------
> |10 | NZ |
> |11 | DE |
> ------------------------
> {code}
> Then we can perform an `accuracy` check [ `"rule":"src.total_employee =
> tgt.total_employee and src.country = tgt.country "` ] directly, as shown
> below, using the `columns` and `groupby` clauses for the source table:
> {code:java}
> {
>   "name":"src",
>   "connector":{
>     "type":"jdbc",
>     "config":{
>       "database":"mydatabase",
>       "tablename":"mytable",
>       "columns":"count(*) total_employee, country",
>       "groupby":"country",
>       "url":"jdbc:sqlserver://myhost:1433;databaseName=mydatabase",
>       "user":"user",
>       "password":"password",
>       "driver":"com.microsoft.sqlserver.jdbc.SQLServerDriver",
>       "where":""
>     }
>   }
> }
> {code}