[jira] [Work logged] (GRIFFIN-358) Rewrite the Rule/Measure implementations

ASF GitHub Bot (Jira) Fri, 28 May 2021 13:52:08 -0700


     [ 
https://issues.apache.org/jira/browse/GRIFFIN-358?focusedWorklogId=603701&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-603701
 ]


ASF GitHub Bot logged work on GRIFFIN-358:
------------------------------------------

                Author: ASF GitHub Bot
            Created on: 28/May/21 20:51
            Start Date: 28/May/21 20:51
    Worklog Time Spent: 10m 
      Work Description: chitralverma opened a new pull request #591:
URL: https://github.com/apache/griffin/pull/591


   **What changes were proposed in this pull request?**
   
   
   Currently, `RuleParams` can be one of the following 3 DSL types:
   
   - Data Ops (for source preprocessing)
   - Griffin DSL
   - SparkSQL
   
   Griffin DSL allows the implementation of measures (DQ Types) like 
Completeness, Accuracy, etc.
   
   To enable such measures, there is an extensive implementation of 
expressions, task hierarchies, and parsing, most of which depends heavily on 
scala-parser-combinators.
   
   Ultimately, Griffin DSL tries to mimic a SparkSQL-like query, but with 
user-defined constraints substituted in.
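
As a rough illustration of that substitution step, the sketch below builds a 
SparkSQL-style completeness query from user-supplied column names. It is not 
Griffin's implementation: the function name `completeness_query`, the output 
column names `total` and `incomplete`, and the table name are all hypothetical, 
and Python is used only to keep the example self-contained.

```python
from typing import Sequence


def completeness_query(table: str, columns: Sequence[str]) -> str:
    """Build a SparkSQL-style query that counts rows with a NULL in any listed column.

    Hypothetical helper for illustration only; not part of Griffin.
    """
    if not columns:
        raise ValueError("at least one column is required")
    # The user-defined constraint: a row is incomplete if any column is NULL.
    null_check = " OR ".join(f"`{c}` IS NULL" for c in columns)
    # Substitute the constraint into a fixed query template.
    return (
        "SELECT COUNT(*) AS total, "
        f"SUM(CASE WHEN {null_check} THEN 1 ELSE 0 END) AS incomplete "
        f"FROM `{table}`"
    )


if __name__ == "__main__":
    print(completeness_query("source", ["id", "email"]))
```

The user only names the columns; the query text around them is fixed by the 
template, which is the kind of substitution the paragraph above describes.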
   
   This approach has some drawbacks:
   
   - Suboptimal processing. Although the transformation steps execute in 
parallel on the driver, the data set is still scanned multiple times in 
parallel, which can cause inefficiencies on the SparkSession side; in 
addition, the internal task scheduler was single-threaded. Even though the 
data set can be cached, it is still branched, and crucial memory is spent 
holding the data set rather than processing it.
   - Internal functions of Spark are not used. Data preprocessing currently 
has a very limited scope, even though hundreds of Spark SQL functions are 
available for use.
   - This blocks structured streaming. The manually constructed SQL queries 
cause multiple aggregations in the same query on a streaming data set, which 
is not supported by Spark's Structured Streaming. There are workarounds for 
this, but they all require rewriting the *Expr2DQSteps classes.
   - Griffin DSL is SparkSQL-like but not 100% compatible. The Profiling 
measure and SparkSQL are redundant functionalities.
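
The multiple-scan cost in the first drawback can be illustrated with a toy 
model outside Spark. In the sketch below, `CountingSource`, `per_measure`, and 
`single_pass` are hypothetical names, and a plain Python list stands in for the 
data set. It only shows that computing each metric separately scans the data 
once per metric, while a combined aggregation scans it once; it makes no claim 
about how Spark or Griffin schedule work internally.

```python
from typing import Dict, Iterator, List, Optional

Row = Dict[str, Optional[str]]


class CountingSource:
    """Toy data set that records how many times it is fully scanned."""

    def __init__(self, rows: List[Row]) -> None:
        self._rows = rows
        self.scans = 0

    def __iter__(self) -> Iterator[Row]:
        # Every new iteration over the source counts as one scan.
        self.scans += 1
        return iter(self._rows)


def per_measure(source: CountingSource, column: str) -> Dict[str, int]:
    """Compute each metric with its own pass over the data (three scans)."""
    total = sum(1 for _ in source)
    nulls = sum(1 for row in source if row[column] is None)
    distinct = len({row[column] for row in source if row[column] is not None})
    return {"total": total, "nulls": nulls, "distinct": distinct}


def single_pass(source: CountingSource, column: str) -> Dict[str, int]:
    """Compute the same metrics in one combined pass (one scan)."""
    total = 0
    nulls = 0
    seen = set()
    for row in source:
        total += 1
        value = row[column]
        if value is None:
            nulls += 1
        else:
            seen.add(value)
    return {"total": total, "nulls": nulls, "distinct": len(seen)}


if __name__ == "__main__":
    rows = [{"email": "a"}, {"email": None}, {"email": "a"}, {"email": "b"}]
    separate, combined = CountingSource(rows), CountingSource(rows)
    print(per_measure(separate, "email"), "scans:", separate.scans)
    print(single_pass(combined, "email"), "scans:", combined.scans)
```

Both functions return the same metrics; only the number of scans differs.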
   
   The proposed solution involves SparkSQL DSL based measures and some changes 
to Rule Params. This will enhance the data preprocessing flows and the 
measures themselves.
   
   
   **Does this PR introduce any user-facing change?**
   Yes. Users can use the new measures as a separate configuration and there is 
scope for more data pre-processing.
   
   **How was this patch tested?**
   Unit Tests


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
-------------------

            Worklog Id:     (was: 603701)
    Remaining Estimate: 0h
            Time Spent: 10m

> Rewrite the Rule/Measure implementations
> ----------------------------------------
>
>                 Key: GRIFFIN-358
>                 URL: https://issues.apache.org/jira/browse/GRIFFIN-358
>             Project: Griffin
>          Issue Type: New Feature
>            Reporter: Chitral Verma
>            Assignee: Chitral Verma
>            Priority: Major
>          Time Spent: 10m
>  Remaining Estimate: 0h
>



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
