[ 
https://issues.apache.org/jira/browse/GRIFFIN-358?focusedWorklogId=619305&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-619305
 ]

ASF GitHub Bot logged work on GRIFFIN-358:
------------------------------------------

                Author: ASF GitHub Bot
            Created on: 06/Jul/21 11:46
            Start Date: 06/Jul/21 11:46
    Worklog Time Spent: 10m 
      Work Description: chitralverma commented on pull request #591:
URL: https://github.com/apache/griffin/pull/591#issuecomment-874684760


   Thanks for the merge! :)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@griffin.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
-------------------

    Worklog Id:     (was: 619305)
    Time Spent: 1h 40m  (was: 1.5h)

> Rewrite the Rule/Measure implementations
> ----------------------------------------
>
>                 Key: GRIFFIN-358
>                 URL: https://issues.apache.org/jira/browse/GRIFFIN-358
>             Project: Griffin
>          Issue Type: New Feature
>            Reporter: Chitral Verma
>            Assignee: Chitral Verma
>            Priority: Major
>          Time Spent: 1h 40m
>  Remaining Estimate: 0h
>
> Currently, `RuleParams` can be one of the following 3 DSL types:
>  * Data Ops (for source preprocessing)
>  * Griffin DSL
>  * SparkSQL
> Griffin DSL allows the implementation of measures (DQ types) like 
> Completeness, Accuracy, etc.
> To enable such measures there is an extensive implementation of expressions, 
> task hierarchies, and parsing, most of which depends heavily on 
> scala-parser-combinators.
> In the end, Griffin DSL tries to mimic a SparkSQL-like query with 
> user-defined constraints substituted in.
> This approach has some drawbacks:
>  * Suboptimal processing. While the transformation steps execute in parallel 
> on the driver, the data set is still scanned multiple times in parallel, which 
> can cause inefficiencies on the SparkSession side, and the internal task 
> scheduler was single-threaded. Even though the data set can be cached, it 
> is still branched, and crucial memory is spent holding the data set rather 
> than processing it.
>  * Internal functions of Spark are not used. Data preprocessing currently has 
> a very limited scope even though hundreds of Spark SQL functions are 
> available for use.
>  * This blocks structured streaming. The manually constructed SQL queries 
> cause multiple aggregations in the same query on a streaming data set, which 
> is not supported by Spark's Structured Streaming. There are workarounds for 
> this, but they all require rewriting the *Expr2DQSteps classes.
>  * Griffin DSL is SparkSQL-like but not 100% compatible. The Profiling measure 
> and SparkSQL are redundant functionalities.
> The proposed solution involves SparkSQL DSL based measures and some changes 
> to Rule Params. This will enhance the data preprocessing flows and the measures 
> themselves.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
