[
https://issues.apache.org/jira/browse/GRIFFIN-358?focusedWorklogId=619308&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-619308
]
Chitral Verma logged work on GRIFFIN-358:
-----------------------------------------
Author: Chitral Verma
Created on: 06/Jul/21 11:47
Start Date: 06/Jul/21 11:47
Worklog Time Spent: 504h
Issue Time Tracking
-------------------
Worklog Id: (was: 619308)
Time Spent: 505h 40m (was: 1h 40m)
> Rewrite the Rule/Measure implementations
> ----------------------------------------
>
> Key: GRIFFIN-358
> URL: https://issues.apache.org/jira/browse/GRIFFIN-358
> Project: Griffin
> Issue Type: New Feature
> Reporter: Chitral Verma
> Assignee: Chitral Verma
> Priority: Major
> Time Spent: 505h 40m
> Remaining Estimate: 0h
>
> Currently, `RuleParams` can be one of the following 3 DSL types:
> * Data Ops (for source preprocessing)
> * Griffin DSL
> * SparkSQL
> Griffin DSL allows the implementation of measures (DQ Types) like
> Completeness, Accuracy, etc.
> To enable such measures there is an extensive implementation of expressions,
> task hierarchies, and parsing, most of which depends heavily on
> scala-parser-combinators.
> In the end, Griffin DSL tries to mimic a SparkSQL-like query by substituting
> user-defined constraints.
> This approach has some drawbacks:
> * Suboptimal processing. While the transformation steps execute in parallel
> on the driver, the data set is still scanned multiple times in parallel, which
> can cause inefficiencies on the SparkSession side, and the internal task
> scheduler was single-threaded. Even though the data set can be cached, it is
> still branched, and crucial memory is used for holding the data set rather
> than for processing it.
> * Internal functions of Spark are not used. Data preprocessing currently has
> a very limited scope even though hundreds of Spark SQL functions are
> available for use.
> * This blocks Structured Streaming. The manually constructed SQL queries
> cause multiple aggregations in the same query on a streaming data set, which
> is not supported by Spark Structured Streaming. There are workarounds for
> this, but they all require rewriting the *Expr2DQSteps classes.
> * Griffin DSL is SparkSQL-like but not 100% compatible. The Profiling measure
> and SparkSQL are redundant functionalities.
> The proposed solution involves SparkSQL DSL-based measures and some changes
> to Rule Params. This will enhance the data preprocessing flows and the
> measures themselves.
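
For illustration only (this is not Griffin code), the idea of a measure expressed as plain SQL can be sketched as a completeness check that counts missing values for several columns in a single aggregate query, so the data is scanned once rather than once per constraint. The sketch uses Python's built-in sqlite3 module as a stand-in for SparkSQL; the table name `source`, its columns, the sample rows, and the function name `completeness_metrics` are all hypothetical.

```python
import sqlite3

def completeness_metrics(rows):
    """Count missing values per column with one SQL aggregate query.

    sqlite3 is only a stand-in for SparkSQL here; the `source` table and
    its columns are hypothetical and not taken from Griffin.
    """
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE source (id INTEGER, name TEXT, email TEXT)")
        conn.executemany("INSERT INTO source VALUES (?, ?, ?)", rows)
        # COUNT(col) skips NULLs, so total minus COUNT(col) is the number
        # of incomplete values. Both columns are handled in the same scan.
        total, name_ok, email_ok = conn.execute(
            "SELECT COUNT(*), COUNT(name), COUNT(email) FROM source"
        ).fetchone()
    finally:
        conn.close()
    return {
        "total": total,
        "name_incomplete": total - name_ok,
        "email_incomplete": total - email_ok,
    }

# Hypothetical sample data; None marks a missing value.
sample_rows = [
    (1, "alice", "a@example.com"),
    (2, "bob", None),
    (3, None, "c@example.com"),
    (4, "dave", "d@example.com"),
]
metrics = completeness_metrics(sample_rows)
print(metrics)
```

Because the constraint is ordinary SQL, any built-in SQL function of the engine could be used in the same query, which is the kind of reuse the description argues for.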
--
This message was sent by Atlassian Jira
(v8.3.4#803005)