Thanks Grant.

I agree that Griffin-DSL should leverage spark-sql for the SQL part, and
that Griffin-DSL should work as a DQ layer that assembles the different
dimensions, much as MLlib does. Since we already have some experience in
the data quality domain, now is the time for Griffin-DSL to evolve to
the next level.

Thanks,
William


On Wed, Jan 30, 2019 at 5:48 AM Grant <[email protected]> wrote:

> Hi all,
>
> I would suggest simplifying Griffin-DSL.
>
> Currently, Griffin supports three DSL types: spark-sql, griffin-dsl, and
> df-ops. In this proposal, I focus only on the first two.
>
> Griffin-DSL is a SQL-like language that supports a wide range of
> clauses, keywords, operators, etc., much like Spark SQL. The class
> "GriffinDslParser" defines how this SQL-like syntax is parsed. In
> practice, Griffin-DSL's SQL-like syntax could be covered completely by
> Spark SQL: Spark 2.0 substantially improved SQL functionality with
> SQL:2003 support and can now run all 99 TPC-DS queries.
>
> So is it possible for Griffin-DSL to drop all of its SQL-like language
> features? All rules that can be expressed in SQL would then be
> categorized under the "spark-sql" DSL type instead of "griffin-dsl". In
> this way, we could simplify the implementation of Griffin-DSL.
>
> In my understanding, Griffin-DSL should be a set of high-order
> expressions, each of which represents a specific set of semantics.
> Griffin-DSL would keep focusing on expressions with richer semantics in
> the data exploration and wrangling areas, and leave all SQL-compatible
> expressions to Spark SQL. Griffin-DSL would still be translated into
> Spark SQL when executed.
>
> Here is an example from the unit test "_accuracy-batch-griffindsl.json":
>
> "evaluate.rule": {
>     "rules": [
>       {
>         "dsl.type": "griffin-dsl",
>         "dq.type": "accuracy",
>         "out.dataframe.name": "accu",
>         "rule": "source.user_id = target.user_id AND
> upper(source.first_name) = upper(target.first_name) AND source.last_name =
> target.last_name AND source.address = target.address AND source.email =
> target.email AND source.phone = target.phone AND source.post_code =
> target.post_code",
>         "details": {
>           "source": "source",
>           "target": "target",
>           "miss": "miss_count",
>           "total": "total_count",
>           "matched": "matched_count"
>         },
>         "out":[
>           {
>             "type": "record",
>             "name": "missRecords"
>           }
>         ]
>       }
>     ]
>   }
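>
> For illustration, the accuracy semantics above might translate into
> Spark SQL roughly as follows. This is a sketch only: the LEFT JOIN
> anti-join formulation, and registering the miss records as a temp view
> named "missRecords", are my assumptions, not the exact SQL Griffin
> generates.
>
> -- miss records: source rows with no matching row in target
> SELECT source.*
> FROM source
> LEFT JOIN target
>   ON source.user_id = target.user_id
>  AND upper(source.first_name) = upper(target.first_name)
>  AND source.last_name = target.last_name
>  AND source.address = target.address
>  AND source.email = target.email
>  AND source.phone = target.phone
>  AND source.post_code = target.post_code
> WHERE target.user_id IS NULL
>
> -- metrics following the "details" mapping in the rule above
> SELECT
>   (SELECT COUNT(*) FROM missRecords) AS miss_count,
>   (SELECT COUNT(*) FROM source) AS total_count,
>   (SELECT COUNT(*) FROM source)
>     - (SELECT COUNT(*) FROM missRecords) AS matched_count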
>
>   If we move the SQL-like syntax out of Griffin-DSL, the preceding
> example would take "dsl.type" as "spark-sql", and "rule" would probably
> become a list of columns, or all columns by default.
>
>   Discussion is welcome.
>
> Grant
>
