Thanks Grant. I agree that Griffin-DSL should leverage spark-sql for the SQL part, and that Griffin-DSL should work as a DQ layer assembling different dimensions, much as MLlib does. Since we already have some experience in the data quality domain, it is time for Griffin-DSL to evolve to the next level.
Thanks,
William

On Wed, Jan 30, 2019 at 5:48 AM Grant <[email protected]> wrote:

> Hi all,
>
> I would suggest simplifying Griffin-DSL.
>
> Currently, Griffin supports three types of DSL: spark-sql, griffin-dsl,
> and df-ops. In this proposal, I only focus on the first two.
>
> Griffin-DSL is a SQL-like language that supports a wide range of clauses,
> keywords, operators, etc., just as Spark SQL does. The class
> "GriffinDslParser" defines how to parse this SQL-like syntax. In fact,
> Griffin-DSL's SQL-like syntax could be covered completely by Spark SQL.
> Spark 2.0 substantially improved its SQL functionality with SQL:2003
> support and can now run all 99 TPC-DS queries.
>
> So is it possible to remove all SQL-like language features from
> Griffin-DSL? All rules that can be expressed in SQL would be categorized
> under the "spark-sql" DSL type instead of "griffin-dsl". This would let
> us simplify the implementation of Griffin-DSL.
>
> In my understanding, Griffin-DSL should consist of high-order
> expressions, each of which represents a specific set of semantics.
> Griffin-DSL would continue to focus on expressions with richer semantics
> in the data exploration and wrangling area, and leave all SQL-compatible
> expressions to Spark SQL. Griffin-DSL would still be translated into
> Spark SQL when executed.
>
> Here is an example from the unit test "_accuracy-batch-griffindsl.json":
>
> "evaluate.rule": {
>   "rules": [
>     {
>       "dsl.type": "griffin-dsl",
>       "dq.type": "accuracy",
>       "out.dataframe.name": "accu",
>       "rule": "source.user_id = target.user_id AND upper(source.first_name) = upper(target.first_name) AND source.last_name = target.last_name AND source.address = target.address AND source.email = target.email AND source.phone = target.phone AND source.post_code = target.post_code",
>       "details": {
>         "source": "source",
>         "target": "target",
>         "miss": "miss_count",
>         "total": "total_count",
>         "matched": "matched_count"
>       },
>       "out": [
>         {
>           "type": "record",
>           "name": "missRecords"
>         }
>       ]
>     }
>   ]
> }
>
> If we move the SQL-like syntax out of Griffin-DSL, the preceding example
> would take "dsl.type" as "spark-sql", and "rule" would probably become a
> list of columns, or all columns by default.
>
> Discussion is welcome.
>
> Grant
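
To make the proposed change concrete, a rough sketch of how the same accuracy rule might look afterwards is shown below. This is only an illustration: the column-list form of "rule" is an assumption based on Grant's description, not a shape Griffin defines today.

"evaluate.rule": {
  "rules": [
    {
      "dsl.type": "spark-sql",
      "dq.type": "accuracy",
      "out.dataframe.name": "accu",
      "rule": ["user_id", "first_name", "last_name", "address", "email", "phone", "post_code"],
      "details": {
        "source": "source",
        "target": "target",
        "miss": "miss_count",
        "total": "total_count",
        "matched": "matched_count"
      },
      "out": [
        {
          "type": "record",
          "name": "missRecords"
        }
      ]
    }
  ]
}

Omitting "rule" entirely could then mean comparing all columns by default, per Grant's note.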

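Since Grant notes that Griffin-DSL rules are still translated into Spark SQL at execution time, the miss-records side of the original accuracy rule would translate into roughly the following query. This is a hand-written approximation of the idea, not the SQL Griffin actually generates:

-- Approximate miss-records query (illustrative only; not Griffin's
-- actual generated SQL). Rows in source with no matching row in target.
SELECT source.*
FROM source
LEFT JOIN target
  ON source.user_id = target.user_id
 AND upper(source.first_name) = upper(target.first_name)
 AND source.last_name = target.last_name
 AND source.address = target.address
 AND source.email = target.email
 AND source.phone = target.phone
 AND source.post_code = target.post_code
WHERE target.user_id IS NULL

Here miss_count would be the row count of this result, total_count the row count of source, and matched_count = total_count - miss_count, matching the "details" mapping in the example.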