I agree with the separation of Griffin-DSL and Spark-SQL. I have a few concerns and suggestions about the details:

1. The "rule" in the "accuracy" example above is only a fragment of SQL, not a complete statement, so users would be confused if its "dsl.type" were simply set to "spark-sql" (see the sketch at the end of this message for what a complete statement would have to look like).

2. The benefit of separating Griffin-DSL and Spark-SQL is to reduce user confusion and let Griffin-DSL focus on describing the DQ domain. We could define the rule description in a specified format for each "dq.type", making it clearer and richer than a single "rule" string, as sketched below.

3. Regarding backward compatibility, we could add a "version" field to each rule or to the whole configuration: read it as "v1" by default when not set, and require "v2" for the new semantics (maybe in the future we'll have more versions).
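For illustration, points 2 and 3 combined might look like the following. This is only a strawman: the placement of the "version" field and the structured "accuracy" block (explicit join keys and compared columns instead of a free-form "rule" string) are my assumptions, not an agreed schema:

    "evaluate.rule": {
      "version": "v2",
      "rules": [
        {
          "dsl.type": "griffin-dsl",
          "dq.type": "accuracy",
          "out.dataframe.name": "accu",
          "accuracy": {
            "source": "source",
            "target": "target",
            "join.on": ["user_id"],
            "compare.columns": ["first_name", "last_name", "address",
                                "email", "phone", "post_code"]
          },
          "out": [
            { "type": "record", "name": "missRecords" }
          ]
        }
      ]
    }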
On Wed, Jan 30, 2019 at 7:11 AM Grant <[email protected]> wrote:

> We could have a SQL syntax checker using the existing parser logic.
>
> Once it detects a SQL expression with the DSL type "griffin-dsl", it
> could take the following steps:
> 1. attempt to delegate execution of the rule to the "spark-sql" type
> directly; whether the execution is successful or not, run step 2.
> 2. notify the user to use "spark-sql" in the future.
>
> We would keep the checker in the distribution for only a few releases
> (say, 2 or 3), and then remove it.
>
> Another thing I am thinking about: we should consider supporting UDFs
> provided by end users.
>
> On Tue, Jan 29, 2019 at 5:35 PM Nick Sokolov <[email protected]> wrote:
>
> > I think we need to maintain backward compatibility or provide easy
> > (automated?) migration -- otherwise existing users will be stuck on
> > older versions.
> >
> > On Tue, Jan 29, 2019 at 2:28 PM William Guo <[email protected]> wrote:
> >
> > > Thanks Grant.
> > >
> > > I agree Griffin-DSL should leverage Spark-SQL for the SQL part, and
> > > Griffin-DSL should work as a DQ layer assembling different
> > > dimensions, as MLlib does. Since we already have some experience in
> > > the data quality domain, it is now time for Griffin-DSL to evolve to
> > > the next level.
> > >
> > > Thanks,
> > > William
> > >
> > > On Wed, Jan 30, 2019 at 5:48 AM Grant <[email protected]> wrote:
> > >
> > > > Hi all,
> > > >
> > > > I would suggest simplifying Griffin-DSL.
> > > >
> > > > Currently, Griffin supports three types of DSL: spark-sql,
> > > > griffin-dsl and df-ops. In this proposal, I only focus on the
> > > > first two.
> > > >
> > > > Griffin-DSL is a SQL-like language that supports a wide range of
> > > > clauses, keywords, operators, etc., as Spark SQL does, and the
> > > > class "GriffinDslParser" defines how to parse the SQL-like syntax.
> > > > In fact, Griffin-DSL's SQL-like syntax could be covered by Spark
> > > > SQL completely: Spark 2.0 substantially improved SQL functionality
> > > > with SQL:2003 support and can now run all 99 TPC-DS queries.
> > > >
> > > > So is it possible for Griffin-DSL to drop all SQL-like language
> > > > features? Every rule that can be expressed in SQL would be
> > > > categorized under the "spark-sql" DSL type instead of
> > > > "griffin-dsl". In this case, we could simplify the implementation
> > > > of Griffin-DSL.
> > > >
> > > > In my understanding, Griffin-DSL should consist of higher-order
> > > > expressions, each of which represents a specific set of semantics.
> > > > Griffin-DSL would continue to focus on expressions with richer
> > > > semantics in the data exploration and wrangling area, and leave
> > > > all SQL-compatible expressions to Spark SQL. Griffin-DSL would
> > > > still be translated into Spark SQL when executed.
> > > > Here is an example from the unit test
> > > > "_accuracy-batch-griffindsl.json":
> > > >
> > > > "evaluate.rule": {
> > > >   "rules": [
> > > >     {
> > > >       "dsl.type": "griffin-dsl",
> > > >       "dq.type": "accuracy",
> > > >       "out.dataframe.name": "accu",
> > > >       "rule": "source.user_id = target.user_id AND upper(source.first_name) = upper(target.first_name) AND source.last_name = target.last_name AND source.address = target.address AND source.email = target.email AND source.phone = target.phone AND source.post_code = target.post_code",
> > > >       "details": {
> > > >         "source": "source",
> > > >         "target": "target",
> > > >         "miss": "miss_count",
> > > >         "total": "total_count",
> > > >         "matched": "matched_count"
> > > >       },
> > > >       "out": [
> > > >         {
> > > >           "type": "record",
> > > >           "name": "missRecords"
> > > >         }
> > > >       ]
> > > >     }
> > > >   ]
> > > > }
> > > >
> > > > If we move the SQL-like syntax out of Griffin-DSL, the preceding
> > > > example would take "dsl.type" as "spark-sql", and "rule" would
> > > > probably become a list of columns, or all columns by default.
> > > >
> > > > Discussions are welcome.
> > > >
> > > > Grant
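To make point 1 at the top concrete: if the quoted example were switched to "spark-sql" as-is, the predicate fragment would no longer be enough, and the user would have to write the complete miss-records query themselves. A hypothetical sketch (the query shape is my assumption of what accuracy-as-SQL would require, with the column list abbreviated):

    "evaluate.rule": {
      "rules": [
        {
          "dsl.type": "spark-sql",
          "out.dataframe.name": "missRecords",
          "rule": "SELECT source.* FROM source LEFT JOIN target ON source.user_id = target.user_id AND upper(source.first_name) = upper(target.first_name) AND source.last_name = target.last_name WHERE target.user_id IS NULL"
        }
      ]
    }

This is exactly the kind of boilerplate the "accuracy" dq.type hides today, which is why I would rather keep a structured Griffin-DSL description and translate it to Spark SQL internally.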
