I agree with Lionel's suggestions:
1. define the rule descriptions for each "dq.type" under the umbrella of Griffin-DSL
2. add a version field in the configuration
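For point 2, here is a minimal sketch of what a versioned configuration
might look like (the field name "version" and its placement are my
assumptions; the exact schema would need to be agreed on):

    {
      "version": "0.4.0",
      "evaluate.rule": {
        "rules": [
          {
            "dsl.type": "griffin-dsl",
            "dq.type": "accuracy",
            "rule": "..."
          }
        ]
      }
    }

A conversion tool could then read this field to decide which migration
steps to apply when upgrading an older configuration.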
To Nick and Eugene,

As for backward compatibility, as Eugene said, Griffin is still at an
early stage and may inevitably experience significant design changes. So
personally I am inclined to adopt a relatively simple policy for now.
Also, we could consider building tools to convert configurations from a
lower version to a higher one.

Thanks

On Wed, Jan 30, 2019 at 1:18 AM Eugene Liu <[email protected]> wrote:

> At this early stage, I think Griffin should consider a smooth
> upgrade/migration strategy, allowing all lower versions to transform up
> to the latest release.
>
> After some stable releases like 1.x, 2.x..., maybe we do not need to
> consider rolling upgrades.
> ------------------------------
> *From:* Grant <[email protected]>
> *Sent:* Wednesday, January 30, 2019 7:06 AM
> *To:* [email protected]
> *Subject:* Re: Simplify Griffin-DSL implementation
>
> We could have a SQL syntax checker using the existing parser logic.
>
> Once it detects a SQL expression with the DSL type "griffin-dsl", it
> could take the following steps:
> 1. attempt to delegate the execution of the rule to the "spark-sql" type
> directly; whether the execution is successful or not, run step 2
> 2. notify the user to use "spark-sql" in the future
>
> We would keep the checker in the distribution only for several releases
> (say, 2 or 3), and then remove it.
>
> Another thing I am thinking about is that we should consider supporting
> UDFs provided by end users.
>
> On Tue, Jan 29, 2019 at 5:35 PM Nick Sokolov <[email protected]> wrote:
>
> > I think we need to maintain backward compatibility or provide an easy
> > (automated?) migration -- otherwise existing users will be stuck on
> > older versions.
> >
> > On Tue, Jan 29, 2019 at 2:28 PM William Guo <[email protected]> wrote:
> >
> > > Thanks Grant.
> > >
> > > I agree Griffin-DSL should leverage spark-sql for the SQL part, and
> > > Griffin-DSL should work as a DQ layer to assemble different
> > > dimensions, as MLlib does.
> > > Since we already have some experience in the data quality domain, it
> > > is now time for Griffin-DSL to evolve to the next level.
> > >
> > > Thanks,
> > > William
> > >
> > >
> > > On Wed, Jan 30, 2019 at 5:48 AM Grant <[email protected]> wrote:
> > >
> > > > Hi all,
> > > >
> > > > I would suggest simplifying Griffin-DSL.
> > > >
> > > > Currently, Griffin supports three types of DSL: spark-sql,
> > > > griffin-dsl and df-ops. In this proposal, I only focus on the
> > > > first two.
> > > >
> > > > Griffin-DSL is a SQL-like language, supporting a wide range of
> > > > clauses, keywords, operators, etc., just as Spark SQL does. The
> > > > class "GriffinDslParser" also defines how to parse the SQL-like
> > > > syntax. Actually, Griffin-DSL's SQL-like syntax could be covered
> > > > by Spark SQL completely. Spark 2.0 substantially improved SQL
> > > > functionality with SQL:2003 support and can now run all 99 TPC-DS
> > > > queries.
> > > >
> > > > So is it possible for Griffin-DSL to remove all SQL-like language
> > > > features? All rules that could be expressed in SQL would be
> > > > categorized under the "spark-sql" DSL type instead of
> > > > "griffin-dsl". In this case, we could simplify the implementation
> > > > of Griffin-DSL.
> > > >
> > > > In my understanding, Griffin-DSL should consist of high-order
> > > > expressions, each of which represents a specific set of semantics.
> > > > Griffin-DSL would continue focusing on expressions with richer
> > > > semantics in the data exploration and wrangling areas, and leave
> > > > all SQL-compatible expressions to Spark SQL. Griffin-DSL would
> > > > still be translated into Spark SQL when executed.
> > > >
> > > > Here is an example from the unit test
> > > > "_accuracy-batch-griffindsl.json":
> > > >
> > > > "evaluate.rule": {
> > > >   "rules": [
> > > >     {
> > > >       "dsl.type": "griffin-dsl",
> > > >       "dq.type": "accuracy",
> > > >       "out.dataframe.name": "accu",
> > > >       "rule": "source.user_id = target.user_id AND upper(source.first_name) = upper(target.first_name) AND source.last_name = target.last_name AND source.address = target.address AND source.email = target.email AND source.phone = target.phone AND source.post_code = target.post_code",
> > > >       "details": {
> > > >         "source": "source",
> > > >         "target": "target",
> > > >         "miss": "miss_count",
> > > >         "total": "total_count",
> > > >         "matched": "matched_count"
> > > >       },
> > > >       "out": [
> > > >         {
> > > >           "type": "record",
> > > >           "name": "missRecords"
> > > >         }
> > > >       ]
> > > >     }
> > > >   ]
> > > > }
> > > >
> > > > If we move SQL-like syntax out of Griffin-DSL, the preceding
> > > > example would take "dsl.type" as "spark-sql", and "rule" would
> > > > probably become a list of columns, or all columns by default.
> > > >
> > > > Discussions are welcome.
> > > >
> > > > Grant
> > > >
> > >
> >
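For illustration, a sketch of the simplified form suggested at the end of
the proposal above, where "dsl.type" becomes "spark-sql" and "rule"
shrinks to a list of columns to compare (the exact field layout is an
assumption; the thread has not settled the simplified schema):

    {
      "dsl.type": "spark-sql",
      "dq.type": "accuracy",
      "out.dataframe.name": "accu",
      "rule": ["user_id", "first_name", "last_name", "address",
               "email", "phone", "post_code"]
    }

Griffin would then generate the column-comparison SQL internally instead
of parsing a SQL-like expression itself.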

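And, since an accuracy rule is ultimately translated into Spark SQL
anyway, the rule could alternatively carry the generated query directly.
Roughly, for the example above (this is an approximation of the
miss-records anti-join, not the exact SQL Griffin emits):

    {
      "dsl.type": "spark-sql",
      "dq.type": "accuracy",
      "out.dataframe.name": "missRecords",
      "rule": "SELECT source.* FROM source LEFT JOIN target ON source.user_id = target.user_id AND upper(source.first_name) = upper(target.first_name) AND source.last_name = target.last_name AND source.address = target.address AND source.email = target.email AND source.phone = target.phone AND source.post_code = target.post_code WHERE target.user_id IS NULL"
    }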