Hi all,

I would suggest simplifying Griffin-DSL.

Currently, Griffin supports three types of DSL: spark-sql, griffin-dsl and
df-ops respectively. In this proposal, I only focus on the first two.

Griffin-DSL is a SQL-like language that supports a wide range of clauses, keywords,
operators, etc., much like Spark SQL, and the class "GriffinDslParser" defines
how to parse this SQL-like syntax. In fact, Griffin-DSL's SQL-like syntax
could be covered completely by Spark SQL. Spark 2.0 substantially improved
SQL functionality with SQL:2003 support and can now run all 99 TPC-DS
queries.

So, is it possible to remove all SQL-like language features from Griffin-DSL?
All rules that can be expressed in SQL would be categorized under the
"spark-sql" DSL type instead of "griffin-dsl". This would let us
simplify the implementation of Griffin-DSL considerably.

In my understanding, Griffin-DSL should consist of higher-order expressions,
each representing a specific set of semantics. Griffin-DSL would keep
focusing on expressions with richer semantics in the data exploration
and wrangling area, and leave all SQL-compatible expressions to Spark SQL.
Griffin-DSL would still be translated into Spark SQL when executed.

Here is an example from the unit test "_accuracy-batch-griffindsl.json":

"evaluate.rule": {
    "rules": [
      {
        "dsl.type": "griffin-dsl",
        "dq.type": "accuracy",
        "out.dataframe.name": "accu",
        "rule": "source.user_id = target.user_id AND
upper(source.first_name) = upper(target.first_name) AND source.last_name =
target.last_name AND source.address = target.address AND source.email =
target.email AND source.phone = target.phone AND source.post_code =
target.post_code",
        "details": {
          "source": "source",
          "target": "target",
          "miss": "miss_count",
          "total": "total_count",
          "matched": "matched_count"
        },
        "out":[
          {
            "type": "record",
            "name": "missRecords"
          }
        ]
      }
    ]
  }
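
For illustration, the accuracy rule above corresponds roughly to a Spark SQL
query along these lines (a hand-written sketch, not the exact SQL Griffin
generates internally; table and column names are taken from the example):

```sql
-- Sketch: find source records with no matching record in target
-- ("missRecords" in the example's output section)
SELECT source.*
FROM source
LEFT JOIN target
  ON  source.user_id = target.user_id
  AND upper(source.first_name) = upper(target.first_name)
  AND source.last_name = target.last_name
  AND source.address = target.address
  AND source.email = target.email
  AND source.phone = target.phone
  AND source.post_code = target.post_code
WHERE target.user_id IS NULL
```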

  If we move the SQL-like syntax out of Griffin-DSL, the preceding example
would take "dsl.type" as "spark-sql", and "rule" would probably become a list
of columns, or all columns by default.
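
Under this proposal, the same rule might be written as plain Spark SQL, for
example (a hypothetical sketch of what the config could look like; the exact
field names and query shape would need to be agreed on, and the join condition
is abbreviated here):

```json
{
  "dsl.type": "spark-sql",
  "dq.type": "accuracy",
  "out.dataframe.name": "accu",
  "rule": "SELECT source.* FROM source LEFT JOIN target ON source.user_id = target.user_id AND upper(source.first_name) = upper(target.first_name) AND source.last_name = target.last_name WHERE target.user_id IS NULL",
  "out": [
    {
      "type": "record",
      "name": "missRecords"
    }
  ]
}
```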

  Discussion is welcome.

Grant
