[GitHub] [spark] EnricoMi opened a new pull request #26936: [SPARK-30296][SQL] Add Dataset diffing feature

GitBox Wed, 18 Dec 2019 04:23:25 -0800

EnricoMi opened a new pull request #26936: [SPARK-30296][SQL] Add Dataset 
diffing feature
URL: https://github.com/apache/spark/pull/26936
 
 
   ### What changes were proposed in this pull request?
   Adds a `diff` transformation to `Dataset` that computes the differences 
between the two datasets, i.e. which rows of `this` dataset to _add_, _delete_ 
or _change_ to get to the given dataset.
   
   With
   ```
   val left = Seq((1, "one"), (2, "two"), (3, "three")).toDF("id", "value")
   val right = Seq((1, "one"), (2, "Two"), (4, "four")).toDF("id", "value")
   ```
   Diffing becomes as easy as:
   ```
   left.diff(right).show()
   ```
   |diff| id|value|
   |----|---|-----|
   |   N|  1|  one|
   |   D|  2|  two|
   |   I|  2|  Two|
   |   D|  3|three|
   |   I|  4| four|
   
   With columns that provide unique identifiers per row (here `id`), the diff 
looks like:
   ```
   left.diff(right, "id").show()
   ```
   |diff| id|left_value|right_value|
   |----|---|----------|-----------|
   |   N|  1|       one|        one|
   |   C|  2|       two|        Two|
   |   D|  3|     three|       null|
   |   I|  4|      null|       four|
   
   
   An equivalent alternative is this hand-crafted transformation:
   ```
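   // In spark-shell these are already in scope; elsewhere import them explicitly
   // (`spark` is the active SparkSession).
   import org.apache.spark.sql.functions.{lit, when}
   import spark.implicits._

   // The "exists" marker tells a missing row apart from a row whose columns are null.
   left.withColumn("exists", lit(1)).as("l")
     .join(right.withColumn("exists", lit(1)).as("r"),
       $"l.id" <=> $"r.id",                       // null-safe equality, so null ids match
       "fullouter")
     .withColumn("diff",
       when($"l.exists".isNull, "I").             // only in right: insert
         when($"r.exists".isNull, "D").           // only in left: delete
         when(!($"l.value" <=> $"r.value"), "C"). // in both, value differs: change
         otherwise("N"))                          // in both, identical: no change
     .show()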
   ```
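
   For wider schemas, the hand-crafted join above could be generalized roughly as follows. This is only a sketch against the stock Spark API, not the implementation in this pull request: `ManualDiff` and `manualDiff` are hypothetical names, and the sketch assumes that both sides share the same columns, that no column is called `exists`, and that column names contain no dots.

   ```scala
   import org.apache.spark.sql.{Column, DataFrame, SparkSession}
   import org.apache.spark.sql.functions.{coalesce, col, lit, when}

   // Hypothetical helper for illustration; not part of the proposed Dataset API.
   object ManualDiff {
     def manualDiff(left: DataFrame, right: DataFrame, idColumns: Seq[String]): DataFrame = {
       require(idColumns.nonEmpty, "at least one id column is needed for the join")
       val valueColumns = left.columns.toSeq.filterNot(c => idColumns.contains(c))

       // Marker column: distinguishes "row missing" from "row present with null values".
       val l = left.withColumn("exists", lit(1)).as("l")
       val r = right.withColumn("exists", lit(1)).as("r")

       // Rows belong together when all id columns are null-safe equal.
       val sameId: Column =
         idColumns.map(c => col(s"l.$c") <=> col(s"r.$c")).reduce(_ && _)
       // A row changed when any non-id column differs (null-safe); false if there are none.
       val changed: Column =
         valueColumns.map(c => !(col(s"l.$c") <=> col(s"r.$c"))).foldLeft(lit(false))(_ || _)

       val diff = when(col("l.exists").isNull, "I")
         .when(col("r.exists").isNull, "D")
         .when(changed, "C")
         .otherwise("N")
         .as("diff")

       // One column per id, then a left_/right_ pair per non-id column.
       val ids = idColumns.map(c => coalesce(col(s"l.$c"), col(s"r.$c")).as(c))
       val values = valueColumns.flatMap { c =>
         Seq(col(s"l.$c").as(s"left_$c"), col(s"r.$c").as(s"right_$c"))
       }

       l.join(r, sameId, "fullouter").select(diff +: (ids ++ values): _*)
     }

     def main(args: Array[String]): Unit = {
       val spark = SparkSession.builder().master("local[1]").appName("manual-diff").getOrCreate()
       import spark.implicits._

       val left = Seq((1, "one"), (2, "two"), (3, "three")).toDF("id", "value")
       val right = Seq((1, "one"), (2, "Two"), (4, "four")).toDF("id", "value")

       manualDiff(left, right, Seq("id")).show()
       spark.stop()
     }
   }
   ```

   With `Seq("id")` this should yield the same four rows as the `left.diff(right, "id")` table above, possibly in a different order. Passing every column as an id column (`Seq("id", "value")`) leaves no non-id columns to compare, so a modified row shows up as a `D` plus an `I`, which matches the shape of the id-less `left.diff(right)` output.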
   
   Statistics on the differences can be obtained with:
   ```
   left.diff(right, "id").groupBy("diff").count().show()
   ```
   
   |diff|count|
   |----|-----|
   |   N|    1|
   |   I|    1|
   |   D|    1|
   |   C|    1|
   
   This `diff` provides the following features:
   * id columns are optional
   * provides typed `diffAs` transformations
   * supports null values in id and non-id columns
   * detects null value insertion / deletion
   * configurable via `DiffOptions`:
     * diff column name (default: `"diff"`), for diffing datasets that already contain a `diff` column
     * diff action labels (defaults: `"N"`, `"I"`, `"D"`, `"C"`), which allow custom diff notation, e.g. Unix diff left-right notation (`<`, `>`) or git before-after format (`+`, `-`, `-+`)
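
   The null handling listed above is also why the hand-crafted transformation uses `<=>` rather than `===`. The following sketch uses only stock Spark and compares the two operators on nullable columns; `NullSafeEquality` and `compare` are hypothetical names chosen for this illustration.

   ```scala
   import org.apache.spark.sql.{DataFrame, SparkSession}

   // Hypothetical demo object; it does not use the proposed diff API.
   object NullSafeEquality {
     // Compares two nullable columns with both equality operators.
     def compare(spark: SparkSession): DataFrame = {
       import spark.implicits._
       val pairs = Seq[(Option[String], Option[String])](
         (Some("one"), Some("one")),
         (Some("two"), None),
         (None, None)
       ).toDF("l", "r")

       pairs.select($"l", $"r",
         ($"l" === $"r").as("eq"),            // null whenever either side is null
         ($"l" <=> $"r").as("null_safe_eq"))  // always true or false
     }

     def main(args: Array[String]): Unit = {
       val spark = SparkSession.builder().master("local[1]").appName("null-safe-eq").getOrCreate()
       compare(spark).show()
       spark.stop()
     }
   }
   ```

   Because `===` evaluates to null when a side is null, a join or change check built on it would silently drop or misclassify rows with null ids or null values. The separate `exists` marker serves a related purpose: it tells a missing row apart from a row that is present but holds nulls.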
   
   ### Why are the changes needed?
   Your evolving code needs frequent regression testing to prove it still 
produces identical results or, if changes are expected, to investigate those 
changes. Diffing the results of two code paths provides the confidence you 
need. Diffing small schemata is easy, but with a wide schema the Spark query 
becomes laborious and error-prone. With a single proven and tested method, 
diffing becomes an easier and more reliable operation. This has proven 
useful for interactive Spark sessions as well as deployed production code.
   
   ### Does this PR introduce any user-facing change?
   Yes, it provides new transformations added to `Dataset`:
   
   * `def diff(other: Dataset[T], idColumns: String*): DataFrame`
   * `def diff(other: Dataset[T], options: DiffOptions, idColumns: String*): DataFrame`
   * `def diffAs[U](other: Dataset[T], idColumns: String*)(implicit diffEncoder: Encoder[U]): Dataset[U]`
   * `def diffAs[U](other: Dataset[T], options: DiffOptions, idColumns: String*)(implicit diffEncoder: Encoder[U]): Dataset[U]`
   * `def diffAs[U](other: Dataset[T], diffEncoder: Encoder[U], idColumns: String*): Dataset[U]`
   * `def diffAs[U](other: Dataset[T], options: DiffOptions, diffEncoder: Encoder[U], idColumns: String*): Dataset[U]`
   
   ### How was this patch tested?
   There is a new suite with plenty of tests: `DiffSuite`.


