[jira] [Commented] (DATAFU-159) Add diff functionality to datafu-spark

Anna O (Jira) Wed, 19 Feb 2025 05:36:19 -0800


    [ 
https://issues.apache.org/jira/browse/DATAFU-159?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17928424#comment-17928424
 ]


Anna O commented on DATAFU-159:
-------------------------------

Sure, [~eyal]
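{code:java}
// Assumes a spark-shell session with a SparkSession named `spark`; required for .toDF
import spark.implicits._

// First DataFrame: id, name, age, and a map of skill -> level
val df1 = Seq(
  (1, "Alice", 30, Map("skill1" -> "expert", "skill2" -> "beginner", "skill3" -> "intermediate")),
  (2, "Bob", 25, Map("skill1" -> "expert")),
  (3, "Charlie", 35, Map("skill2" -> "expert", "skill3" -> "expert"))
).toDF("id", "name", "age", "skills")

// Second DataFrame: Alice's map has the same entries in a different order,
// Bob differs in age and skills, and id 4 appears instead of id 3
val df2 = Seq(
  (1, "Alice", 30, Map("skill2" -> "beginner", "skill3" -> "intermediate", "skill1" -> "expert")),
  (2, "Bob", 24, Map("skill2" -> "expert")),
  (4, "David", 40, Map("skill1" -> "beginner"))
).toDF("id", "name", "age", "skills")

// Key column(s) used to match rows between the two DataFrames
val keys = Some(List("id"))
compareDFs(df1, df2, keys).show(false)
{code}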
The output:
{code:java}
+-------+----------------------------+-----+
|column |metric                      |value|
+-------+----------------------------+-----+
|name   |non_numeric_diff_percent    |0.0  |
|age    |min_diff                    |0.0  |
|age    |max_diff                    |4.0  |
|age    |mean_diff                   |2.0  |
|age    |one_sided_null_percent      |0.0  |
|age    |stddev_diff                 |2.83 |
|age    |under_1%_diff_percent       |50.0 |
|age    |under_5%_diff_percent       |100.0|
|age    |under_10%_diff_percent      |100.0|
|skills |non_numeric_diff_percent    |50.0 |
|general|df2_non_matched_keys_percent|33.33|
|general|df1_non_matched_keys_percent|33.33|
+-------+----------------------------+-----+
{code}
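
The age rows above are consistent with a per-row percentage difference taken relative to the df1 value and computed only over the ids present in both DataFrames (1 and 2). The plain-Scala sketch below reproduces those numbers under that assumption. It is one reading of the output, not the compareDFs implementation, and the names AgeDiffSketch, pctDiffs and underPercent are made up for illustration.

```scala
object AgeDiffSketch {
  // (age in df1, age in df2) for the matched ids 1 and 2 from the example
  val matchedAges: Seq[(Double, Double)] = Seq((30.0, 30.0), (25.0, 24.0))

  // Assumption: diff = |df1 - df2| * 100 / |df1|, i.e. a percentage of the df1 value
  val pctDiffs: Seq[Double] =
    matchedAges.map { case (a, b) => math.abs(a - b) * 100.0 / math.abs(a) }

  val mean: Double = pctDiffs.sum / pctDiffs.size

  // Sample standard deviation (n - 1 in the denominator)
  val stddev: Double =
    math.sqrt(pctDiffs.map(d => math.pow(d - mean, 2)).sum / (pctDiffs.size - 1))

  // Share of matched rows (in percent) whose diff is strictly below the threshold
  def underPercent(threshold: Double): Double =
    pctDiffs.count(_ < threshold) * 100.0 / pctDiffs.size

  def main(args: Array[String]): Unit = {
    println(s"min_diff    = ${pctDiffs.min}")
    println(s"max_diff    = ${pctDiffs.max}")
    println(s"mean_diff   = $mean")
    println(s"stddev_diff = $stddev")
    println(s"under 1     = ${underPercent(1.0)}")
    println(s"under 5     = ${underPercent(5.0)}")
  }
}
```

Under this reading, only Bob's age changes (25 to 24, a 4% difference), which is why half of the matched rows fall under the 1% threshold and all of them fall under 5%.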

> Add diff functionality to datafu-spark
> --------------------------------------
>
>                 Key: DATAFU-159
>                 URL: https://issues.apache.org/jira/browse/DATAFU-159
>             Project: DataFu
>          Issue Type: New Feature
>            Reporter: Eyal Allweil
>            Priority: Major
>
> A useful feature when examining results is the ability to clearly understand 
> the differences between two datasets, for example when checking for 
> regressions between expected and actual results.
> Spark provides the _except_ functionality, but this is often not enough for 
> that purpose; for example, see [this question on Stack 
> Overflow|https://stackoverflow.com/questions/44338412/how-to-compare-two-dataframe-and-print-columns-that-are-different-in-scala].
> datafu-pig had a macro for doing this, and it could be a useful addition to 
> datafu-spark.
>  
>  
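
To illustrate the gap the issue description points at, here is a minimal plain-Scala sketch (no Spark) that contrasts a row-level set difference, similar in spirit to _except_, with a keyed comparison that names the columns that changed. The Person case class and the changedColumns helper are hypothetical; the rows reuse the id, name and age values from the example in the comment.

```scala
object RowVsColumnDiffSketch {
  // Hypothetical row type mirroring the id/name/age columns of the example
  final case class Person(id: Int, name: String, age: Int)

  val expected: Seq[Person] =
    Seq(Person(1, "Alice", 30), Person(2, "Bob", 25), Person(3, "Charlie", 35))
  val actual: Seq[Person] =
    Seq(Person(1, "Alice", 30), Person(2, "Bob", 24), Person(4, "David", 40))

  // Row-level difference: whole rows of `expected` that do not appear in `actual`.
  // It cannot tell a changed row (Bob) from a missing one (Charlie).
  val onlyInExpected: Set[Person] = expected.toSet -- actual.toSet

  // Names of the non-key columns whose values differ between two rows
  def changedColumns(a: Person, b: Person): Seq[String] =
    Seq("name" -> (a.name != b.name), "age" -> (a.age != b.age))
      .collect { case (column, true) => column }

  private val actualById: Map[Int, Person] = actual.map(p => p.id -> p).toMap

  // Keyed comparison: for ids present on both sides, list the changed columns
  val columnDiffs: Map[Int, Seq[String]] =
    expected
      .flatMap(e => actualById.get(e.id).map(a => e.id -> changedColumns(e, a)))
      .filter { case (_, columns) => columns.nonEmpty }
      .toMap

  def main(args: Array[String]): Unit = {
    println(onlyInExpected)
    println(columnDiffs)
  }
}
```

The set difference returns both Bob's and Charlie's rows without saying why, while the keyed comparison reports that id 2 differs only in age.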



