[ https://issues.apache.org/jira/browse/DATAFU-159?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17928424#comment-17928424 ]
Anna O commented on DATAFU-159:
-------------------------------

Sure, [~eyal]
{code:java}
val df1 = Seq(
  (1, "Alice", 30, Map("skill1" -> "expert", "skill2" -> "beginner", "skill3" -> "intermediate")),
  (2, "Bob", 25, Map("skill1" -> "expert")),
  (3, "Charlie", 35, Map("skill2" -> "expert", "skill3" -> "expert"))
).toDF("id", "name", "age", "skills")

val df2 = Seq(
  (1, "Alice", 30, Map("skill2" -> "beginner", "skill3" -> "intermediate", "skill1" -> "expert")),
  (2, "Bob", 24, Map("skill2" -> "expert")),
  (4, "David", 40, Map("skill1" -> "beginner"))
).toDF("id", "name", "age", "skills")

val keys = Some(List("id"))
compareDFs(df1, df2, keys).show(false)
{code}
The output:
{code:java}
+-------+----------------------------+-----+
|column |metric                      |value|
+-------+----------------------------+-----+
|name   |non_numeric_diff_percent    |0.0  |
|age    |min_diff                    |0.0  |
|age    |max_diff                    |4.0  |
|age    |mean_diff                   |2.0  |
|age    |one_sided_null_percent      |0.0  |
|age    |stddev_diff                 |2.83 |
|age    |under_1%_diff_percent       |50.0 |
|age    |under_5%_diff_percent       |100.0|
|age    |under_10%_diff_percent      |100.0|
|skills |non_numeric_diff_percent    |50.0 |
|general|df2_non_matched_keys_percent|33.33|
|general|df1_non_matched_keys_percent|33.33|
+-------+----------------------------+-----+
{code}

> Add diff functionality to datafu-spark
> --------------------------------------
>
>                 Key: DATAFU-159
>                 URL: https://issues.apache.org/jira/browse/DATAFU-159
>             Project: DataFu
>          Issue Type: New Feature
>            Reporter: Eyal Allweil
>            Priority: Major
>
> A useful feature when examining results is the ability to clearly understand
> the differences between two datasets - for example, doing regressions between
> expected and actual results.
> Spark provides the _except_ functionality, but this is often not enough for
> this - for example, see
> [this question on Stack Overflow.|https://stackoverflow.com/questions/44338412/how-to-compare-two-dataframe-and-print-columns-that-are-different-in-scala]
> Datafu-pig had a macro for doing this, and this could be a useful addition to
> datafu-spark.
>

--
This message was sent by Atlassian Jira
(v8.20.10#820010)