aray commented on code in PR #37303:
URL: https://github.com/apache/spark/pull/37303#discussion_r949774668


##########
sql/core/src/test/scala/org/apache/spark/sql/DataFrameFunctionsSuite.scala:
##########
@@ -41,6 +42,100 @@ import org.apache.spark.sql.types._
 class DataFrameFunctionsSuite extends QueryTest with SharedSparkSession {
   import testImplicits._
 
+  test("DataFrame function and SQL function parity") {
+    // This test compares the available list of DataFrame functions in
+    // org.apache.spark.sql.functions with the SQL function registry. This attempts to verify that
+    // the DataFrame functions are a subset of the functions in the SQL function registry (subject
+    // to exclusions and expectations). It also produces a list of the differences between the two.
+    // See also test_function_parity in test_functions.py.
+    //
+    // NOTE FOR DEVELOPERS:
+    // If this test fails, one of the following needs to happen:
+    // * If a function was added to org.apache.spark.sql.functions but not the function registry,
+    //     add it to the below expectedOnlyDataFrameFunctions set.
+    // * If it's not related to an added function, then likely one of the exclusion lists below
+    //     needs to be updated.
+
+    val excludedDataFrameFunctions = Set(
+      "approxCountDistinct", "bitwiseNOT", "callUDF", "monotonicallyIncreasingId", "shiftLeft",
+      "shiftRight", "shiftRightUnsigned", "sumDistinct", "toDegrees", "toRadians",
+      // all deprecated
+      "asc", "asc_nulls_first", "asc_nulls_last", "desc", "desc_nulls_first", "desc_nulls_last",
+      // sorting in sql is not a function
+      "bitwise_not", // equivalent to ~expression in sql
+      "broadcast", // hints are not done with functions in sql
+      "call_udf", // moot in SQL as you just call the function directly
+      "col", "column", "expr", "lit", "negate", // first class functionality in SQL
+      "countDistinct", "count_distinct", // equivalent to count(distinct foo)
+      "sum_distinct", // equivalent to sum(distinct foo)
+      "typedLit", "typedlit", // Scala only
+      "udaf", "udf" // create function statement in sql
+    )
+
+    val excludedSqlFunctions = Set(
+      "random", "ceiling", "negative", "sign", "first_value", "last_value",
+      "approx_percentile", "std", "array_agg", "char_length", "character_length",
+      "lcase", "position", "printf", "substr", "ucase", "day", "cardinality", "sha",
+      "getbit",
+      // aliases for existing functions
+      "reflect", "java_method" // Only needed in SQL
+    )
+
+    val expectedOnlyDataFrameFunctions = Set(
+      "bucket", "days", "hours", "months", "years", // Datasource v2 partition transformations
+      "product", // Discussed in https://github.com/apache/spark/pull/30745
+      "unwrap_udt"
+    )
+
+    // We only consider functions matching this pattern; this excludes symbolic and other
+    // functions that are not relevant to this comparison.
+    val word_pattern = """\w*"""
+
+    // Set of DataFrame functions in org.apache.spark.sql.functions
+    val dataFrameFunctions = functions.getClass
+      .getDeclaredMethods
+      .filter(m => Modifier.isPublic(m.getModifiers))
+      .map(_.getName)
+      .toSet
+      .filter(_.matches(word_pattern))
+      .diff(excludedDataFrameFunctions)
+
+    // Set of SQL functions in the builtin function registry
+    val sqlFunctions = FunctionRegistry.functionSet
+      .map(f => f.funcName)
+      .filter(_.matches(word_pattern))
+      .diff(excludedSqlFunctions)
+
+    val commonCount = dataFrameFunctions.intersect(sqlFunctions).size
+
+    val onlyDataFrameFunctions = dataFrameFunctions.diff(sqlFunctions)
+    val onlySqlFunctions = sqlFunctions.diff(dataFrameFunctions)
+
+    // Check that we did not incorrectly exclude any functions leading to false positives
+    assert(onlyDataFrameFunctions.intersect(excludedSqlFunctions) === Set.empty)
+    assert(onlySqlFunctions.intersect(excludedDataFrameFunctions) === Set.empty)
+
+    // Check that only expected functions are left
+    assert(onlyDataFrameFunctions === expectedOnlyDataFrameFunctions)
+
+    // scalastyle:off println
+    println("Report: DataFrame function and SQL function parity")
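
To illustrate the reflection step in the diff above, here is a minimal, self-contained sketch. `ToyFunctions` and `ReflectionSketch` are hypothetical stand-ins (not Spark code); the real test runs the same pipeline against `org.apache.spark.sql.functions`:

```scala
// Sketch of the reflection-based enumeration: collect the public, word-named
// methods of an object via getDeclaredMethods. ToyFunctions is a made-up
// stand-in for org.apache.spark.sql.functions.
import java.lang.reflect.Modifier

object ToyFunctions {
  def upper(s: String): String = s.toUpperCase
  def lower(s: String): String = s.toLowerCase
  def ++(a: Int, b: Int): Int = a + b // symbolic name; filtered out below
}

object ReflectionSketch {
  def main(args: Array[String]): Unit = {
    val wordPattern = """\w*"""

    val names = ToyFunctions.getClass
      .getDeclaredMethods                             // every method declared on the object
      .filter(m => Modifier.isPublic(m.getModifiers)) // public methods only
      .map(_.getName)
      .toSet
      .filter(_.matches(wordPattern))                 // drops encoded symbolic names like $plus$plus

    println(names.toSeq.sorted.mkString(", "))
  }
}
```

Symbolic method names such as `++` are encoded to `$plus$plus` in bytecode, so the `\w*` filter excludes them, which is why the real test can restrict itself to word-like names.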

Review Comment:
   Since we don't have any explicit check on the large list of functions that are only available in SQL, I thought it would be useful for developers to produce a report of the functions that are only available there. Granted, they will need to explicitly run this single test to get that report in a usable fashion. For example, this report (after PR feedback) will look like:
   
   Report: DataFrame function and SQL function parity
     There are 228 relevant functions in the DataFrame API
     There are 366 relevant functions in the SQL function registry
     Number of functions in both sets: 221
     There are 145 SQL functions that are not in the DataFrame API: aes_decrypt, aes_encrypt, and, any, any_value, array_size, bigint, binary, bit_and, bit_count, bit_get, bit_or, bit_xor, bool_and, bool_or, boolean, btrim, cast, char, chr, contains, convert_timezone, count_if, count_min_sketch, current_catalog, current_database, current_timezone, current_user, date, date_from_unix_date, date_part, decimal, div, double, e, elt, endswith, equal_null, every, extract, find_in_set, float, histogram_numeric, if, ifnull, ilike, in, inline, inline_outer, input_file_block_length, input_file_block_start, int, isnotnull, json_array_length, json_object_keys, left, like, ln, make_dt_interval, make_interval, make_timestamp, make_timestamp_ltz, make_timestamp_ntz, make_ym_interval, median, mod, mode, named_struct, now, nullif, nvl, nvl2, or, parse_url, percentile, pi, positive, power, regexp, regexp_count, regexp_extract_all, regexp_instr, regexp_like, regexp_substr, regr_avgx, regr_avgy, regr_count, regr_intercept, regr_r2, regr_slope, regr_sxx, regr_sxy, regr_syy, replace, right, rlike, smallint, some, space, split_part, stack, startswith, str_to_map, string, timestamp, timestamp_micros, timestamp_millis, tinyint, to_binary, to_char, to_number, to_timestamp_ltz, to_timestamp_ntz, to_unix_timestamp, try_add, try_avg, try_divide, try_element_at, try_multiply, try_subtract, try_sum, try_to_binary, try_to_number, try_to_timestamp, typeof, unix_date, unix_micros, unix_millis, unix_seconds, url_decode, url_encode, user, uuid, version, weekday, width_bucket, xpath, xpath_boolean, xpath_double, xpath_float, xpath_int, xpath_long, xpath_number, xpath_short, xpath_string
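
The report itself is plain set algebra over the two name sets. A toy sketch of that logic (the function sets below are made up for illustration, not the real registries):

```scala
// Sketch of the parity report: intersect/diff over two name sets, mirroring
// the commonCount / onlyDataFrameFunctions / onlySqlFunctions computation in
// the test. The sets are hypothetical examples.
object ParityReportSketch {
  def main(args: Array[String]): Unit = {
    val dataFrameFunctions = Set("upper", "lower", "abs", "product")
    val sqlFunctions       = Set("upper", "lower", "abs", "typeof", "ln")

    val commonCount   = dataFrameFunctions.intersect(sqlFunctions).size
    val onlyDataFrame = dataFrameFunctions.diff(sqlFunctions)
    val onlySql       = sqlFunctions.diff(dataFrameFunctions)

    println(s"Number of functions in both sets: $commonCount")               // 3
    println(s"DataFrame only: ${onlyDataFrame.toSeq.sorted.mkString(", ")}") // product
    println(s"SQL only: ${onlySql.toSeq.sorted.mkString(", ")}")             // ln, typeof
  }
}
```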
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

