asmello edited a comment on issue #23882: [SPARK-26979][PySpark][WIP] Add missing column name support for some SQL functions
URL: https://github.com/apache/spark/pull/23882#issuecomment-468942194

Ok, this was tedious. I've whitelisted all functions defined in `functions.py`, save for the previously identified exceptions, which were all being defined automatically. All of these functions take columns as arguments, and they **explicitly** handle Column/name duality:

```
approx_count_distinct coalesce corr covar_pop covar_samp countDistinct first
grouping grouping_id isnan isnull last nanvl round bround shiftLeft shiftRight
shiftRightUnsigned struct greatest least when log log2 conv factorial lag col
year quarter month dayofweek dayofmonth dayofyear hour minute second weekofyear
date_add date_sub datediff add_months months_between to_date to_timestamp trunc
date_trunc next_day last_day from_unixtime unix_timestamp from_utc_timestamp
to_utc_timestamp window crc32 md5 sha1 sha2 hash concat_ws decode encode
format_number format_string instr substring substring_index levenshtein locate
lpad rpad repeat split regexp_extract regexp_replace soundex bin hex unhex
length translate create_map map_from_arrays array array_contains arrays_overlap
slice concat array_position element_at array_remove array_distinct
array_intersect array_union array_except explode posexplode explode_outer
posexplode_outer get_json_object json_tuple from_json to_json schema_of_json
schema_of_csv size array_min array_max sort_array array_sort shuffle reverse
flatten map_keys map_values map_entries map_from_entries array_repeat
arrays_zip map_concat sequence from_csv
```

@HyukjinKwon you mentioned `from_json()`, but that's not a true exception. The schema argument is not meant to be an actual column; it's a schema specification, so it's a completely different case. That same function also takes an actual column to operate on, which it explicitly accepts as a name or instance.
So there's absolutely no problem there.

As for the DataFrame methods in `dataframe.py`, these also take column names or instances and do explicit conversion:

```
sampleBy sort select groupBy rollup cube dropna fillna replace drop toDF
```

I was even a little surprised by `drop()`; I thought it would only take names, but it handles both cases. **EDIT: Sorry, I double checked: it does support both, but only when you give a single argument. If you drop more than one column, it enforces a check for strings, because there is no `Seq[Column]` overload on the Scala side.**

The following functions are also 100% compliant with both API styles, but they rely on the JVM backend to do the conversion:

```
join sortWithinPartitions describe __getitem__ __getattr__ dropDuplicates
```

This is something that should be kept in mind should the Scala API change.

The following functions are violations; they only take **strings**:

```
approxQuantile corr cov crosstab freqItems
```

These are exceptions because they're implemented in `org.apache.spark.sql.DataFrameStatFunctions`, where for some reason almost all functions only take name arguments. The PySpark side never converts from a Column object to a string, so it also only supports the name spec. This is something that should be fixed on the Scala side IMO.

Since this patch is about making _name_ support universal, however, I don't think it's in scope to do anything about the `drop()` and stats violations for now.

One other special case I should mention is `colRegex`, which also only takes strings. But in that case it just wouldn't make sense to take Column objects (how would you specify a regex that way?), so that is fine.

With these checks in place, I'm confident the list of exceptions in this PR's description **is** exhaustive. I will work on solving those now as agreed (by redefining the `_create_function` argument passing strategy to do explicit conversion).
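The fix mentioned at the end, redefining the `_create_function` argument passing strategy to do explicit conversion, might look roughly like this. This is a hypothetical sketch with stand-in classes, not the actual patch or PySpark's real `_create_function`:

```python
# Hypothetical sketch: a _create_function factory whose wrappers convert
# a name argument to a Column before delegating. Stand-in classes only.

class Column:
    """Stand-in for pyspark.sql.Column."""
    def __init__(self, name):
        self.name = name


def _to_column(col):
    """Normalize a str-or-Column argument to a Column."""
    return col if isinstance(col, Column) else Column(col)


def _create_function(name):
    """Build a SQL-function wrapper that accepts a Column or a name."""
    def _(col):
        jc = _to_column(col)  # explicit conversion happens here
        # Stand-in for invoking the JVM-side function on the column:
        return Column(f"{name}({jc.name})")
    _.__name__ = name
    return _


# Auto-generated functions now accept both argument styles:
isnan = _create_function("isnan")
assert isnan("x").name == isnan(Column("x")).name == "isnan(x)"
```

With conversion centralized in the factory, every auto-defined function gets name support without touching each definition individually.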
