asmello edited a comment on issue #23882: [SPARK-26979][PySpark][WIP] Add missing column name support for some SQL functions
URL: https://github.com/apache/spark/pull/23882#issuecomment-468942194

Ok, this was tedious. I've whitelisted all functions defined in `functions.py`, save for the previously identified exceptions, which were all being defined automatically. All of these functions take columns as arguments, and they **explicitly** handle Column/name duality:

```
approx_count_distinct coalesce corr covar_pop covar_samp countDistinct first
grouping grouping_id isnan isnull last nanvl round bround shiftLeft shiftRight
shiftRightUnsigned struct greatest least when log log2 conv factorial lag col
year quarter month dayofweek dayofmonth dayofyear hour minute second weekofyear
date_add date_sub datediff add_months months_between to_date to_timestamp trunc
date_trunc next_day last_day from_unixtime unix_timestamp from_utc_timestamp
to_utc_timestamp window crc32 md5 sha1 sha2 hash concat_ws decode encode
format_number format_string instr substring substring_index levenshtein locate
lpad rpad repeat split regexp_extract regexp_replace soundex bin hex unhex
length translate create_map map_from_arrays array array_contains arrays_overlap
slice concat array_position element_at array_remove array_distinct
array_intersect array_union array_except explode posexplode explode_outer
posexplode_outer get_json_object json_tuple from_json to_json schema_of_json
schema_of_csv size array_min array_max sort_array array_sort shuffle reverse
flatten map_keys map_values map_entries map_from_entries array_repeat
arrays_zip map_concat sequence from_csv
```

@HyukjinKwon you mentioned `from_json()`, but that's not a true exception. The schema argument is not meant to be an actual column; it's a schema specification, so it's a completely different case. That same function also takes an actual column to operate on, which it explicitly accepts as a name or instance.
So there's absolutely no problem there.

As for the DataFrame methods in `dataframe.py`, these also take column names or instances and do explicit conversion:

```
sampleBy sort select groupBy rollup cube dropna fillna replace drop toDF
```

I was even a little surprised by `drop()`; I thought it would only take names, but it handles both cases. **EDIT: Sorry, I double checked: it does support both, but only when you give a single argument. If you drop more than one column, it enforces a check for strings, because there is no `Seq[Column]` overload on the Scala side.**

The following functions are also 100% compliant with both API styles, but they rely on the JVM backend to do the conversion:

```
join sortWithinPartitions describe __getitem__ __getattr__ dropDuplicates
```

This is something that should be kept in mind should the Scala API change.

The following functions are violations; they only take **strings**:

```
approxQuantile corr cov crosstab freqItems
```

These are exceptions because they're implemented in `org.apache.spark.sql.DataFrameStatFunctions`, where for some reason almost all functions only take name arguments. The PySpark side never converts from a Column object to a string, so it also only supports the name spec. This is something that should be fixed on the Scala side IMO.

Since this patch is about making _name_ support universal, however, I don't think it's in scope to do anything about the `drop()` and stats violations for now.

One other special case I should mention is `colRegex`, which also only takes strings. But in that case it just wouldn't make sense to take Column objects (how would you specify a regex that way?), so that is fine.

With these checks in place, I'm confident the list of exceptions in this PR's description **is** exhaustive. I will work on solving those now as agreed (by redefining the `_create_function` argument passing strategy to do explicit conversion).
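The fix mentioned at the end, redefining the `_create_function` argument passing strategy to do explicit conversion, might look roughly like this. This is a hypothetical sketch with stand-in classes, not the actual patch or PySpark's real `_create_function`:

```python
# Hypothetical sketch: a _create_function factory whose wrappers convert
# a name argument to a Column before delegating. Stand-in classes only.

class Column:
    """Stand-in for pyspark.sql.Column."""
    def __init__(self, name):
        self.name = name


def _to_column(col):
    """Normalize a str-or-Column argument to a Column."""
    return col if isinstance(col, Column) else Column(col)


def _create_function(name):
    """Build a SQL-function wrapper that accepts a Column or a name."""
    def _(col):
        jc = _to_column(col)  # explicit conversion happens here
        # Stand-in for invoking the JVM-side function on the column:
        return Column(f"{name}({jc.name})")
    _.__name__ = name
    return _


# Auto-generated functions now accept both argument styles:
isnan = _create_function("isnan")
assert isnan("x").name == isnan(Column("x")).name == "isnan(x)"
```

With conversion centralized in the factory, every auto-defined function gets name support without touching each definition individually.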
