[jira] [Commented] (SPARK-38292) Support `na_filter` for pyspark.pandas.read_csv

pralabhkumar (Jira) Wed, 22 Jun 2022 08:46:10 -0700


    [ 
https://issues.apache.org/jira/browse/SPARK-38292?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17557537#comment-17557537
 ]


pralabhkumar commented on SPARK-38292:
--------------------------------------

[~hyukjin.kwon] Thanks for the suggestion.

 

After going through the code (DataFrameReader and Spark's Univocity parser
code), here is my analysis.

Example input: A,,B

A,,B ==> spark.read.option("nullValue", "A") ==> results in null, null, B

The reason for this is:
 * The parse method in org.apache.spark.sql.catalyst.csv.UnivocityParser
 * The string is parsed into A, A, B (settings.setNullValue in
com.univocity.parsers.csv.CsvParser replaces the empty ,, value with A)
 * nullSafeDatum then checks if (datum == options.nullValue || datum == null)
and returns null for both values, since datum == options.nullValue
=> null, null, B
 * I am not sure whether this is the expected output, since from the
com.univocity.parsers.csv.CsvParser point of view the expected output should be
"A,A,B" after calling .setNullValue("A")
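
The steps above can be modelled with a small pure-Python sketch. This is not Spark or univocity code: tokenize and null_safe_datum are hypothetical stand-ins for CsvParser (with setNullValue applied) and for the nullSafeDatum check, written only to reproduce the behaviour described in this comment.

```python
from typing import List, Optional


def tokenize(line: str, null_value: str) -> List[str]:
    """Hypothetical stand-in for CsvParser after setNullValue(null_value):
    an empty field is replaced by null_value."""
    return [field if field != "" else null_value for field in line.split(",")]


def null_safe_datum(datum: Optional[str], null_value: str) -> Optional[str]:
    """Models the check: datum == options.nullValue || datum == null."""
    if datum == null_value or datum is None:
        return None
    return datum  # stands in for converter.apply(datum)


def parse(line: str, null_value: str) -> List[Optional[str]]:
    return [null_safe_datum(t, null_value) for t in tokenize(line, null_value)]


print(tokenize("A,,B", "A"))  # ['A', 'A', 'B']
print(parse("A,,B", "A"))     # [None, None, 'B']
```

Both the literal A and the empty field end up as null, because the empty field has already been rewritten to A before the null check runs.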

 

*Solution*

For na_filter, what I am thinking is to add one property to the check:
if ((na_filter && datum == options.nullValue) || datum == null)

Now, if the input string is A,,B and the user has set na_filter to False, then
com.univocity.parsers.csv.CsvParser will return the values as they are, since
setNullValue is ("").

Then the (na_filter && datum == options.nullValue) condition becomes false and
converter.apply(datum) is called, which leaves the value as it is.
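
A minimal pure-Python sketch of the proposed condition, assuming nullValue is "" as in the scenario above. The names tokenize, null_safe_datum and parse are hypothetical stand-ins for illustration, not the actual Spark implementation, and na_filter is modelled as a plain boolean argument.

```python
from typing import List, Optional


def tokenize(line: str, null_value: str) -> List[str]:
    """Hypothetical stand-in for CsvParser after setNullValue(null_value)."""
    return [field if field != "" else null_value for field in line.split(",")]


def null_safe_datum(datum: Optional[str], null_value: str,
                    na_filter: bool) -> Optional[str]:
    """Models: (na_filter && datum == options.nullValue) || datum == null."""
    if (na_filter and datum == null_value) or datum is None:
        return None
    return datum  # stands in for converter.apply(datum)


def parse(line: str, null_value: str = "",
          na_filter: bool = True) -> List[Optional[str]]:
    tokens = tokenize(line, null_value)
    return [null_safe_datum(t, null_value, na_filter) for t in tokens]


print(parse("A,,B", na_filter=True))   # ['A', None, 'B']
print(parse("A,,B", na_filter=False))  # ['A', '', 'B']
```

With na_filter set to False the empty field is passed through unchanged, while a datum that is already null is still returned as null.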

> Support `na_filter` for pyspark.pandas.read_csv
> -----------------------------------------------
>
>                 Key: SPARK-38292
>                 URL: https://issues.apache.org/jira/browse/SPARK-38292
>             Project: Spark
>          Issue Type: Improvement
>          Components: PySpark
>    Affects Versions: 3.3.0
>            Reporter: Haejoon Lee
>            Priority: Major
>
> pandas supports the `na_filter` parameter for the `read_csv` function.
> (https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html)
> We also want to support this to follow the behavior of pandas.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

