[ https://issues.apache.org/jira/browse/SPARK-38292?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17557537#comment-17557537 ]
pralabhkumar commented on SPARK-38292:
--------------------------------------

[~hyukjin.kwon] Thanks for the suggestion. After going through the code (DataFrameReader and the Univocity Spark parser code), here is the analysis.

Example: the input row A,,B with spark.read.option("nullValue", "A") results in null, null, B.

The reason for this is:
 * The _parse_ method in org.apache.spark.sql.catalyst.csv.UnivocityParser is invoked.
 * The parser first turns the string A,,B into A, A, B (settings.setNullValue in com.univocity.parsers.csv.CsvParser replaces the empty field with "A").
 * nullSafeDatum then checks if (datum == options.nullValue || datum == null) and returns null for both values, since each datum equals options.nullValue, giving null, null, B.
 * It is not clear this is the expected output, since from the com.univocity.parsers.csv.CsvParser point of view the expected output after .setNullValue("A") should be A, A, B.

*Solution*

To support na_filter, the idea is to add one condition: if ((na_filter && datum == options.nullValue) || datum == null). Now if the input string is A,,B and the user has set na_filter to False, com.univocity.parsers.csv.CsvParser will return the row as-is, since setNullValue stays at "". The condition (na_filter && datum == options.nullValue) then becomes false, and converter.apply(datum) leaves the value as is.

> Support `na_filter` for pyspark.pandas.read_csv
> -----------------------------------------------
>
>                 Key: SPARK-38292
>                 URL: https://issues.apache.org/jira/browse/SPARK-38292
>             Project: Spark
>          Issue Type: Improvement
>          Components: PySpark
>    Affects Versions: 3.3.0
>            Reporter: Haejoon Lee
>            Priority: Major
>
> pandas supports the `na_filter` parameter for the `read_csv` function.
> (https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html)
> We also want to support this to follow the behavior of pandas.
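The two-stage null handling described in the comment above can be sketched in Python. This is an illustrative model, not Spark's actual code: `univocity_parse` and `null_safe_datum` are hypothetical stand-ins for the Univocity parser's setNullValue substitution and UnivocityParser's nullSafeDatum check, with the proposed na_filter guard added.

```python
def univocity_parse(line, null_value):
    """Stage 1 (models com.univocity CsvParser): empty fields are
    replaced with the configured nullValue string."""
    return [f if f != "" else null_value for f in line.split(",")]

def null_safe_datum(datum, null_value, na_filter=True):
    """Stage 2 (models UnivocityParser.nullSafeDatum) with the
    proposed guard: only null out the datum when na_filter is on."""
    if (na_filter and datum == null_value) or datum is None:
        return None
    return datum  # converter.apply(datum) would run here in Spark

def read_row(line, null_value, na_filter=True):
    # With na_filter=False, setNullValue is left at "" so stage 1 is a no-op.
    fields = univocity_parse(line, null_value if na_filter else "")
    return [null_safe_datum(f, null_value, na_filter) for f in fields]

# Current behavior: the literal "A" and the empty field both become null.
print(read_row("A,,B", null_value="A"))                   # [None, None, 'B']

# Proposed behavior with na_filter=False: values pass through untouched.
print(read_row("A,,B", null_value="A", na_filter=False))  # ['A', '', 'B']
```

This makes the problem visible: because stage 1 has already rewritten the empty field to "A", stage 2 cannot distinguish a real "A" from a missing value, so both are nulled; the na_filter flag short-circuits stage 2 entirely.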
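For reference, the pandas behavior the issue asks to mirror: with `na_filter=False`, `pandas.read_csv` skips missing-value detection entirely, so empty fields come back as empty strings rather than NaN.

```python
import io
import pandas as pd

csv = "c0,c1,c2\nA,,B\n"

with_filter = pd.read_csv(io.StringIO(csv))                  # default na_filter=True
without_filter = pd.read_csv(io.StringIO(csv), na_filter=False)

print(with_filter.loc[0, "c1"])     # NaN: empty field treated as missing
print(without_filter.loc[0, "c1"])  # '': empty field kept as a plain string
```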
-- This message was sent by Atlassian Jira (v8.20.7#820007) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org