pralabhkumar commented on PR #37009:
URL: https://github.com/apache/spark/pull/37009#issuecomment-1191680230

   > Hey, I think the fix here is too hacky. Can we make this working 
independently with other options being set?
   
   Hi @HyukjinKwon 
   
   Thanks for reviewing. I went through the code again. Here is my understanding (the same is mentioned in the JIRA):
   
   IMHO, setting the `nullValue` option will not help here. Whatever value we set, the empty field in `A,,` will be converted to that value by the (external) Univocity parser. For example, with input `A,,` and `setNullValue("B")`, the Univocity parser produces `A,B`. Spark's `nullSafeDatum` will then always convert that back to null (since `datum == options.nullValue`), so the output will always be `A,null`, whereas we need `A,,`. So IMHO setting `nullValue` will not help here unless we have an `options.naFilter` flag set to false, which would ensure the condition above is not satisfied.
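   To make the two-stage round trip concrete, here is a minimal Python sketch (not Spark's actual code) simulating the behaviour described above: the Univocity stage substitutes `nullValue` for empty tokens, and then the Spark-side `nullSafeDatum` stage maps any datum equal to `nullValue` back to null. The function names are illustrative only.

   ```python
   from typing import Optional

   def univocity_parse(line: str, null_value: str) -> list:
       """Stage 1 (Univocity-like): split the line, substituting
       null_value for every empty token."""
       return [tok if tok != "" else null_value for tok in line.split(",")]

   def null_safe_datum(datum: str, null_value: str) -> Optional[str]:
       """Stage 2 (nullSafeDatum-like): any datum equal to the configured
       nullValue is converted to null."""
       return None if datum == null_value else datum

   def parse(line: str, null_value: str) -> list:
       return [null_safe_datum(t, null_value)
               for t in univocity_parse(line, null_value)]

   # Whatever nullValue we choose, the empty fields always end up as null,
   # never as empty strings:
   print(parse("A,,", null_value="B"))  # ['A', None, None]
   print(parse("A,,", null_value="X"))  # ['A', None, None]
   ```

   This illustrates why tuning `nullValue` alone cannot preserve `A,,`: stage 2 undoes whatever substitution stage 1 performed.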
   
   Now, in the case of missing values at the beginning or end of a line, the current logic in the `convert` method of `UnivocityParser` is to go into the exception path and fill in the default value via `row.update(i, requiredSchema.existenceDefaultValues(i))`. We don't want those values to be set to null when `options.naFilter` is false.
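   A hedged sketch of that short-row path, under the proposed flag (again a simulation, not Spark's code; `na_filter` stands in for the `options.naFilter` flag proposed in the JIRA and is not an existing Spark option):

   ```python
   def convert(tokens, num_fields, existence_defaults, na_filter=True):
       """Pad a token list that is shorter than the schema up to num_fields.

       With na_filter=True (current behaviour), missing trailing fields are
       filled from the schema's existence default values (often null).
       With na_filter=False (proposed), the raw empty field is preserved.
       """
       row = list(tokens)
       for i in range(len(tokens), num_fields):
           if na_filter:
               row.append(existence_defaults[i])  # current: default / null
           else:
               row.append("")                     # proposed: keep empty field
       return row

   defaults = [None, None, None]
   print(convert(["A"], 3, defaults))                   # ['A', None, None]
   print(convert(["A"], 3, defaults, na_filter=False))  # ['A', '', '']
   ```

   The point is that the fix has to sit in this fallback path as well, not only in `nullSafeDatum`, or lines with missing leading/trailing fields would still come back as null.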
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

