koertkuipers commented on pull request #29516:
URL: https://github.com/apache/spark/pull/29516#issuecomment-791135152


   
   
   
   > Isn't that expected? or can you set the comment char to something else?
   > On Thu, Mar 4, 2021 at 5:41 PM koertkuipers wrote:
   > > this has unintended side effect of now dropping rows that start with #.
   > > we ran into this because we had comments disabled but we noticed that
   > > rows in a csv that start with # were dropped
   > > (https://github.com/apache/spark/pull/29516#issuecomment-791032048)
   
   if in spark csv comment is not set (isCommentSet is false) then univocity 
should process with the comment feature disabled. per the univocity 
documentation the way to do this is to set comment to `\0`. i realize this 
seems not to be working exactly as desired, although i do not yet fully grasp 
how or why.
   
   but what we are doing now instead is: if in spark comment is not set 
(isCommentSet is false) then we leave the default comment in univocity, which 
is `#`. that is not the same as unsetting/disabling the comment feature. i feel 
like this might be confusing and maybe can also have unintended consequences?
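
to make the difference concrete, here is a minimal toy sketch in plain python. it is not spark or univocity code: the function names and the `read_rows` reader are hypothetical, and the only things taken from the discussion are that `\0` is meant to disable comments and that `#` is the univocity default.

```python
from typing import List, Optional

NUL = "\0"                    # value meant to disable the comment feature
PARSER_DEFAULT_COMMENT = "#"  # univocity's default comment char, per the discussion

def comment_char_when_disabled(spark_comment: Optional[str]) -> str:
    """Hypothetical: comment unset in spark -> pass \\0 so comments are off."""
    return spark_comment if spark_comment else NUL

def comment_char_when_defaulted(spark_comment: Optional[str]) -> str:
    """Hypothetical: comment unset in spark -> keep the parser default '#'."""
    return spark_comment if spark_comment else PARSER_DEFAULT_COMMENT

def read_rows(lines: List[str], comment_char: str) -> List[str]:
    """Toy reader: drop lines starting with the comment char, unless it is \\0."""
    if comment_char == NUL:
        return list(lines)
    return [line for line in lines if not line.startswith(comment_char)]

lines = ["a|b", "#1|c"]
print(read_rows(lines, comment_char_when_disabled(None)))   # keeps both rows
print(read_rows(lines, comment_char_when_defaulted(None)))  # drops the '#1|c' row
```

in this toy model the two policies only differ when comment is unset, which is exactly the case being discussed.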
   
   i am still unsure how exactly this is impacting us, but what i see is that 
when we disable the comment feature in spark csv, univocity quotes values that 
start with `#` upon writing. since we are generating bar-delimited output for 
systems that do not support quotes, this causes trouble for us. 
   we actually had disabled quote (which sets it to `\0`), leading to `#` 
becoming `\0#\0` upon writing, which then in older versions of spark was 
considered a comment line and got dropped! so i spent a few days going down the 
rabbit hole of trying to understand why we were losing records... it came down 
to this change in behavior here.
   you are right that for my particular issue this can be fixed by explicitly 
setting comment to something else (as long as it's not `\0`).
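
the write-side effect described above can be sketched the same way. again this is a toy model in plain python, not the univocity writer: `write_field` and `write_row` are hypothetical, the quoting rule is only the behavior observed above (values starting with the comment char get wrapped in the quote char), and `\u0001` is just an arbitrary stand-in for "something else".

```python
from typing import List

NUL = "\0"  # quote disabled in spark ends up as this char, per the comment above

def write_field(value: str, quote: str, comment: str) -> str:
    """Toy rule: wrap a value in the quote char if it starts with the comment char."""
    if comment != NUL and value.startswith(comment):
        return f"{quote}{value}{quote}"
    return value

def write_row(values: List[str], sep: str = "|",
              quote: str = NUL, comment: str = "#") -> str:
    """Join fields with the separator after applying the toy quoting rule."""
    return sep.join(write_field(v, quote, comment) for v in values)

# default comment '#' with quote disabled: the value gets wrapped in \0
print(repr(write_row(["#1", "x"])))
# comment explicitly set to another (arbitrary) char: the value is left alone
print(repr(write_row(["#1", "x"], comment="\u0001")))
```

under these assumptions, a line that now starts with `\0` is what a reader using `\0` as its comment char would skip, which matches the lost records described above.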

