koertkuipers commented on pull request #29516: URL: https://github.com/apache/spark/pull/29516#issuecomment-791135152
> Isn't that expected? Or can you set the comment char to something else?
>
> On Thu, Mar 4, 2021 at 5:41 PM koertkuipers wrote:
> > this has unintended side effect of now dropping rows that start with # we ran into this because we had comments disabled but we noticed that rows in a csv that start with # were dropped

if comment is not set in spark csv (isCommentSet is false) then univocity should process with the comment feature disabled. per the univocity documentation the way to do this is to set comment to `\0`. i realize this seems to not work exactly as desired, although i do not yet fully grasp how or why.

but what we are doing now instead is: if comment is not set in spark (isCommentSet is false) then we leave the default comment in univocity, which is `#`. that is not the same as unsetting/disabling the comment feature. i feel like this might be confusing and may also have unintended consequences.

i am still unsure how this is impacting us, but what i see is that when we disable the comment feature in spark csv, univocity quotes values that start with `#` upon writing. since we are generating bar-delimited output for systems that do not support quotes, this causes trouble for us. we had actually disabled quote (which sets it to `\0`), leading to `#` becoming `\0#\0` upon writing, which older versions of spark then treated as a comment line and dropped! so i spent a few days going down the rabbit hole of trying to understand why we were losing records... it came down to this change in behavior here.

you are right that for my particular issue this can be fixed by explicitly setting comment to something else (as long as it's not `\0`).
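The sequence described above can be sketched with a toy model in plain Python. This is not Spark or univocity code: `write_row` and `read_lines` are hypothetical helpers, and the rules they apply (the writer wraps any value starting with the comment char in the quote char, the reader drops any line starting with the comment char) are simplifying assumptions taken only from the behavior reported in this comment.

```python
NUL = "\0"  # the value the comment describes as meaning "disabled"


def write_row(fields, sep="|", quote=NUL, comment="#"):
    """Toy writer (assumption): wrap values that start with the
    comment char in the quote char, then join with the separator."""
    out = []
    for value in fields:
        if value.startswith(comment):
            value = f"{quote}{value}{quote}"
        out.append(value)
    return sep.join(out)


def read_lines(lines, comment):
    """Toy reader (assumption): drop every line that starts with
    the comment char, keep everything else."""
    return [line for line in lines if not line.startswith(comment)]


rows = [["#", "a"], ["b", "c"]]

# Writer keeps '#' as its comment char while quote is '\0':
# the value '#' is written as '\0#\0'.
written = [write_row(r) for r in rows]

# A reader whose comment char is '\0' then sees a line starting with '\0'
# and drops it.
survivors_old = read_lines(written, comment=NUL)

# Workaround from the thread: set comment explicitly to some other char
# ('~' is an arbitrary choice for illustration).
written_explicit = [write_row(r, comment="~") for r in rows]
survivors_explicit = read_lines(written_explicit, comment="~")
```

In this model the first row is lost on the `\0` path and kept once the comment char is set to something that never appears at the start of a value, which matches the fix agreed on in the thread.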
