[
https://issues.apache.org/jira/browse/SPARK-32614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17179097#comment-17179097
]
chanduhawk commented on SPARK-32614:
------------------------------------
[~srowen]
I already raised a PR against the univocity parser, and it has been merged. It
adds an option that controls whether comment characters are processed or not:
https://github.com/uniVocity/univocity-parsers/pull/412
In Spark we need to add the corresponding option in the CSVOptions and CSVUtils
classes.
Proposed usage in Spark:
val df =
spark.read.option("delimiter", ",").option("processComments", "false").csv("file:/E:/Data/Testdata.dat")
When this *processComments* option is set to false, Spark should not check for
any comment characters and should process all rows, even those starting with
the null character or any other configured comment character.
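For illustration, here is a minimal sketch of the univocity flag itself,
runnable as a standalone Scala script. The setter name
setCommentProcessingEnabled is my reading of what PR 412 added to
CommonParserSettings, and the Spark option name processComments exists only in
this proposal:
{code:scala}
import java.io.StringReader
import com.univocity.parsers.csv.{CsvParser, CsvParserSettings}

val settings = new CsvParserSettings()
// Assumption: PR 412 exposes comment processing as a boolean toggle.
// false => lines starting with the comment character ('#' by default)
// are parsed as ordinary records instead of being skipped.
settings.setCommentProcessingEnabled(false)

val parser = new CsvParser(settings)
val rows = parser.parseAll(new StringReader("#a,b\nc,d\n"))
rows.forEach(r => println(r.mkString(",")))  // prints both lines
{code}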
Please let me know if I can raise a PR for this once the enhancement is
accepted.
> Support for treating a line as a valid record if it starts with \u0000 (the
> null character) or with any character configured as the comment character
> -------------------------------------------------------------------------------------------------------------------------------------------
>
> Key: SPARK-32614
> URL: https://issues.apache.org/jira/browse/SPARK-32614
> Project: Spark
> Issue Type: Bug
> Components: Spark Core, SQL
> Affects Versions: 2.4.5, 3.0.0
> Reporter: chanduhawk
> Assignee: Jeff Evans
> Priority: Major
> Attachments: screenshot-1.png
>
>
> In most data warehousing scenarios, files do not contain comment records, and
> every line needs to be treated as a valid record even if it starts with the
> default comment character, \u0000 (the null character). Though the user can
> set a comment character other than \u0000, there is a chance that an actual
> record starts with that character.
> Currently, for the piece of code below and test data whose first row starts
> with the null character \u0000, Spark throws the error shown below.
> e.g.:
> val df = spark.read.option("delimiter", ",").csv("file:/E:/Data/Testdata.dat")
> df.show(false)
> *+TestData+*
>
> !screenshot-1.png!
> Internal state when error was thrown: line=1, column=0, record=0, charIndex=7
>   at com.univocity.parsers.common.AbstractParser.handleException(AbstractParser.java:339)
>   at com.univocity.parsers.common.AbstractParser.parseLine(AbstractParser.java:552)
>   at org.apache.spark.sql.execution.datasources.csv.TextInputCSVDataSource$.inferFromDataset(CSVDataSource.scala:160)
>   at org.apache.spark.sql.execution.datasources.csv.TextInputCSVDataSource$.infer(CSVDataSource.scala:148)
>   at org.apache.spark.sql.execution.datasources.csv.CSVDataSource.inferSchema(CSVDataSource.scala:62)
>   at org.apache.spark.sql.execution.datasources.csv.CSVFileFormat.inferSchema(CSVFileFormat.scala:57)
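> For anyone reproducing this without the attached file, here is a hypothetical
> spark-shell session; the data is made up for illustration, and only its first
> byte, \u0000, matters:
> {code:scala}
> import java.nio.file.Files
> // Write a small CSV whose first line starts with the null character.
> val path = Files.createTempFile("testdata", ".csv")
> Files.write(path, "\u0000a,1\nb,2\n".getBytes("UTF-8"))
>
> // Per this report, this fails during schema inference with the
> // TextParsingException shown above.
> val df = spark.read.option("delimiter", ",").csv(path.toString)
> df.show(false)
> {code}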
> *Note:*
> Though this is a limitation of the univocity parser, and the workaround is to
> set some other comment character via .option("comment", "#"), if the actual
> data starts with that character then that particular row will be discarded.
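> To make that concrete, here is a hypothetical spark-shell session (data made
> up for illustration):
> {code:scala}
> import java.nio.file.Files
> val path = Files.createTempFile("data", ".csv")
> // The first line is a real record that happens to start with '#'.
> Files.write(path, "#hashtag,1\nplain,2\n".getBytes("UTF-8"))
>
> // With '#' configured as the comment character, the first row is
> // silently treated as a comment and dropped.
> val df = spark.read.option("delimiter", ",").option("comment", "#").csv(path.toString)
> df.show(false)  // only "plain,2" survives
> {code}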
> I have already pushed code to the univocity parser to handle this scenario,
> as part of the PR below:
> https://github.com/uniVocity/univocity-parsers/pull/412
> Please accept this JIRA so that we can enable the feature in spark-csv by
> adding a parameter to Spark's CSVOptions.
>