[
https://issues.apache.org/jira/browse/SPARK-17168?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Hyukjin Kwon resolved SPARK-17168.
----------------------------------
Resolution: Incomplete
> CSV with header is incorrectly read if file is partitioned
> ----------------------------------------------------------
>
> Key: SPARK-17168
> URL: https://issues.apache.org/jira/browse/SPARK-17168
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.0.0
> Reporter: Mathieu DESPRIEE
> Priority: Minor
> Labels: bulk-closed
>
> If a CSV file is stored in a partitioned fashion, DataFrameReader.csv
> with the header option set to true skips the first line of *each partition*
> instead of skipping only the first line of the first partition.
> Example:
> {code}
> // create a partitioned CSV file with a header:
> val rdd = sc.parallelize(Seq("hdr", "1", "2", "3", "4", "5", "6"), numSlices = 2)
> rdd.saveAsTextFile("foo")
> {code}
> Now, if we try to read it with DataFrameReader, the first row of the 2nd
> partition is skipped.
> {code}
> val df = spark.read.option("header","true").csv("foo")
> df.show
> +---+
> |hdr|
> +---+
> |  1|
> |  2|
> |  4|
> |  5|
> |  6|
> +---+
> // one row ("3") is missing
> {code}
> I more or less understand that this is meant to be consistent with the save
> operation of DataFrameWriter, which writes a header in each individual
> partition. But this is very error-prone. In our case, we have large CSV files
> with headers that are already stored in a partitioned way, so we lose rows if
> we read them with header set to true, and we have to handle the headers
> manually.
> I suggest a tri-valued option for header, with something like
> "skipOnFirstPartition".
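
For illustration, the difference between the reported behaviour and the suggested "skipOnFirstPartition" can be sketched in plain Scala, without Spark. This is only a model under stated assumptions, not Spark's implementation: the object name, the two function names and the split of the seven lines across the two part files are hypothetical, with the split chosen to match the output shown in the issue.

```scala
// Hypothetical model of the part files written by saveAsTextFile("foo"):
// each inner Seq stands for the lines of one part file, in order.
object HeaderSkipSketch {
  val parts: Seq[Seq[String]] = Seq(
    Seq("hdr", "1", "2"),    // assumed content of part-00000
    Seq("3", "4", "5", "6")  // assumed content of part-00001
  )

  // Behaviour reported in the issue: the first line of every part is dropped.
  def skipOnEachPartition(parts: Seq[Seq[String]]): Seq[String] =
    parts.flatMap(_.drop(1))

  // Suggested behaviour: only the first part loses its first line.
  def skipOnFirstPartition(parts: Seq[Seq[String]]): Seq[String] =
    parts.zipWithIndex.flatMap { case (lines, idx) =>
      if (idx == 0) lines.drop(1) else lines
    }

  def main(args: Array[String]): Unit = {
    println(skipOnEachPartition(parts).mkString(","))  // prints 1,2,4,5,6
    println(skipOnFirstPartition(parts).mkString(",")) // prints 1,2,3,4,5,6
  }
}
```

In this model, dropping the first line of every part loses the row "3", as in the output above, while dropping it only in the first part keeps all six data rows.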