[ 
https://issues.apache.org/jira/browse/SPARK-17168?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15430258#comment-15430258
 ] 

Sean Owen commented on SPARK-17168:
-----------------------------------

It's a tough call. I can imagine for example a process ingesting lines of a 
huge CSV file and outputting them after some generic transformation. One file, 
with one header, may become many files ... of which only the first has a 
header. It's unclear whether that or having headers in every file is 'normal'.

I'm not sure it's easy to implement, but I could imagine skipping the first 
line of any file that matches the first line of the first file. 
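
A minimal sketch of that idea, assuming each part file can be modelled as an 
in-memory sequence of lines. The names HeaderSkipSketch and 
dropRepeatedHeaders are hypothetical and are not part of Spark; this only 
illustrates the matching rule, not how the CSV reader is actually implemented.

```scala
// Sketch only: models each part file as a Seq of lines; not Spark's CSV reader.
object HeaderSkipSketch {

  /** Treats the first line of the first file as the header, then drops the
    * first line of every file (including the first) whose first line equals it.
    */
  def dropRepeatedHeaders(files: Seq[Seq[String]]): Seq[String] = {
    // Header candidate: first line of the first file, if there is one.
    val header: Option[String] = files.headOption.flatMap(_.headOption)

    files.flatMap { lines =>
      if (header.isDefined && lines.headOption == header) lines.drop(1)
      else lines
    }
  }

  def main(args: Array[String]): Unit = {
    // Header only in the first part file, as in the report's example.
    val headerOnce = Seq(Seq("hdr", "1", "2"), Seq("3", "4", "5", "6"))
    println(dropRepeatedHeaders(headerOnce).mkString(","))

    // Header repeated in every part file.
    val headerEach = Seq(Seq("hdr", "1", "2"), Seq("hdr", "3", "4"))
    println(dropRepeatedHeaders(headerEach).mkString(","))
  }
}
```

Both layouts yield only data rows under this rule. The trade-off is that a 
later file whose first data row happens to equal the header would lose that 
row.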

> CSV with header is incorrectly read if file is partitioned
> ----------------------------------------------------------
>
>                 Key: SPARK-17168
>                 URL: https://issues.apache.org/jira/browse/SPARK-17168
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.0.0
>            Reporter: Mathieu D
>            Priority: Minor
>
> If a CSV file is stored in a partitioned fashion, DataFrameReader.csv 
> with the header option set to true skips the first line of *each partition* 
> instead of only the first line of the first partition.
> ex:
> {code}
> // create a partitioned CSV file with header : 
> val rdd=sc.parallelize(Seq("hdr","1","2","3","4","5","6"), numSlices=2)
> rdd.saveAsTextFile("foo")
> {code}
> Now, if we try to read it with DataFrameReader, the first row of the 2nd 
> partition is skipped.
> {code}
> val df = spark.read.option("header","true").csv("foo")
> df.show
> +---+
> |hdr|
> +---+
> |  1|
> |  2|
> |  4|
> |  5|
> |  6|
> +---+
> // one row is missing
> {code}
> I more or less understand that this is to be consistent with the save 
> operation of DataFrameWriter, which writes a header to each individual partition.
> But this is very error-prone. In our case, we have large CSV files with 
> headers already stored in a partitioned way, so we will lose rows if we read 
> with header set to true. So we have to manually handle the headers.
> I suggest a tri-valued option for header, with something like 
> "skipOnFirstPartition"


