[ https://issues.apache.org/jira/browse/SPARK-17168?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15429947#comment-15429947 ]

Hyukjin Kwon commented on SPARK-17168:
--------------------------------------

Thanks for cc'ing me [~maropu]! I also tend to agree that CSV files are not 
usually produced this way, but it seems we may need an option for handling it. 
(BTW, to be clear, it would be per-file, not per-partition, since the number of 
partitions can be reduced via optimization, e.g. https://github.com/apache/spark/pull/12095).

In more detail, if my understanding is correct, each file is a complete, 
self-contained unit for every other data source. For example, I added the file 
extension to each part-file with this argument, here 
https://github.com/apache/spark/pull/11604. So I think it would make sense to 
keep (or not keep) the header in every part-file identically.

On the other hand, there are similar issues in spark-csv as an external 
library, https://github.com/databricks/spark-csv/issues/362 and 
https://github.com/databricks/spark-csv/issues/317 (those are about writing, 
though).
As far as I know, CSV support was ported into Spark to make it easier for users 
to get started. If this option is helpful from the users' perspective, I think 
we may need it.
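The manual header handling the reporter describes can be sketched outside Spark. This is a minimal plain-Python illustration (not the Spark API), assuming every part-file may repeat one identical header line; `read_partitioned_csv` is a hypothetical helper name, not anything in Spark:

```python
import csv
import io

def read_partitioned_csv(part_files):
    """part_files: list of CSV texts (one per part-file).

    Keeps the first header seen as the schema and drops any later
    line that repeats it verbatim; a part-file that does not start
    with the header keeps all of its lines.
    """
    header = None
    rows = []
    for text in part_files:
        lines = text.splitlines()
        if not lines:
            continue
        if header is None:
            header = lines[0]        # first header seen becomes the schema
            body = lines[1:]
        elif lines[0] == header:     # repeated header in a later part: skip it
            body = lines[1:]
        else:                        # part without a header: keep every line
            body = lines
        rows.extend(csv.reader(io.StringIO("\n".join(body))))
    if header is None:
        return [], rows
    return header.split(","), rows

# Mimics the two slices from the report below, with the header
# repeated in each part-file:
parts = ["hdr\n1\n2\n3", "hdr\n4\n5\n6"]
cols, data = read_partitioned_csv(parts)
print(cols)   # ['hdr']
print(data)   # [['1'], ['2'], ['3'], ['4'], ['5'], ['6']]
```

Note this only skips a later first line when it matches the header exactly; a data row that happens to equal the header would also be dropped, which is exactly the ambiguity an explicit option would avoid.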

> CSV with header is incorrectly read if file is partitioned
> ----------------------------------------------------------
>
>                 Key: SPARK-17168
>                 URL: https://issues.apache.org/jira/browse/SPARK-17168
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.0.0
>            Reporter: Mathieu D
>            Priority: Minor
>
> If a CSV file is stored in a partitioned fashion, DataFrameReader.csv with 
> the header option set to true skips the first line of *each partition* 
> instead of skipping only the very first one.
> ex:
> {code}
> // create a partitioned CSV file with a header:
> val rdd = sc.parallelize(Seq("hdr", "1", "2", "3", "4", "5", "6"), numSlices = 2)
> rdd.saveAsTextFile("foo")
> {code}
> Now, if we try to read it with DataFrameReader, the first row of the 2nd 
> partition is skipped.
> {code}
> val df = spark.read.option("header","true").csv("foo")
> df.show
> +---+
> |hdr|
> +---+
> |  1|
> |  2|
> |  4|
> |  5|
> |  6|
> +---+
> // one row is missing
> {code}
> I more or less understand that this is meant to be consistent with the save 
> operation of DataFrameWriter, which writes the header on each individual 
> partition. But it is very error-prone. In our case, we have large CSV files 
> with headers already stored in a partitioned way, so we lose rows if we read 
> with the header option set to true, and we have to handle the headers manually.
> I suggest a tri-valued option for header, with something like 
> "skipOnFirstPartition"



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
