[ https://issues.apache.org/jira/browse/SPARK-17168?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15429947#comment-15429947 ]
Hyukjin Kwon commented on SPARK-17168:
--------------------------------------

Thanks for cc'ing me [~maropu]! I also tend to agree that CSV files are not usually produced this way, but it seems we may need an option for handling it. (BTW, to be clear, it would be per-file, not per-partition, since we reduce the number of partitions via optimization, e.g. https://github.com/apache/spark/pull/12095.)

In more detail, if my understanding is correct, each file is a complete, self-contained unit for every other data source. For example, I added a file extension to each part-file with this argument: https://github.com/apache/spark/pull/11604. So I think it would make sense to treat the header identically in every part-file (either keep it in all of them, or in none).

On the other hand, there are similar issues in spark-csv as an external library, https://github.com/databricks/spark-csv/issues/362 and https://github.com/databricks/spark-csv/issues/317 (those are about writing, though). As far as I know, CSV support was ported into Spark to make it easier for users to get started, so if this option is helpful from the user's perspective, I think we may need it.

> CSV with header is incorrectly read if file is partitioned
> ----------------------------------------------------------
>
>                 Key: SPARK-17168
>                 URL: https://issues.apache.org/jira/browse/SPARK-17168
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.0.0
>            Reporter: Mathieu D
>            Priority: Minor
>
> If a CSV file is stored in a partitioned fashion, DataFrameReader.csv with the header option set to true skips the first line of *each partition* instead of skipping only the first one.
> Example:
> {code}
> // create a partitioned CSV file with a header:
> val rdd = sc.parallelize(Seq("hdr", "1", "2", "3", "4", "5", "6"), numSlices = 2)
> rdd.saveAsTextFile("foo")
> {code}
> Now, if we try to read it with DataFrameReader, the first row of the 2nd partition is skipped.
> {code}
> val df = spark.read.option("header", "true").csv("foo")
> df.show
> +---+
> |hdr|
> +---+
> |  1|
> |  2|
> |  4|
> |  5|
> |  6|
> +---+
> // one row is missing
> {code}
> I more or less understand that this is done to be consistent with the save operation of DataFrameWriter, which writes a header on each individual partition. But this is very error-prone. In our case, we have large CSV files with headers already stored in a partitioned way, so we will lose rows if we read them with header set to true, and we have to handle the headers manually.
> I suggest a tri-valued option for header, with something like "skipOnFirstPartition".

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
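Editor's illustration (not part of the JIRA thread): a minimal plain-Scala sketch, with no Spark dependency, of the behavior described above and of the proposed "skipOnFirstPartition" semantics. `HeaderSkipDemo`, `partitions`, and the method names are hypothetical; the data mirrors the two part-files produced by the reporter's example.

```scala
object HeaderSkipDemo {
  // Simulated part-files of the partitioned CSV from the report:
  // the header "hdr" appears only in the first part-file.
  val partitions: Seq[Seq[String]] = Seq(
    Seq("hdr", "1", "2"),
    Seq("3", "4", "5", "6")
  )

  // Current header=true behavior: the first line of *every* part-file is
  // dropped, so row "3" from the second part-file is lost.
  def skipPerPartition(parts: Seq[Seq[String]]): Seq[String] =
    parts.flatMap(_.drop(1))

  // Proposed "skipOnFirstPartition": drop only the first line of the
  // first part-file, keeping all rows of the others.
  def skipFirstOnly(parts: Seq[Seq[String]]): Seq[String] =
    parts.zipWithIndex.flatMap { case (part, i) =>
      if (i == 0) part.drop(1) else part
    }
}
```

With this data, `skipPerPartition` reproduces the lossy output shown in the report (1, 2, 4, 5, 6), while `skipFirstOnly` recovers all six rows.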