[ https://issues.apache.org/jira/browse/SPARK-17168?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15429435#comment-15429435 ]
Takeshi Yamamuro commented on SPARK-17168:
------------------------------------------

Why is having a header in each partition error-prone? Seems this is intuitive to me.
cc: [~hyukjin.kwon]

> CSV with header is incorrectly read if file is partitioned
> ----------------------------------------------------------
>
>                 Key: SPARK-17168
>                 URL: https://issues.apache.org/jira/browse/SPARK-17168
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.0.0
>            Reporter: Mathieu D
>            Priority: Minor
>
> If a CSV file is stored in a partitioned fashion, the DataframeReader.csv
> with option header set to true skips the first line of *each partition*
> instead of skipping only the first one.
> ex:
> {code}
> // create a partitioned CSV file with header :
> val rdd=sc.parallelize(Seq("hdr","1","2","3","4","5","6"), numSlices=2)
> rdd.saveAsTextFile("foo")
> {code}
> Now, if we try to read it with DataframeReader, the first row of the 2nd
> partition is skipped.
> {code}
> val df = spark.read.option("header","true").csv("foo")
> df.show
> +---+
> |hdr|
> +---+
> |  1|
> |  2|
> |  4|
> |  5|
> |  6|
> +---+
> // one row is missing
> {code}
> I more or less understand that this is to be consistent with the save
> operation of dataframewriter which saves header on each individual partition.
> But this is very error-prone. In our case, we have large CSV files with
> headers already stored in a partitioned way, so we will lose rows if we read
> with header set to true. So we have to manually handle the headers.
> I suggest a tri-valued option for header, with something like
> "skipOnFirstPartition"
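
For reference, the manual header handling mentioned in the report could look roughly like the sketch below. It is only an illustration, not something proposed in the issue: the helper name dropHeaderOnFirstPartition is hypothetical, and the sketch assumes the header appears once, as the very first line of partition 0. The two partitions are simulated locally with the same split that the report's parallelize call produces ("hdr","1","2" and "3","4","5","6"), so no Spark session is needed to run it.

```scala
// Hypothetical helper (not part of Spark): drop the first line of
// partition 0 only and pass every other partition through untouched.
// Assumes the header is the first line of partition 0.
def dropHeaderOnFirstPartition(index: Int, lines: Iterator[String]): Iterator[String] =
  if (index == 0) lines.drop(1) else lines

// Local stand-in for the two part files written in the report.
val partitions = Seq(Seq("hdr", "1", "2"), Seq("3", "4", "5", "6"))

// Apply the helper per partition, the way a per-partition transformation would.
val rows = partitions.zipWithIndex.flatMap { case (part, idx) =>
  dropHeaderOnFirstPartition(idx, part.iterator).toList
}

println(rows.mkString(","))  // prints 1,2,3,4,5,6
```

In a spark-shell session the same helper should plug into the RDD API as `sc.textFile("foo").mapPartitionsWithIndex(dropHeaderOnFirstPartition _)`, again under the assumption that the first partition starts at the header line.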