[ https://issues.apache.org/jira/browse/SPARK-17168?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15430258#comment-15430258 ]
Sean Owen commented on SPARK-17168:
-----------------------------------

It's a tough call. I can imagine for example a process ingesting lines of a huge CSV file and outputting them after some generic transformation. One file, with one header, may become many files ... of which only the first has a header. It's unclear whether that or having headers in every file is 'normal'. I'm not sure it's easy to implement, but I could imagine skipping the first line of any file that matches the first line of the first file.

> CSV with header is incorrectly read if file is partitioned
> ----------------------------------------------------------
>
>                 Key: SPARK-17168
>                 URL: https://issues.apache.org/jira/browse/SPARK-17168
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.0.0
>            Reporter: Mathieu D
>            Priority: Minor
>
> If a CSV file is stored in a partitioned fashion, the DataframeReader.csv
> with option header set to true skips the first line of *each partition*
> instead of skipping only the first one.
> ex:
> {code}
> // create a partitioned CSV file with header :
> val rdd=sc.parallelize(Seq("hdr","1","2","3","4","5","6"), numSlices=2)
> rdd.saveAsTextFile("foo")
> {code}
> Now, if we try to read it with DataframeReader, the first row of the 2nd
> partition is skipped.
> {code}
> val df = spark.read.option("header","true").csv("foo")
> df.show
> +---+
> |hdr|
> +---+
> |  1|
> |  2|
> |  4|
> |  5|
> |  6|
> +---+
> // one row is missing
> {code}
> I more or less understand that this is to be consistent with the save
> operation of dataframewriter which saves header on each individual partition.
> But this is very error-prone. In our case, we have large CSV files with
> headers already stored in a partitioned way, so we will lose rows if we read
> with header set to true. So we have to manually handle the headers.
> I suggest a tri-valued option for header, with something like
> "skipOnFirstPartition"
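
For illustration only, the heuristic from the comment above (skip the first line of any file that matches the first line of the first file) can be sketched in plain Scala with no Spark API involved. The object and method names are hypothetical, each inner Seq stands in for the lines of one part file, and taking the header from the first non-empty part is an assumption made for this sketch. It is not how Spark implements the header option.

```scala
// Hypothetical sketch, not Spark code: each inner Seq holds the lines
// of one part file, in file order.
object HeaderSkipSketch {

  // Returns the detected header (if any) and the data rows of all parts.
  def dropRepeatedHeaders(parts: Seq[Seq[String]]): (Option[String], Seq[String]) = {
    // Assumption: the header is the first line of the first non-empty part.
    val header: Option[String] = parts.find(_.nonEmpty).map(_.head)

    val rows = parts.flatMap { lines =>
      // Drop a part's first line only when it is identical to the header;
      // parts that start with a data row are kept whole.
      if (lines.nonEmpty && header.contains(lines.head)) lines.tail else lines
    }
    (header, rows)
  }
}
```

Applied to the layout from the issue, Seq(Seq("hdr","1","2"), Seq("3","4","5","6")), this keeps rows 1 through 6, and it gives the same rows when every part repeats the header. One caveat of the heuristic: a genuine data row that is identical to the header and happens to be the first line of a later part would be dropped.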