Mathieu D created SPARK-17168:
---------------------------------

             Summary: CSV with header is incorrectly read if file is partitioned
                 Key: SPARK-17168
                 URL: https://issues.apache.org/jira/browse/SPARK-17168
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 2.0.0
            Reporter: Mathieu D
            Priority: Minor


If a CSV file is stored in a partitioned fashion, DataFrameReader.csv with 
the header option set to true skips the first line of *each partition* instead 
of skipping only the first line of the first partition.

ex:
{code}
// create a partitioned CSV file with a header:
val rdd = sc.parallelize(Seq("hdr", "1", "2", "3", "4", "5", "6"), numSlices = 2)
rdd.saveAsTextFile("foo")
{code}

Now, if we try to read it back with DataFrameReader, the first row of the 2nd 
partition is skipped as well:

{code}
val df = spark.read.option("header","true").csv("foo")
df.show
+---+
|hdr|
+---+
|  1|
|  2|
|  4|
|  5|
|  6|
+---+
// row "3" is missing: it was the first line of the 2nd partition
{code}

I more or less understand that this behavior is consistent with the save 
operation of DataFrameWriter, which writes a header on each individual 
partition. But it is very error-prone: in our case we have large CSV files 
with headers, already stored in a partitioned way, so we silently lose rows if 
we read them with header set to true. We therefore have to handle the headers 
manually.
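To illustrate the manual handling mentioned above, here is a minimal sketch in plain Scala (no Spark API involved; the partition contents are just the example file above as saveAsTextFile would split it) of the semantics we actually want: drop the header from the first partition only.

```scala
object HeaderSkipSketch extends App {
  // Lines of each partition file, in order. part-00000 begins with the header.
  val partitions = Seq(
    Seq("hdr", "1", "2"),      // part-00000
    Seq("3", "4", "5", "6")    // part-00001
  )

  // Take the header from the very first line of the first partition only;
  // every other line, including the first line of later partitions, is data.
  val header = partitions.head.head
  val rows = partitions.head.tail ++ partitions.tail.flatten

  println(header)              // hdr
  println(rows.mkString(","))  // 1,2,3,4,5,6 -- no data row is lost
}
```

This is what DataFrameReader currently does NOT do: it drops the first line of part-00001 ("3") as well.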

I suggest making header a tri-valued option, with something like 
"skipOnFirstPartition" as the third value.




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
