[GitHub] [spark] izchen commented on a change in pull request #29862: [SPARK-32956][SQL] Ensure that the generated and existing headers are not duplicated in CSV DataSource

GitBox Fri, 25 Sep 2020 06:23:45 -0700


izchen commented on a change in pull request #29862:
URL: https://github.com/apache/spark/pull/29862#discussion_r494215490




##########
File path: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVUtils.scala
##########
@@ -93,6 +93,12 @@ object CSVUtils {
           value
         }
       }
+      if (header.sameElements(row)) {
+        header
+      } else {
+        // Ensure that the newly generated and existing headers are not duplicated.
+        makeSafeHeader(header, caseSensitive, options)
+      }

Review comment:
       Thanks for your review.
   
   R uses `.` as the separator and appends an increasing number as the suffix, skipping any number that would collide with an existing name.
   For example, with the header `a, a, a, a, a.2`:
   ```R
   > read.csv("x.csv", header = TRUE, sep = ",")
   [1] a   a.1 a.3 a.4 a.2
   ```
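
For comparison, the R-style renaming described above can be sketched in Scala. This is only an illustration of the rule as described in this comment, not code from the PR or from R itself; the object `RStyleHeaderSketch` and the helper `makeUnique` are hypothetical names, and the handling of empty header cells (which R renames to `X`) is left out.

```scala
import scala.collection.mutable

object RStyleHeaderSketch {
  /** Hypothetical helper: rename duplicates as `name.N` without colliding with existing names. */
  def makeUnique(names: Seq[String], sep: String = "."): Seq[String] = {
    // Every name already present in the header is reserved up front,
    // so a generated name can never clash with an existing one.
    val taken = mutable.Set[String](names: _*)
    // Names whose first occurrence has already been emitted.
    val seen = mutable.Set.empty[String]
    // Next suffix to try for each duplicated name.
    val nextSuffix = mutable.Map.empty[String, Int].withDefaultValue(1)

    names.map { name =>
      if (seen.add(name)) {
        name // the first occurrence is kept unchanged
      } else {
        var n = nextSuffix(name)
        // Skip suffixes that would produce an already-used name.
        while (taken.contains(s"$name$sep$n")) n += 1
        val candidate = s"$name$sep$n"
        taken += candidate
        nextSuffix(name) = n + 1
        candidate
      }
    }
  }
}
```

With the header from the example, `RStyleHeaderSketch.makeUnique(Seq("a", "a", "a", "a", "a.2"))` skips the suffix `2` because `a.2` is already present.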

##########
File path: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVUtils.scala
##########
@@ -93,6 +93,12 @@ object CSVUtils {
           value
         }
       }
+      if (header.sameElements(row)) {
+        header
+      } else {
+        // Ensure that the newly generated and existing headers are not duplicated.
+        makeSafeHeader(header, caseSensitive, options)
+      }

Review comment:
       Current behavior of Spark and R for CSV headers:
   
   | CSV             | SPARK             | R               |
   | --------------- | ----------------- | --------------- |
   | `a,a,a,a`       | `a0,a1,a2,a3`     | `a,a.1,a.2,a.3` |
   | `a,,,`          | `a,_c1,_c2,_c3`   | `a,X,X.1,X.2`   |
   | *header: false* | `_c0,_c1,_c2,_c3` | `V1,V2,V3,V4`   |
   
   Following R's behavior would introduce a user-facing change in the generated column names, which could break users' existing code.
   Maybe we should keep Spark's current behavior.
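
The Spark column of the table can be reproduced with a small sketch. This is an assumption-laden illustration, not the actual `makeSafeHeader` implementation: the object `SparkStyleHeaderSketch`, the helper `sparkStyleHeader`, and its `hasHeader` parameter are hypothetical, and case-insensitive matching and the `nullValue` option are ignored.

```scala
object SparkStyleHeaderSketch {
  /** Hypothetical helper that mirrors the Spark column of the table above. */
  def sparkStyleHeader(row: Seq[String], hasHeader: Boolean = true): Seq[String] = {
    if (!hasHeader) {
      // No header row: every column gets a positional default name.
      row.indices.map(i => s"_c$i")
    } else {
      val nonEmpty = row.filter(v => v != null && v.nonEmpty)
      // Names that occur more than once in the header row.
      val duplicates = nonEmpty.diff(nonEmpty.distinct).toSet
      row.zipWithIndex.map { case (value, index) =>
        if (value == null || value.isEmpty) s"_c$index" // empty cell: positional name
        else if (duplicates.contains(value)) s"$value$index" // duplicate: append the column index
        else value
      }
    }
  }
}
```

Under this index-suffix scheme, a header such as `a,a,a1` would come out as `a0,a1,a1`: the generated `a1` collides with the existing `a1`, which appears to be the kind of duplication the PR title refers to.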



