izchen commented on a change in pull request #29862:
URL: https://github.com/apache/spark/pull/29862#discussion_r494215490
##########
File path:
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVUtils.scala
##########
@@ -93,6 +93,12 @@ object CSVUtils {
value
}
}
+ if (header.sameElements(row)) {
+ header
+ } else {
+ // Ensure that the newly generated and existing headers are not duplicated.
+ makeSafeHeader(header, caseSensitive, options)
+ }
Review comment:
Thanks for your review.
R uses `.` as the delimiter and an increasing number as the suffix, skipping any number that would produce a name already in use.
For example, with the header `a, a, a, a, a.2`:
```R
> read.csv("x.csv", header = TRUE, sep = ",")
[1] a a.1 a.3 a.4 a.2
```
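To make the R rule concrete, here is a minimal Scala sketch of that de-duplication. It is only an illustration of the behavior shown above, not R's implementation and not code from this PR; `RStyleHeaderSketch` and `makeUnique` are hypothetical names.

```scala
import scala.collection.mutable

// Hypothetical helper, not part of Spark or R: mimics the R-style rule
// "keep the first occurrence, suffix later duplicates with .N, and skip
// any N that would collide with a name already in use".
object RStyleHeaderSketch {
  def makeUnique(names: Seq[String], sep: String = "."): Seq[String] = {
    // Every original name is reserved up front, so a literal "a.2"
    // in the header blocks "a.2" from being generated for a duplicate "a".
    val taken = mutable.HashSet.empty[String] ++= names
    val seen = mutable.HashSet.empty[String]
    val counters = mutable.HashMap.empty[String, Int]

    names.map { name =>
      if (seen.add(name)) {
        name // first occurrence is kept unchanged
      } else {
        // Continue from the last suffix used for this name.
        var n = counters.getOrElse(name, 0) + 1
        while (taken.contains(s"$name$sep$n")) n += 1
        counters(name) = n
        val candidate = s"$name$sep$n"
        taken += candidate
        candidate
      }
    }
  }
}
```

With the header from the example, `RStyleHeaderSketch.makeUnique(Seq("a", "a", "a", "a", "a.2"))` yields `a, a.1, a.3, a.4, a.2`, matching the R output above.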
##########
File path:
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVUtils.scala
##########
@@ -93,6 +93,12 @@ object CSVUtils {
value
}
}
+ if (header.sameElements(row)) {
+ header
+ } else {
+ // Ensure that the newly generated and existing headers are not duplicated.
+ makeSafeHeader(header, caseSensitive, options)
+ }
Review comment:
Current behavior of Spark and R for CSV headers:
| CSV | Spark | R |
| --------------- | ----------------- | --------------- |
| `a,a,a,a` | `a0,a1,a2,a3` | `a,a.1,a.2,a.3` |
| `a,,,` | `a,_c1,_c2,_c3` | `a,X,X.1,X.2` |
| *header: false* | `_c0,_c1,_c2,_c3` | `V1,V2,V3,V4` |
Following R's behavior would introduce a user-facing change, which could break users' existing code.
It may be better to keep Spark's current behavior.
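For reference, the Spark column of the table can be reproduced with a simplified sketch like the one below. This is an approximation for discussion, not the actual `makeSafeHeader` in `CSVUtils`: it ignores `CSVOptions` (for example `nullValue`), and `SparkStyleHeaderSketch`, `safeHeader`, and `hasHeader` are hypothetical names.

```scala
// Hypothetical, simplified model of the Spark behavior in the table:
// duplicated names get the column index appended, empty names become
// _c<index>, and without a header every column is named _c<index>.
object SparkStyleHeaderSketch {
  def safeHeader(
      row: Seq[String],
      hasHeader: Boolean,
      caseSensitive: Boolean = false): Seq[String] = {
    if (!hasHeader) {
      row.indices.map(i => s"_c$i")
    } else {
      def key(s: String): String = if (caseSensitive) s else s.toLowerCase
      // Names that occur more than once (after optional case folding).
      val keys = row.filter(s => s != null && s.nonEmpty).map(key)
      val duplicates = keys.diff(keys.distinct).toSet

      row.zipWithIndex.map { case (value, i) =>
        if (value == null || value.isEmpty) s"_c$i"
        else if (duplicates.contains(key(value))) s"$value$i"
        else value
      }
    }
  }
}
```

Under these assumptions, `safeHeader(Seq("a", "a", "a", "a"), hasHeader = true)` gives `a0, a1, a2, a3` and `safeHeader(Seq("a", "", "", ""), hasHeader = true)` gives `a, _c1, _c2, _c3`, which are the names that existing user code may already depend on.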