Github user ep1804 commented on the issue:
https://github.com/apache/spark/pull/17177
Thank you for the early and detailed response, @HyukjinKwon.
1. About the purpose of the PR: yes, it is about using the escape-a-quote-escape
option. I used the wording 'encoding/decoding' in a general sense, which
may have been confusing.
2. The indentation errors and readability problems will be fixed following
your comments.
3. About the last question: I think this is a problem, caused by the uniVocity
library not working as expected, and I would like your advice. Here is
my experiment:
### Test code
For the experiment, I added an extra option, `escapeEscape`.
```scala
val df1 = spark.sqlContext.createDataFrame(List(
  (1, """AA"BB"""),        // 1 quote char (OK without escapeEscape option)
  (2, """AA\"BB"""),       // 1 escape char and 1 quote char
  (3, """AA\\"BB"""),      // 2 escape chars and 1 quote char
  (4, """AA""BB"""),       // 2 quote chars (OK without escapeEscape option)
  (5, """AA\"\"BB"""),     // (1 escape char and 1 quote char) * 2
  (6, """AA\\"\\"BB""")    // (2 escape chars and 1 quote char) * 2
))
df1.coalesce(1).write
.format("csv")
.option("quote", "\"")
.option("escape", "\\")
.option("escapeEscape", "\\")
.save(csvDir)
val df2 = spark.read
.format("csv")
.option("quote", "\"")
.option("escape", "\\")
.option("escapeEscape", "\\")
.load(csvDir).orderBy($"_c0")
```
### Firstly, I set `escapeEscape` to `\`, as documented by uniVocity.
```
+---+----------+
| _1| _2|
+---+----------+
| 1| AA"BB|
| 2| AA\"BB|
| 3| AA\\"BB|
| 4| AA""BB|
| 5| AA\"\"BB|
| 6|AA\\"\\"BB|
+---+----------+
"AA\"BB"
"AA\\"BB"
"AA\\\"BB"
"AA\"\"BB"
"AA\\"\\\"BB"
"AA\\\"\\\\\"BB"
+---+---------+
|_c0| _c1|
+---+---------+
| 1| AA"BB|
| 2| "AA\"BB"|
| 3| AA\"BB|
| 4| AA""BB|
| 5| AA\\"BB|
| 6|AA\"\\"BB|
+---+---------+
```
In the result, the first table is df1, the second table is df2, and the strings
between them are the raw CSV file content.
df1 and df2 differ: rows 2, 3, 5, and 6 do not survive the round trip. IMHO,
this is not the result users would expect.
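The comparison above was done by eye. As a minimal sketch of doing the same check programmatically, the plain-Scala snippet below takes the written and read-back values, copied from the two tables above, and lists the ids that do not round-trip. The object name `RoundTripCheck` and the helper `roundTripMismatches` are hypothetical names introduced for illustration; they are not part of the PR or of Spark.

```scala
object RoundTripCheck {
  // Hypothetical helper: ids whose read-back value is missing or differs
  // from the value that was originally written.
  def roundTripMismatches(written: Seq[(Int, String)],
                          readBack: Seq[(Int, String)]): Seq[Int] = {
    val readById = readBack.toMap
    written.collect {
      case (id, value) if !readById.get(id).contains(value) => id
    }
  }

  // df1 values, copied from the first table above.
  val written: Seq[(Int, String)] = Seq(
    1 -> """AA"BB""",
    2 -> """AA\"BB""",
    3 -> """AA\\"BB""",
    4 -> """AA""BB""",
    5 -> """AA\"\"BB""",
    6 -> """AA\\"\\"BB"""
  )

  // df2 values, copied from the second table above.
  val readBack: Seq[(Int, String)] = Seq(
    1 -> """AA"BB""",
    2 -> "\"AA\\\"BB\"",   // the value "AA\"BB", including its surrounding quotes
    3 -> """AA\"BB""",
    4 -> """AA""BB""",
    5 -> """AA\\"BB""",
    6 -> """AA\"\\"BB"""
  )

  def main(args: Array[String]): Unit = {
    // Prints the ids that changed between write and read: 2, 3, 5, 6
    println(roundTripMismatches(written, readBack).mkString(", "))
  }
}
```

With real DataFrames, the two sequences could be obtained by collecting each side into `(id, value)` pairs before calling the helper.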
### Secondly, I set `escapeEscape` to `A` to test what happens when the escape-escape character already occurs in the data.
```
+---+----------+
| _1| _2|
+---+----------+
| 1| AA"BB|
| 2| AA\"BB|
| 3| AA\\"BB|
| 4| AA""BB|
| 5| AA\"\"BB|
| 6|AA\\"\\"BB|
+---+----------+
"AA\"BB"
"AA\\"BB"
"AA\\\"BB"
"AA\"\"BB"
"AA\\"A\\"BB"
"AA\\\"A\A\\"BB"
+---+--------+
|_c0| _c1|
+---+--------+
| 1| "\"BB"|
| 2| \"BB|
| 3| \\"BB|
| 4| \"BB|
| 5| \"\"BB|
| 6|\\"\\"BB|
+---+--------+
```
Again, df1 and df2 are different: every value read back has lost its leading
`AA`, and some rows differ further.
### Thirdly, I tested with the `escapeEscape = quote` setting
```
+---+----------+
| _1| _2|
+---+----------+
| 1| AA"BB|
| 2| AA\"BB|
| 3| AA\\"BB|
| 4| AA""BB|
| 5| AA\"\"BB|
| 6|AA\\"\\"BB|
+---+----------+
"AA\"BB"
"AA\\"BB"
"AA\\\"BB"
"AA\"\"BB"
"AA\\""\\"BB"
"AA\\\""\"\\"BB"
+---+----------+
|_c0| _c1|
+---+----------+
| 1| AA"BB|
| 2| AA\"BB|
| 3| AA\\"BB|
| 4| AA""BB|
| 5| AA\"\"BB|
| 6|AA\\"\\"BB|
+---+----------+
```
### Finally, I tested the `escapeEscape = quote` case with more examples.
```scala
val df1 = spark.sqlContext.createDataFrame(List(
  (1, """AA\"\"\"\"\"BB"""),       // (1 escape char and 1 quote char) * 5
  (2, """AA\\"\\"\\"\\"\\"BB"""),  // (2 escape chars and 1 quote char) * 5
  (3, "You are \"beautiful\""),
  (4, "Yes, \\\"inside\"\\")
))
```
And the result is OK: df1 and df2 match.
```
+---+-------------------+
| _1| _2|
+---+-------------------+
| 1| AA\"\"\"\"\"BB|
| 2|AA\\"\\"\\"\\"\\"BB|
| 3|You are "beautiful"|
| 4| Yes, \"inside"\|
+---+-------------------+
"AA\\""\\""\\""\\""\\"BB"
"AA\\\""\"\\""\"\\""\"\\""\"\\"BB"
"You are \"beautiful\""
"Yes, "\\"inside\""\"
+---+-------------------+
|_c0| _c1|
+---+-------------------+
| 1| AA\"\"\"\"\"BB|
| 2|AA\\"\\"\\"\\"\\"BB|
| 3|You are "beautiful"|
| 4| Yes, \"inside"\|
+---+-------------------+
```
This is why I used the `quote` character, rather than the `escape` character,
for the `escapeEscape` parameter.