[GitHub] spark issue #17177: [SPARK-19834][SQL] csv encoding/decoding using escape of...

ep1804 Mon, 06 Mar 2017 19:02:38 -0800

Github user ep1804 commented on the issue:

    https://github.com/apache/spark/pull/17177
  
    Thank you for the early and detailed response, @HyukjinKwon.
    
    1. About the purpose of the PR: yes, it is about using the 
escape-a-quote-escape option. I used the wording 'encoding/decoding' in a 
general sense, which may have been confusing.
    
    2. I will fix the indentation errors and readability problems following 
your comments.
    
    3. About the last question: I think this is a problem, caused by the 
uniVocity library not working as expected, and I would like your advice. 
Here's my experiment:
    
    ### Test code
    
    For the experiment, I made an additional option `escapeEscape`.
    
    ```scala
          import spark.implicits._  // enables the $"_c0" column syntax used below

          val df1 = spark.sqlContext.createDataFrame(List(
            (1, """AA"BB"""),      // 1 quote char (OK without escapeEscape option)
            (2, """AA\"BB"""),     // 1 escape char and 1 quote char
            (3, """AA\\"BB"""),    // 2 escape chars and 1 quote char
            (4, """AA""BB"""),     // 2 quote chars (OK without escapeEscape option)
            (5, """AA\"\"BB"""),   // (1 escape char and 1 quote char) * 2
            (6, """AA\\"\\"BB""")  // (2 escape chars and 1 quote char) * 2
          ))
    
          df1.coalesce(1).write
            .format("csv")
            .option("quote", "\"")
            .option("escape", "\\")
            .option("escapeEscape", "\\")
            .save(csvDir)
    
          val df2 = spark.read
            .format("csv")
            .option("quote", "\"")
            .option("escape", "\\")
            .option("escapeEscape", "\\")
            .load(csvDir).orderBy($"_c0")
    
    ```
    ### Firstly, I set `escapeEscape` to `\`, as documented by uniVocity.
    
    ```
    +---+----------+
    | _1|        _2|
    +---+----------+
    |  1|     AA"BB|
    |  2|    AA\"BB|
    |  3|   AA\\"BB|
    |  4|    AA""BB|
    |  5|  AA\"\"BB|
    |  6|AA\\"\\"BB|
    +---+----------+
    
    "AA\"BB"
    "AA\\"BB"
    "AA\\\"BB"
    "AA\"\"BB"
    "AA\\"\\\"BB"
    "AA\\\"\\\\\"BB"
    
    +---+---------+
    |_c0|      _c1|
    +---+---------+
    |  1|    AA"BB|
    |  2| "AA\"BB"|
    |  3|   AA\"BB|
    |  4|   AA""BB|
    |  5|  AA\\"BB|
    |  6|AA\"\\"BB|
    +---+---------+
    ```
    In the output above, the first table is df1, the second table is df2, and 
the strings between them are the raw CSV file content.
    
    df1 and df2 differ in rows 2, 3, 5 and 6. IMHO, this is not a result 
users would expect.
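    
    A minimal sketch of making this df1/df2 comparison explicit rather than 
reading the tables by eye. `RoundTripCheck` and `mismatchedIds` are 
hypothetical names, not part of the PR, and the rows are assumed to have been 
collected as (id, value) pairs:
    
    ```scala
    // Hypothetical helper (not part of the PR): report the ids whose value
    // changed, or went missing, after a CSV write/read round trip.
    object RoundTripCheck {
      def mismatchedIds(original: Seq[(Int, String)],
                        reloaded: Seq[(Int, String)]): Seq[Int] = {
        // Index the reloaded rows by id so row order does not matter.
        val reloadedById = reloaded.toMap
        original.collect {
          // Keep the id when the reloaded value is absent or different.
          case (id, value) if !reloadedById.get(id).contains(value) => id
        }
      }
    }
    ```
    
    The pairs could be obtained from either DataFrame with something like 
`df.collect().map(r => (r.get(0).toString.toInt, r.getString(1)))`; an empty 
result means the round trip preserved every value.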
    
    
    ### Secondly, I set `escapeEscape` to `A` to test what happens when that 
character already occurs in the data.
    ```
    +---+----------+
    | _1|        _2|
    +---+----------+
    |  1|     AA"BB|
    |  2|    AA\"BB|
    |  3|   AA\\"BB|
    |  4|    AA""BB|
    |  5|  AA\"\"BB|
    |  6|AA\\"\\"BB|
    +---+----------+
    
    "AA\"BB"
    "AA\\"BB"
    "AA\\\"BB"
    "AA\"\"BB"
    "AA\\"A\\"BB"
    "AA\\\"A\A\\"BB"
    
    +---+--------+
    |_c0|     _c1|
    +---+--------+
    |  1|  "\"BB"|
    |  2|    \"BB|
    |  3|   \\"BB|
    |  4|    \"BB|
    |  5|  \"\"BB|
    |  6|\\"\\"BB|
    +---+--------+
    ```
    
    Again, df1 and df2 differ: the leading `AA` is lost in every row, so 
something clearly went wrong.
    
    ### Thirdly, I tested with the `escapeEscape = quote` setting
    
    ```
    +---+----------+
    | _1|        _2|
    +---+----------+
    |  1|     AA"BB|
    |  2|    AA\"BB|
    |  3|   AA\\"BB|
    |  4|    AA""BB|
    |  5|  AA\"\"BB|
    |  6|AA\\"\\"BB|
    +---+----------+
    
    "AA\"BB"
    "AA\\"BB"
    "AA\\\"BB"
    "AA\"\"BB"
    "AA\\""\\"BB"
    "AA\\\""\"\\"BB"
    
    +---+----------+
    |_c0|       _c1|
    +---+----------+
    |  1|     AA"BB|
    |  2|    AA\"BB|
    |  3|   AA\\"BB|
    |  4|    AA""BB|
    |  5|  AA\"\"BB|
    |  6|AA\\"\\"BB|
    +---+----------+
    ```
    
    
    ### Finally, I tested the `escapeEscape = quote` case with more examples.
    
    ```scala
          val df1 = spark.sqlContext.createDataFrame(List(
            (1, """AA\"\"\"\"\"BB"""),   // (1 escape char and 1 quote char) * 5
            (2, """AA\\"\\"\\"\\"\\"BB"""),  // (2 escape chars and 1 quote char) * 5
            (3, "You are \"beautiful\""),
            (4, "Yes, \\\"inside\"\\")
          ))
    ```
    
    And the result is OK: df2 matches df1.
    
    ```
    +---+-------------------+
    | _1|                 _2|
    +---+-------------------+
    |  1|     AA\"\"\"\"\"BB|
    |  2|AA\\"\\"\\"\\"\\"BB|
    |  3|You are "beautiful"|
    |  4|    Yes, \"inside"\|
    +---+-------------------+
    
    "AA\\""\\""\\""\\""\\"BB"
    "AA\\\""\"\\""\"\\""\"\\""\"\\"BB"
    "You are \"beautiful\""
    "Yes, "\\"inside\""\"
    
    +---+-------------------+
    |_c0|                _c1|
    +---+-------------------+
    |  1|     AA\"\"\"\"\"BB|
    |  2|AA\\"\\"\\"\\"\\"BB|
    |  3|You are "beautiful"|
    |  4|    Yes, \"inside"\|
    +---+-------------------+
    ```
    
    This is why I used the `quote` character, rather than the `escape` 
character, for the `escapeEscape` parameter.
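    
    For reference, a minimal sketch of how these three characters could be set 
on a uniVocity writer. It assumes that the experimental `escapeEscape` option 
maps to uniVocity's `CsvFormat.setCharToEscapeQuoteEscaping`; the names 
`EscapeEscapeSketch` and `writerSettings` are illustrative only and are not 
part of the PR:
    
    ```scala
    import com.univocity.parsers.csv.CsvWriterSettings

    // Illustrative only: this object is not part of the PR.
    object EscapeEscapeSketch {
      // Assumption: `escapeEscape` corresponds to uniVocity's
      // "char to escape quote escaping" format setting.
      def writerSettings(quote: Char, escape: Char, escapeEscape: Char): CsvWriterSettings = {
        val settings = new CsvWriterSettings()
        val format = settings.getFormat
        format.setQuote(quote)                             // e.g. '"'
        format.setQuoteEscape(escape)                      // e.g. '\\'
        format.setCharToEscapeQuoteEscaping(escapeEscape)  // e.g. '"' as in the last two experiments
        settings
      }
    }
    ```
    
    Under that assumption, `EscapeEscapeSketch.writerSettings('"', '\\', '"')` 
would correspond to the `escapeEscape = quote` setting used in the last two 
experiments.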


