GitHub user HyukjinKwon commented on the issue:
https://github.com/apache/spark/pull/20849
> The PR doesn't solve any practical use cases
It does. It allows many workarounds; for example, we can intentionally add
a custom delimiter so that it can support multiple-line-ish JSONs:
```
{
"a": 1
}
|^|
{
"b": 2
}
```
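Roughly, reading such a file could look like the sketch below (a minimal,
hedged example: the path is hypothetical, and it assumes a Spark version
whose JSON reader accepts a `lineSep` option):
```python
# A minimal sketch, assuming a Spark version whose JSON reader takes `lineSep`;
# the file path is hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = (spark.read
      .option("lineSep", "|^|")              # use "|^|" as the record separator
      .json("/tmp/delimited_records.json"))  # hypothetical path to the sample above
df.show()
```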
Go and google CSV's case too.
> `encoding`? Only as an alias for `charset`.
Yes, `encoding`. It takes priority over `charset`. See `CSVOptions`.
Also, that's what we use in PySpark's CSV, isn't it?
https://github.com/apache/spark/blob/a9350d7095b79c8374fb4a06fd3f1a1a67615f6f/python/pyspark/sql/readwriter.py#L333
Shall we expose `encoding` and add an alias for `charset`?
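For reference, this is roughly how `encoding` is already used on the CSV
side in PySpark (a hedged sketch; the path is hypothetical, and `charset`
is the alias handled in `CSVOptions`):
```python
# A minimal sketch of the existing CSV `encoding` option in PySpark;
# the path is hypothetical. `charset` is accepted as an alias, with
# `encoding` taking priority when both are set.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = (spark.read
      .option("header", "true")
      .option("encoding", "ISO-8859-1")   # a.k.a. charset
      .csv("/tmp/latin1_data.csv"))       # hypothetical path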
> I would definitely discuss how are you going to extend lineSep in your
PR: #20877 in the future to support Json Streaming for example. If you don't
have such vision, I would prefer to block your PR.
Why are you dragging an orthogonal thing into #20877? I don't think we
would fail to make a decision on the flexible option; I guess we have plenty
of time until 2.4.0.
Even if we fail to make a decision on the flexible option, we can expose
another option that provides that flexibility and forces `lineSep` to be unset,
can't we?
Is this flexible option also a part of your public release?