Github user MaxGekk commented on the issue:
https://github.com/apache/spark/pull/20849
@HyukjinKwon
> How about we go this way with separate PRs?
I agree with that only to unblock
https://github.com/apache/spark/pull/20849, because it solves a real problem for
our customers: reading a folder with many JSON files in UTF-16BE (without a BOM)
in multiline mode. In that case, recordDelimiter (lineSep) is not required.
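To illustrate why that use case needs an explicit charset, here is a minimal stdlib-Python sketch (an illustration only, not Spark code) showing that UTF-16BE bytes without a BOM cannot simply be decoded with the usual UTF-8 default:

```python
import json

# A small JSON document encoded as UTF-16BE; the "utf-16-be" codec
# deliberately does NOT prepend a byte order mark.
doc = '{"name": "café", "id": 1}'
raw = doc.encode("utf-16-be")

# Decoding the bytes as UTF-8 (the common default) fails outright,
# because the non-ASCII code point produces an invalid byte sequence.
try:
    json.loads(raw.decode("utf-8"))
    utf8_ok = True
except UnicodeDecodeError as e:
    utf8_ok = False
    print("utf-8 decode failed:", e.reason)

# With the charset supplied explicitly, parsing succeeds.
parsed = json.loads(raw.decode("utf-16-be"))
print(parsed["name"])
```

Without a BOM there is nothing in the file itself to announce the encoding, which is why the reader has to be told the charset up front.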
> #20877 to support line separator in json datasource
The PR doesn't solve any practical use case because it doesn't address
[JSON Streaming](https://en.wikipedia.org/wiki/JSON_streaming) or
https://github.com/apache/spark/pull/20877#issuecomment-375622342 . It is also
useless for reading JSON in a charset other than UTF-8 in per-line mode without
the PR https://github.com/apache/spark/pull/20849 . I don't know what practical
problem it actually solves. In your tests you check these delimiters:
https://github.com/apache/spark/pull/20877/files#diff-fde14032b0e6ef8086461edf79a27c5dR2112
. Are those delimiters taken from real JSON data?
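For context on what JSON Streaming implies for a record-delimiter option, here is a minimal stdlib-Python sketch (an illustration, not Spark code) of two common framings a configurable lineSep would ideally accommodate:

```python
import json

records = [{"id": 1}, {"id": 2, "note": "has \n inside"}]

# Line-delimited JSON (JSON Lines): one record per line. This is safe here
# because json.dumps (without indent) escapes newlines inside strings, so a
# serialized record never contains a raw newline.
jsonl = "\n".join(json.dumps(r) for r in records)
parsed_jsonl = [json.loads(line) for line in jsonl.split("\n")]

# RFC 7464 "JSON text sequences": each record is prefixed with the RS
# control character (0x1E) and terminated with LF, so the delimiter can
# never collide with JSON content even when records are pretty-printed.
RS = "\x1e"
seq = "".join(RS + json.dumps(r) + "\n" for r in records)
parsed_seq = [json.loads(chunk) for chunk in seq.split(RS) if chunk]

print(parsed_jsonl == parsed_seq == records)
```

A lineSep design that only handles the first framing says nothing about how multi-byte or control-character delimiters like RS would be supported later.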
> json datasource with `encoding` option (forcing lineSep)
`encoding`? Only as an alias for `charset`. We are already using
`charset` in our public release:
https://docs.azuredatabricks.net/spark/latest/data-sources/read-json.html#charset-auto-detection
. I will insist on the `charset` name for the option.
> flexible format PR with another review
OK, it could come as a separate PR. The flexible format just leaves room
for future extensions - nothing more. I would definitely like to discuss how you
are going to extend lineSep in your PR https://github.com/apache/spark/pull/20877
in the future to support JSON Streaming, for example. If you don't have such a
vision, I would prefer to block your PR.
/cc @gatorsmile @cloud-fan @hvanhovell @rxin
---
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]