Github user MaxGekk commented on the issue:
https://github.com/apache/spark/pull/20849
@HyukjinKwon
> How about we go this way with separate PRs?
I agree with that only to unblock
https://github.com/apache/spark/pull/20849, because it solves a real problem for
our customers: reading a folder with many JSON files in UTF-16BE (without a BOM)
in multiline mode. In that case, recordDelimiter (lineSep) is not required.
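To illustrate why that use case needs an explicit charset, here is a minimal stdlib-Python sketch (an illustration only, not Spark code) showing that UTF-16BE bytes without a BOM cannot simply be decoded with the usual UTF-8 default:

```python
import json

# A small JSON document encoded as UTF-16BE; the "utf-16-be" codec
# deliberately does NOT prepend a byte order mark.
doc = '{"name": "café", "id": 1}'
raw = doc.encode("utf-16-be")

# Decoding the bytes as UTF-8 (the common default) fails outright,
# because the non-ASCII code point produces an invalid byte sequence.
try:
    json.loads(raw.decode("utf-8"))
    utf8_ok = True
except UnicodeDecodeError as e:
    utf8_ok = False
    print("utf-8 decode failed:", e.reason)

# With the charset supplied explicitly, parsing succeeds.
parsed = json.loads(raw.decode("utf-16-be"))
print(parsed["name"])
```

Without a BOM there is nothing in the file itself to announce the encoding, which is why the reader has to be told the charset up front.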
> #20877 to support line separator in json datasource
The PR doesn't solve any practical use case because it doesn't address
[JSON Streaming](https://en.wikipedia.org/wiki/JSON_streaming) or
https://github.com/apache/spark/pull/20877#issuecomment-375622342 . It is also
useless for reading JSON in a charset other than UTF-8 in per-line mode without
the PR https://github.com/apache/spark/pull/20849 . I don't know what practical
problem it actually solves. In your tests you check these delimiters:
https://github.com/apache/spark/pull/20877/files#diff-fde14032b0e6ef8086461edf79a27c5dR2112
. Are those delimiters taken from real JSON data?
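For context on what JSON Streaming implies for a record-delimiter option, here is a minimal stdlib-Python sketch (an illustration, not Spark code) of two common framings a configurable lineSep would ideally accommodate:

```python
import json

records = [{"id": 1}, {"id": 2, "note": "has \n inside"}]

# Line-delimited JSON (JSON Lines): one record per line. This is safe here
# because json.dumps (without indent) escapes newlines inside strings, so a
# serialized record never contains a raw newline.
jsonl = "\n".join(json.dumps(r) for r in records)
parsed_jsonl = [json.loads(line) for line in jsonl.split("\n")]

# RFC 7464 "JSON text sequences": each record is prefixed with the RS
# control character (0x1E) and terminated with LF, so the delimiter can
# never collide with JSON content even when records are pretty-printed.
RS = "\x1e"
seq = "".join(RS + json.dumps(r) + "\n" for r in records)
parsed_seq = [json.loads(chunk) for chunk in seq.split(RS) if chunk]

print(parsed_jsonl == parsed_seq == records)
```

A lineSep design that only handles the first framing says nothing about how multi-byte or control-character delimiters like RS would be supported later.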
> json datasource with `encoding` option (forcing lineSep)
`encoding`? Only as an alias for `charset`. We are already using
`charset` in our public release:
https://docs.azuredatabricks.net/spark/latest/data-sources/read-json.html#charset-auto-detection
. I will insist on the `charset` name for the option.
> flexible format PR with another review
OK, it could come as a separate PR. The flexible format just leaves room
for future extensions - nothing more. I would definitely like to discuss how you
are going to extend lineSep in your PR https://github.com/apache/spark/pull/20877
in the future to support JSON Streaming, for example. If you don't have such a
vision, I would prefer to block your PR.
/cc @gatorsmile @cloud-fan @hvanhovell @rxin
---
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]