GitHub user MaxGekk opened a pull request:
https://github.com/apache/spark/pull/21192
[SPARK-24118][SQL] Flexible format for the lineSep option of Text and JSON
datasources
## What changes were proposed in this pull request?
I propose a flexible format for the **lineSep** option used in text-based
datasources such as JSON. The new format of the option has the following syntax:
```
lineSep ::= (selector separator-spec) | text-separator
selector ::= 'x' | '\' | reserved-selector
reserved-selector ::= '\' | 'r'
separator-spec ::= < valid string literal in Python, R, Java and Scala>
text-separator ::= first-char separator-spec
first-char ::= ! selector
```
Examples of lineSep in the new format:
```
x0a.00.00.00 0d.00.00.00
x5445
|^|
\r\n
-
sep
```
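To make the two main branches of the grammar concrete, here is a minimal sketch of how a `lineSep` value could be turned into separator bytes. It is not the code from this patch: `parse_line_sep` and its `encoding` parameter are hypothetical names, and the assumption that `.` and whitespace are the only delimiters allowed between hex bytes is inferred from the examples above. The description does not spell out how the `'\'` and `'r'` selectors are interpreted, so the sketch rejects them.

```python
def parse_line_sep(spec: str, encoding: str = "utf-8") -> bytes:
    """Hypothetical helper: convert a lineSep value into separator bytes."""
    if not spec:
        raise ValueError("lineSep must not be empty")

    selector, rest = spec[0], spec[1:]

    if selector == "x":
        # Hex form: treat '.' like whitespace, which bytes.fromhex() skips.
        # Assumption: '.' and whitespace are the only allowed delimiters.
        separator = bytes.fromhex(rest.replace(".", " "))
        if not separator:
            raise ValueError("hex lineSep must contain at least one byte")
        return separator

    if selector in ("\\", "r"):
        # Semantics of these selectors are not defined in the description.
        raise NotImplementedError(f"selector {selector!r} is not handled here")

    # Plain text separator: the whole string, encoded in the given charset.
    return spec.encode(encoding)


print(parse_line_sep("x0a.00.00.00 0d.00.00.00"))
print(parse_line_sep("x5445"))
print(parse_line_sep("|^|"))
```

Note that under this grammar a plain text separator cannot start with `x`, so a value such as `xyz` would be parsed as (invalid) hex rather than as text.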
The `'\'` and `'r'` selectors are reserved for future use. For instance, `'r'`
could be used for regular expressions like `r[0-9]+` or `r(x1E|x0Ax1E|x0A)` for
parsing [JSON Streaming](https://en.wikipedia.org/wiki/JSON_streaming).
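As an illustration of what a regular-expression separator could do, the sketch below splits a byte stream on the alternatives from the `r(x1E|x0Ax1E|x0A)` example. This is an assumption about future behavior, not something the patch implements: reading `x1E` and `x0A` as the bytes `0x1E` (record separator) and `0x0A` (line feed) is my interpretation, and `split_stream` is a hypothetical name.

```python
import re

# Assumed reading of r(x1E|x0Ax1E|x0A): RS, LF followed by RS, or LF alone.
RECORD_SEP = re.compile(rb"\x1e|\n\x1e|\n")


def split_stream(data: bytes) -> list:
    """Hypothetical helper: split a JSON stream into non-empty raw records."""
    return [part for part in RECORD_SEP.split(data) if part]


# Records framed with a leading RS and a trailing LF.
framed = b'\x1e{"a":1}\n\x1e{"a":2}\n'
# Records separated by LF only.
plain = b'{"a":1}\n{"a":2}'

print(split_stream(framed))
print(split_stream(plain))
```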
The new format addresses the following use cases:
1. The hexadecimal format allows specifying `lineSep` independently of the
encoding. This makes it possible to read JSON files with a BOM in per-line
mode. See https://github.com/apache/spark/pull/20849#issuecomment-377501993
2. JSON files, usually those coming from embedded systems, can have non-standard
separators (invisible in some cases). It is very convenient to open a file in a
hex editor and copy the bytes between }{ into the lineSep option. This is the use
case for the format with the `'x'` selector, for example `x0d 54 45`.
3. In JSON Streaming, records can be separated in quite different ways.
I believe we should leave room for improvement. See the `'r'` (for regexp) and
`'/'` reserved selectors.
4. Some UTF-8 chars can trigger errors from style (format) checkers. It is
easier to represent such chars in hexadecimal format than to disable the
checkers.
5. In the near future, the JSON datasource will support input JSON in different
charsets. If the source code is in UTF-8 but the input JSON is in a different
charset, it is somewhat hard to write such chars as values of the lineSep
option. The `x<hexs>` format is more convenient here as well.
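The first use case can be sketched outside of Spark. The example below splits UTF-16LE input with a BOM on the raw separator bytes `0a 00` (the value that `x0a.00` would denote) and decodes each record afterwards. It only illustrates the idea of an encoding-independent separator and is not how the datasource is implemented; `read_records` is a hypothetical name, and the sample data is made up for the example.

```python
import codecs
import json


def read_records(data: bytes, sep: bytes) -> list:
    """Hypothetical helper: split UTF-16LE bytes on a raw byte separator."""
    # Drop the BOM so that it does not end up inside the first record.
    if data.startswith(codecs.BOM_UTF16_LE):
        data = data[len(codecs.BOM_UTF16_LE):]
    # Split on bytes first, decode afterwards: the separator never needs
    # to be written as characters in the source code.
    return [json.loads(chunk.decode("utf-16-le"))
            for chunk in data.split(sep) if chunk]


# Sample input: two JSON records separated by a line feed, in UTF-16LE with BOM.
sample = codecs.BOM_UTF16_LE + '{"a":1}\n{"a":2}'.encode("utf-16-le")
separator = bytes.fromhex("0a 00")  # line feed as encoded in UTF-16LE

print(read_records(sample, separator))
```

A plain byte search like this can in principle match at an offset that is not aligned to a 2-byte code unit, so a real implementation would need to take alignment into account.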
## How was this patch tested?
The changes are checked by 2 new tests in which JSON files in `UTF-16` and
`UTF-32` with BOM are read. Also 2 new cases for an existing test are added.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/MaxGekk/spark-1 json-flexible-line-sep2
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/21192.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #21192
----
commit 60d5828df1b81b17eedf0bf5d307e4cef2f4453b
Author: Maxim Gekk <maxim.gekk@...>
Date: 2018-04-29T19:33:45Z
Flexible format of the lineSep option
----