GitHub user MaxGekk opened a pull request:

    https://github.com/apache/spark/pull/21192

    [SPARK-24118][SQL] Flexible format for the lineSep option of Text and JSON datasources

    ## What changes were proposed in this pull request?
    
    I propose a flexible format for the **lineSep** option used in text-based
    datasources such as JSON. The new format of the option has the following syntax:
    
    ```
    lineSep ::= (selector separator-spec) | text-separator
    selector ::= 'x' | '\' | reserved-selector
    reserved-selector ::= '/' | 'r'
    separator-spec ::= < valid string literal in Python, R, Java and Scala>
    text-separator ::= first-char separator-spec
    first-char ::= ! selector
    ```
    
    Examples of lineSep in the new format:
    
    ```
    x0a.00.00.00 0d.00.00.00
    x5445
    |^|
    \r\n
    -
    sep
    ```
    The `'/'` and `'r'` selectors are reserved for future use. For instance, `'r'`
    could be used for regular expressions like `r[0-9]+` or `r(x1E|x0Ax1E|x0A)` for
    parsing [JSON streaming](https://en.wikipedia.org/wiki/JSON_streaming).
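
As an illustration only, the grammar above could be interpreted roughly as in the following Python sketch. The function name `parse_line_sep` is hypothetical and this is not the code from the patch. It assumes that everything after the `x` selector other than hex digits is a visual separator, that `r` and `/` are the reserved selectors, and that any other value (including one starting with `\`) is plain text encoded in the given charset.

```python
import re

# Assumption of this sketch: these first characters are reserved for future use.
RESERVED_SELECTORS = ("r", "/")


def parse_line_sep(line_sep: str, charset: str = "UTF-8") -> bytes:
    """Hypothetical helper: turn a lineSep option value into separator bytes."""
    if line_sep == "":
        raise ValueError("lineSep cannot be an empty string")
    if line_sep.startswith("x"):
        # Drop everything except hex digits, so '.', ' ' and the like
        # only serve as visual separators between bytes.
        digits = re.sub(r"[^0-9A-Fa-f]", "", line_sep[1:])
        if not digits or len(digits) % 2 != 0:
            raise ValueError(f"Invalid hexadecimal separator: {line_sep!r}")
        return bytes.fromhex(digits)
    if line_sep.startswith(RESERVED_SELECTORS):
        raise NotImplementedError(f"Selector {line_sep[0]!r} is reserved")
    # Anything else is treated as a text separator in the given charset.
    return line_sep.encode(charset)


# The example values listed above, passed through the sketch.
for value in ["x0a.00.00.00 0d.00.00.00", "x5445", "|^|", "\r\n", "-", "sep"]:
    print(repr(value), "->", parse_line_sep(value).hex(" "))
```

Under these assumptions `x5445` yields the two bytes `54 45`, while `sep` yields the bytes of the text `sep` in the chosen charset.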
    
    The new format addresses the following use cases:
    
    1. The hexadecimal format allows specifying `lineSep` independently of the
    encoding. This makes it possible to read JSON files with a BOM in per-line
    mode. See https://github.com/apache/spark/pull/20849#issuecomment-377501993
    
    2. JSON files produced by embedded systems often have non-standard separators
    (invisible in some cases). It is convenient to open such a file in a hex editor
    and copy the bytes between `}` and `{` into the lineSep option. This is the use
    case for the format with the `'x'` selector, for example `x0d 54 45`.
    
    3. In JSON streaming, records can be separated in quite different ways.
    I believe we should leave room for future extensions. See the reserved
    selectors `'r'` (for regexps) and `'/'`.
    
    4. Some UTF-8 characters can trigger errors from style (format) checkers. It is
    easier to represent such characters in hexadecimal format than to disable the
    checkers.
    
    5. In the near future, the JSON datasource will support input in different
    charsets. If the source code is in UTF-8 but the input JSON is in a different
    charset, it is awkward to put such characters into the lineSep option as-is.
    Here again, the `x<hexs>` format is more convenient.
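
To make the encoding-related use cases concrete, here is a small Python sketch (not Spark code) showing that the hex bytes `0a 00 00 00` are exactly a newline in UTF-32LE, and that a buffer with a BOM can be split on those raw bytes before decoding. The helper `split_records` and the sample records are hypothetical, and the byte-level split is naive: it is adequate for this ASCII-only sample but does not check 4-byte alignment.

```python
import codecs


def split_records(payload: bytes, sep: bytes, encoding: str, bom: bytes = b"") -> list:
    """Hypothetical helper: strip an optional BOM, split on raw bytes, then decode."""
    if bom and payload.startswith(bom):
        payload = payload[len(bom):]
    # Naive split on the separator bytes; empty chunks are skipped.
    return [chunk.decode(encoding) for chunk in payload.split(sep) if chunk]


# Separator written as hex, as it would appear after the 'x' selector.
sep = bytes.fromhex("0a000000")
assert sep == "\n".encode("utf-32-le")  # same bytes as a UTF-32LE newline

# Sample input: two JSON records in UTF-32LE, preceded by a BOM.
records = ['{"id": 1}', '{"id": 2}']
payload = codecs.BOM_UTF32_LE + sep.join(r.encode("utf-32-le") for r in records)

decoded = split_records(payload, sep, "utf-32-le", codecs.BOM_UTF32_LE)
print(decoded)
```

The separator here is given purely as bytes, so it does not depend on the encoding of the source file that contains the option value.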
    
    
    ## How was this patch tested?
    
    The changes are covered by 2 new tests that read JSON files in `UTF-16` and
    `UTF-32` with a BOM. In addition, 2 new cases were added to an existing test.


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/MaxGekk/spark-1 json-flexible-line-sep2

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/21192.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #21192
    
----
commit 60d5828df1b81b17eedf0bf5d307e4cef2f4453b
Author: Maxim Gekk <maxim.gekk@...>
Date:   2018-04-29T19:33:45Z

    Flexible format of the lineSep option

----

