GitHub user MaxGekk opened a pull request:

    https://github.com/apache/spark/pull/20885

    [SPARK-23724][SPARK-23765][SQL] Line separator for the json datasource

    ## What changes were proposed in this pull request?
    
    Currently, [TextInputJsonDataSource](https://github.com/databricks/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/json/JsonDataSource.scala#L86)
    uses [HadoopFileLinesReader](https://github.com/databricks/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/json/JsonDataSource.scala#L125)
    to split JSON files into separate lines. The latter
    [splits](https://github.com/databricks/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/json/JsonDataSource.scala#L125)
    JSON lines with [LineRecordReader](https://github.com/apache/hadoop/blob/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/lib/input/LineRecordReader.java#L68)
    without providing a recordDelimiter. As a consequence, the Hadoop library
    [reads lines terminated by one of CR, LF, or CRLF](https://github.com/apache/hadoop/blob/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/util/LineReader.java#L185-L254).
    The changes allow the line separator to be specified explicitly instead of
    relying on the auto-detection method of the Hadoop library. If the separator
    is not specified, the line separation method of Hadoop is used by default.
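
To illustrate the difference between the two behaviours described above, here is a minimal standalone sketch in plain Python. It is not the Spark or Hadoop implementation: the function name `split_records` and its `line_sep` parameter are hypothetical, chosen only for this example. With no separator it treats CR, LF, and CRLF all as line terminators; with an explicit separator it splits on that byte sequence only.

```python
import re
from typing import List, Optional


def split_records(data: bytes, line_sep: Optional[bytes] = None) -> List[bytes]:
    """Split raw bytes into records (hypothetical helper, illustration only).

    line_sep=None mimics the default described above: any of CR, LF, or
    CRLF ends a line. Otherwise only the given byte sequence ends a line.
    """
    if line_sep is None:
        # Try CRLF first so that it counts as one terminator, not two.
        parts = re.split(rb"\r\n|\r|\n", data)
    else:
        parts = data.split(line_sep)
    # A trailing terminator leaves one empty record at the end; drop it.
    if parts and parts[-1] == b"":
        parts.pop()
    return parts


data = b'{"a":1}\r{"a":2}\n{"a":3}\r\n'

# Default: CR, LF, and CRLF are all terminators, giving three records.
print(split_records(data))

# Explicit LF separator: the lone CR stays inside the first record.
print(split_records(data, line_sep=b"\n"))
```

In the default mode a stray CR inside the data silently starts a new record, which is the situation an explicit separator is meant to avoid.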
    
    ## How was this patch tested?
    
    Added new tests for writing/reading json files with custom line separator

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/MaxGekk/spark-1 json-line-sep

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/20885.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #20885
    
----
commit a794988407b6fd28364f5d993a6a52ac0b85ec5f
Author: Maxim Gekk <maxim.gekk@...>
Date:   2018-02-24T20:11:00Z

    Adding the delimiter option encoded in base64

commit dccdaa2e97cb4e2f6f8ea7e03320cdb05a43668c
Author: Maxim Gekk <maxim.gekk@...>
Date:   2018-02-24T20:59:46Z

    Separator encoded as a sequence of bytes in hex

commit d0abab7e4b74dd42e06972f9484bc712b8f11c63
Author: Maxim Gekk <maxim.gekk@...>
Date:   2018-02-24T21:06:08Z

    Refactoring: removed unused imports and renaming a parameter

commit 674179601b4c82e315eb1156df0f3f5035e91154
Author: Maxim Gekk <maxim.gekk@...>
Date:   2018-03-04T17:24:42Z

    The sep option is renamed to recordSeparator. The supported format is sequence of bytes in hex like x0d 0a

commit e4faae155cb5b0761da9ac72a12f67cdde6b2e6b
Author: Maxim Gekk <maxim.gekk@...>
Date:   2018-03-18T12:40:21Z

    Renaming recordSeparator to recordDelimiter

commit 01f4ef584a2cc1ce460359f260ebbe22808d034e
Author: Maxim Gekk <maxim.gekk@...>
Date:   2018-03-18T13:17:59Z

    Comments for the recordDelimiter option

commit 24cedb9d809b026fa36b01fb2b425918b43857df
Author: Maxim Gekk <maxim.gekk@...>
Date:   2018-03-18T14:36:31Z

    Support other formats of recordDelimiter

commit d40dda22587deaf79cfad3b20ccf6854554fc11d
Author: Maxim Gekk <maxim.gekk@...>
Date:   2018-03-18T16:30:26Z

    Checking different charsets and record delimiters

commit ad6496c6d9415bcd49630272b5d6c327ffcb1378
Author: Maxim Gekk <maxim.gekk@...>
Date:   2018-03-18T16:39:07Z

    Renaming test's method to make it more readable

commit 358863d91bf0c0d9761aa13698eb7f8532e5fc90
Author: Maxim Gekk <maxim.gekk@...>
Date:   2018-03-18T17:20:38Z

    Test of reading json in different charsets and delimiters

commit 7e5be5e2b4cf7f77914a0d91e74ea31ab8c272d0
Author: Maxim Gekk <maxim.gekk@...>
Date:   2018-03-18T20:25:47Z

    Fix inferring of csv schema for any charsets

commit d138d2d4e7b6e0c3e46d73939ff06a875128d59d
Author: Maxim Gekk <maxim.gekk@...>
Date:   2018-03-18T21:02:44Z

    Fix errors of scalastyle check

commit c26ef5d3d2a3970c80c973eec696805929bd7725
Author: Maxim Gekk <maxim.gekk@...>
Date:   2018-03-22T11:20:34Z

    Reserving format for regular expressions and concatenated json

commit 5f0b0694f142bd69127c8991d83a24f528316b2b
Author: Maxim Gekk <maxim.gekk@...>
Date:   2018-03-22T20:18:21Z

    Fix recordDelimiter tests

commit ef8248f862949becdb3d370ac94a1cfc1f7c3068
Author: Maxim Gekk <maxim.gekk@...>
Date:   2018-03-22T20:34:56Z

    Additional cases are added to the delimiter test

commit 2efac082ea4e40b89b4d01274851c0dcdd49eb44
Author: Maxim Gekk <maxim.gekk@...>
Date:   2018-03-22T21:01:56Z

    Renaming recordDelimiter to lineSeparator

commit b2020fa99584d03e1754a4a1b5991dce4875f448
Author: Maxim Gekk <maxim.gekk@...>
Date:   2018-03-22T21:38:33Z

    Adding HyukjinKwon changes

commit f99c1e16f2ad90c2a94e8c4b206b5b740506e136
Author: Maxim Gekk <maxim.gekk@...>
Date:   2018-03-22T22:23:21Z

    Revert lineSepInWrite back to lineSep

----

