sumitsu opened a new pull request #23665: [SPARK-26745][SQL] Skip empty lines in JSON-derived DataFrames when skipParsing optimization in effect
URL: https://github.com/apache/spark/pull/23665
 
 
   ## What changes were proposed in this pull request?
   
   This PR updates [`FailureSafeParser`](https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/FailureSafeParser.scala) so that text-input data sources can optionally supply a "fast" emptiness check for records. The check is applied only when full parsing is disabled, i.e. when `skipParsing == true` (non-multiline input, permissive mode, and an empty schema).
   
   
   [`TextInputJsonDataSource`](https://github.com/apache/spark/blob/7a83d71/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/json/JsonDataSource.scala#L87) now creates its `FailureSafeParser` with an emptiness check that filters out blank (empty or all-whitespace) lines. This resolves **[SPARK-26745](https://issues.apache.org/jira/browse/SPARK-26745)** by preventing `count()` from counting blank lines, which the full parser ignores, when `skipParsing` is enabled.
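
To make the discrepancy concrete, here is a minimal Python model of the behavior described above. It is only a sketch: `FailureSafeParserModel`, `count`, and `is_blank` are hypothetical names invented for illustration and do not mirror the Scala class in Spark. The fast-path condition follows the description in this PR (non-multiline, permissive mode, empty schema); handling of malformed records is deliberately left out.

```python
import json

# Hypothetical stand-in for the parser wrapper discussed in this PR.
# Names and structure are illustrative assumptions, not Spark's actual API.
class FailureSafeParserModel:
    def __init__(self, schema_fields, mode="PERMISSIVE", multi_line=False, is_empty=None):
        self.schema_fields = list(schema_fields)
        # Fast path as described in the PR text:
        # non-multiline + permissive mode + empty schema.
        self.skip_parsing = (
            not multi_line and mode == "PERMISSIVE" and not self.schema_fields
        )
        # Optional "fast" emptiness check supplied by the data source.
        self.is_empty = is_empty

    def parse(self, line):
        """Return the rows produced by one input line (zero or one in this model)."""
        if self.skip_parsing:
            if self.is_empty is not None and self.is_empty(line):
                return []      # blank record filtered out without parsing
            return [()]        # empty row emitted without parsing
        if not line.strip():
            return []          # the full parser ignores blank lines
        record = json.loads(line)  # malformed input is not modeled here
        return [tuple(record.get(f) for f in self.schema_fields)]


def count(parser, lines):
    """Count the rows the parser yields over all input lines."""
    return sum(len(parser.parse(line)) for line in lines)


def is_blank(line):
    """Emptiness check: true for empty or all-whitespace lines."""
    return line.strip() == ""


lines = ['{"a": 1}', "", "   ", '{"a": 2}']

before_fix = count(FailureSafeParserModel([]), lines)                    # fast path, no check
after_fix = count(FailureSafeParserModel([], is_empty=is_blank), lines)  # fast path with check
full_parse = count(FailureSafeParserModel(["a"]), lines)                 # non-empty schema
print(before_fix, after_fix, full_parse)
```

In this model the fast path without the check counts every line, blank or not, while the fast path with the check agrees with the full-parse count.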
   
   ## How was this patch tested?
   
   Existing `JsonSuite` unit tests, supplemented by a new test case which 
verifies that pre-parsing and post-parsing `count()` values are equal for 
JSON-derived DataFrames.
   
   ### `JsonBenchmark` performance test results
   
   The 
[`JsonBenchmark`](https://github.com/apache/spark/blob/master/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/json/JsonBenchmark.scala)
 performance suite was executed on the following branches:
   
   - this PR (branch: `sumitsu/spark:json_emptyline_count`)
   - `apache/spark:master` branch
   - `apache/spark:master` branch, modified to **not** use the 
[SPARK-24959](https://issues.apache.org/jira/browse/SPARK-24959) optimization
   
   The no-optimization code base was simulated by hard-coding the 
[`skipParsing` flag in 
`FailureSafeParser`](https://github.com/apache/spark/blob/11e5f1b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/FailureSafeParser.scala#L54)
 to `false` for the test.
   
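   Compared with the no-optimization scenario, this PR preserves most of the [SPARK-24959](https://issues.apache.org/jira/browse/SPARK-24959) performance gains: by best time, `count()` takes 2740 ms with this PR versus 13806 ms with the optimization disabled. There is a performance regression compared with `master`, which is small for `count()` (2740 ms versus 2346 ms) and larger for counting a short column (18986 ms versus 12871 ms).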
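
The comparison can be reproduced from the best times reported in the tables in this description. The snippet below is only an arithmetic sketch: the dictionary keys and the `ratios` helper are made up for illustration, and the numbers are single-run best times copied from those tables, so the ratios should be read as rough indications rather than precise measurements.

```python
# Best times in milliseconds, copied from the benchmark tables in this description.
# "pr" = this PR, "master" = apache/spark:master, "no_opt" = optimization disabled.
best_ms = {
    "count a short column (No encoding)": {"pr": 18986, "master": 12871, "no_opt": 39014},
    "count a wide column (No encoding)": {"pr": 39076, "master": 38650, "no_opt": 62555},
    "count()": {"pr": 2740, "master": 2346, "no_opt": 13806},
}


def ratios(times):
    """Slowdown of the PR relative to master, and speedup of the PR over no-opt."""
    return {
        "pr_vs_master": times["pr"] / times["master"],
        "no_opt_vs_pr": times["no_opt"] / times["pr"],
    }


summary = {name: ratios(times) for name, times in best_ms.items()}

for name, r in summary.items():
    print(
        f"{name}: PR/master = {r['pr_vs_master']:.2f}x, "
        f"no-opt/PR = {r['no_opt_vs_pr']:.2f}x"
    )
```

A `PR/master` ratio above 1 means the PR is slower than `master` on that benchmark; a `no-opt/PR` ratio above 1 means the PR is still faster than the build with the optimization disabled.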
   
   Summary charts:
   
   - [all 
benchmarks](https://drive.google.com/open?id=1-ess-XSaOnMOI7NgFbIbhsnblbL3bH5e)
   - [count-without-select benchmarks 
only](https://drive.google.com/open?id=1iJyd_ju31wFvNP-65MwNdqaR3uCmLLQJ)
   
   Test environment:
   ```
   Java HotSpot(TM) 64-Bit Server VM 1.8.0_162-b12 on Mac OS X 10.12.6
   Intel(R) Core(TM) i7-4770HQ CPU @ 2.20GHz
   ```
   
   #### with changes in this PR (branch: `sumitsu/spark:json_emptyline_count`)
   
   ```
   JSON schema inferring:                   Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
   ------------------------------------------------------------------------------------------------
   No encoding                               397395 / 422450          0.3        3973.9       1.0X
   UTF-8 is set                              430505 / 436580          0.2        4305.1       0.9X

   count a short column:                    Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
   ------------------------------------------------------------------------------------------------
   No encoding                                 18986 / 19018          5.3         189.9       1.0X
   UTF-8 is set                                18848 / 18954          5.3         188.5       1.0X

   count a wide column:                     Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
   ------------------------------------------------------------------------------------------------
   No encoding                                 39076 / 39130          0.3        3907.6       1.0X
   UTF-8 is set                                39383 / 39455          0.3        3938.3       1.0X

   Select a subset of 10 columns:           Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
   ------------------------------------------------------------------------------------------------
   Select 10 columns + count()                 14586 / 14904          0.7        1458.6       1.0X
   Select 1 column + count()                   10969 / 10992          0.9        1096.9       1.3X
   count()                                       2740 / 2755          3.6         274.0       5.3X

   creation of JSON parser per line:        Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
   ------------------------------------------------------------------------------------------------
   Short column without encoding                 6822 / 6870          1.5         682.2       1.0X
   Short column with UTF-8                       8901 / 8937          1.1         890.1       0.8X
   Wide column without encoding              140199 / 140659          0.1       14019.9       0.0X
   Wide column with UTF-8                    158228 / 158439          0.1       15822.8       0.0X
   ```
   
   #### `apache/spark:master` branch
   
   ```
   JSON schema inferring:                   Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
   ------------------------------------------------------------------------------------------------
   No encoding                               376210 / 378100          0.3        3762.1       1.0X
   UTF-8 is set                              410952 / 414711          0.2        4109.5       0.9X

   count a short column:                    Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
   ------------------------------------------------------------------------------------------------
   No encoding                                 12871 / 12904          7.8         128.7       1.0X
   UTF-8 is set                                12857 / 12932          7.8         128.6       1.0X

   count a wide column:                     Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
   ------------------------------------------------------------------------------------------------
   No encoding                                 38650 / 38680          0.3        3865.0       1.0X
   UTF-8 is set                                38751 / 38774          0.3        3875.1       1.0X

   Select a subset of 10 columns:           Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
   ------------------------------------------------------------------------------------------------
   Select 10 columns + count()                 14570 / 14986          0.7        1457.0       1.0X
   Select 1 column + count()                   11410 / 11757          0.9        1141.0       1.3X
   count()                                       2346 / 2367          4.3         234.6       6.2X

   creation of JSON parser per line:        Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
   ------------------------------------------------------------------------------------------------
   Short column without encoding                 6596 / 6708          1.5         659.6       1.0X
   Short column with UTF-8                       8867 / 8902          1.1         886.7       0.7X
   Wide column without encoding              139712 / 139725          0.1       13971.2       0.0X
   Wide column with UTF-8                    156809 / 156832          0.1       15680.9       0.0X
   ```
   
   #### `apache/spark:master` branch, optimization disabled
   
   ```
   JSON schema inferring:                   Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
   ------------------------------------------------------------------------------------------------
   No encoding                               375309 / 376301          0.3        3753.1       1.0X
   UTF-8 is set                              442666 / 448741          0.2        4426.7       0.8X

   count a short column:                    Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
   ------------------------------------------------------------------------------------------------
   No encoding                                 39014 / 39036          2.6         390.1       1.0X
   UTF-8 is set                                66988 / 67107          1.5         669.9       0.6X

   count a wide column:                     Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
   ------------------------------------------------------------------------------------------------
   No encoding                                 62555 / 62712          0.2        6255.5       1.0X
   UTF-8 is set                                85354 / 85509          0.1        8535.4       0.7X

   Select a subset of 10 columns:           Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
   ------------------------------------------------------------------------------------------------
   Select 10 columns + count()                 17173 / 17249          0.6        1717.3       1.0X
   Select 1 column + count()                   11503 / 11514          0.9        1150.3       1.5X
   count()                                     13806 / 13849          0.7        1380.6       1.2X

   creation of JSON parser per line:        Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
   ------------------------------------------------------------------------------------------------
   Short column without encoding                 6388 / 6432          1.6         638.8       1.0X
   Short column with UTF-8                       8910 / 8923          1.1         891.0       0.7X
   Wide column without encoding              135854 / 135964          0.1       13585.4       0.0X
   Wide column with UTF-8                    154108 / 154186          0.1       15410.8       0.0X
   ```
