sumitsu opened a new pull request #23665: [SPARK-26745][SQL] Skip empty lines in JSON-derived DataFrames when skipParsing optimization in effect URL: https://github.com/apache/spark/pull/23665 ## What changes were proposed in this pull request? This PR updates [`FailureSafeParser`](https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/FailureSafeParser.scala) to allow text-input data sources to optionally specify a "fast" emptiness check for records, to be applied in cases where full parsing is disabled (i.e. where `skipParsing==true`: non-multiline + permissive-mode + empty schema). [`TextInputJsonDataSource`](https://github.com/apache/spark/blob/7a83d71/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/json/JsonDataSource.scala#L87) is updated such that it creates `FailureSafeParser` with an emptiness check which filters out blank (or all-whitespace) lines. This behavior resolves **[SPARK-26745](https://issues.apache.org/jira/browse/SPARK-26745)** by preventing `count()` from including blank lines (which the full parser ignores) under conditions where `skipParsing` is enabled. ## How was this patch tested? Existing `JsonSuite` unit tests, supplemented by a new test case which verifies that pre-parsing and post-parsing `count()` values are equal for JSON-derived DataFrames. ### `JsonBenchmark` performance test results The [`JsonBenchmark`](https://github.com/apache/spark/blob/master/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/json/JsonBenchmark.scala) performance suite was executed on the following branches: - this PR (branch: `sumitsu/spark:json_emptyline_count`) - `apache/spark:master` branch - `apache/spark:master` branch, modified to **not** use the [SPARK-24959](https://issues.apache.org/jira/browse/SPARK-24959) optimization The no-optimization code base was simulated by hard-coding the [`skipParsing` flag in `FailureSafeParser`](https://github.com/apache/spark/blob/11e5f1b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/FailureSafeParser.scala#L54) to `false` for the test. Compared with the no-optimization scenario, this PR appears to preserve most of the [SPARK-24959](https://issues.apache.org/jira/browse/SPARK-24959) optimization performance gains, but there is a small performance regression compared with `master`. Summary charts: - [all benchmarks](https://drive.google.com/open?id=1-ess-XSaOnMOI7NgFbIbhsnblbL3bH5e) - [count-without-select benchmarks only](https://drive.google.com/open?id=1iJyd_ju31wFvNP-65MwNdqaR3uCmLLQJ) test environment: ``` Java HotSpot(TM) 64-Bit Server VM 1.8.0_162-b12 on Mac OS X 10.12.6 Intel(R) Core(TM) i7-4770HQ CPU @ 2.20GHz ``` #### with changes in this PR (branch: `sumitsu/spark:json_emptyline_count`) ``` JSON schema inferring: Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative ------------------------------------------------------------------------------------------------ No encoding 397395 / 422450 0.3 3973.9 1.0X UTF-8 is set 430505 / 436580 0.2 4305.1 0.9X count a short column: Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative ------------------------------------------------------------------------------------------------ No encoding 18986 / 19018 5.3 189.9 1.0X UTF-8 is set 18848 / 18954 5.3 188.5 1.0X count a wide column: Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative ------------------------------------------------------------------------------------------------ No encoding 39076 / 39130 0.3 3907.6 1.0X UTF-8 is set 39383 / 39455 0.3 3938.3 1.0X Select a subset of 10 columns: Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative ------------------------------------------------------------------------------------------------ Select 10 columns + count() 14586 / 14904 0.7 1458.6 1.0X Select 1 column + count() 10969 / 10992 0.9 1096.9 1.3X count() 2740 / 2755 3.6 274.0 5.3X creation of JSON parser per line: Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative ------------------------------------------------------------------------------------------------ Short column without encoding 6822 / 6870 1.5 682.2 1.0X Short column with UTF-8 8901 / 8937 1.1 890.1 0.8X Wide column without encoding 140199 / 140659 0.1 14019.9 0.0X Wide column with UTF-8 158228 / 158439 0.1 15822.8 0.0X ``` #### `apache/spark:master` branch ``` JSON schema inferring: Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative ------------------------------------------------------------------------------------------------ No encoding 376210 / 378100 0.3 3762.1 1.0X UTF-8 is set 410952 / 414711 0.2 4109.5 0.9X count a short column: Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative ------------------------------------------------------------------------------------------------ No encoding 12871 / 12904 7.8 128.7 1.0X UTF-8 is set 12857 / 12932 7.8 128.6 1.0X count a wide column: Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative ------------------------------------------------------------------------------------------------ No encoding 38650 / 38680 0.3 3865.0 1.0X UTF-8 is set 38751 / 38774 0.3 3875.1 1.0X Select a subset of 10 columns: Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative ------------------------------------------------------------------------------------------------ Select 10 columns + count() 14570 / 14986 0.7 1457.0 1.0X Select 1 column + count() 11410 / 11757 0.9 1141.0 1.3X count() 2346 / 2367 4.3 234.6 6.2X creation of JSON parser per line: Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative ------------------------------------------------------------------------------------------------ Short column without encoding 6596 / 6708 1.5 659.6 1.0X Short column with UTF-8 8867 / 8902 1.1 886.7 0.7X Wide column without encoding 139712 / 139725 0.1 13971.2 0.0X Wide column with UTF-8 156809 / 156832 0.1 15680.9 0.0X ``` #### optimization disabled ``` JSON schema inferring: Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative ------------------------------------------------------------------------------------------------ No encoding 375309 / 376301 0.3 3753.1 1.0X UTF-8 is set 442666 / 448741 0.2 4426.7 0.8X count a short column: Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative ------------------------------------------------------------------------------------------------ No encoding 39014 / 39036 2.6 390.1 1.0X UTF-8 is set 66988 / 67107 1.5 669.9 0.6X count a wide column: Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative ------------------------------------------------------------------------------------------------ No encoding 62555 / 62712 0.2 6255.5 1.0X UTF-8 is set 85354 / 85509 0.1 8535.4 0.7X Select a subset of 10 columns: Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative ------------------------------------------------------------------------------------------------ Select 10 columns + count() 17173 / 17249 0.6 1717.3 1.0X Select 1 column + count() 11503 / 11514 0.9 1150.3 1.5X count() 13806 / 13849 0.7 1380.6 1.2X creation of JSON parser per line: Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative ------------------------------------------------------------------------------------------------ Short column without encoding 6388 / 6432 1.6 638.8 1.0X Short column with UTF-8 8910 / 8923 1.1 891.0 0.7X Wide column without encoding 135854 / 135964 0.1 13585.4 0.0X Wide column with UTF-8 154108 / 154186 0.1 15410.8 0.0X ```
---------------------------------------------------------------- This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: [email protected] With regards, Apache Git Services --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
