[GitHub] [spark] sadikovi commented on pull request #42667: [SPARK-44940][SQL] Improve performance of JSON parsing when "spark.sql.json.enablePartialResults" is enabled

via GitHub Sun, 10 Sep 2023 20:28:29 -0700


sadikovi commented on PR #42667:
URL: https://github.com/apache/spark/pull/42667#issuecomment-1713103692


   @dongjoon-hyun I reran the JSON benchmark and it seems like the previous 
results that I published were noisy. I confirmed there is no apparent 
regression in the patch.
   
   I ran only the `Json files in the per-line mode` benchmark, for 10 iterations. Results:
   
   Without the patch (latest master 
https://github.com/apache/spark/commit/eb0b09f0f2b518915421365a61d1f3d7d58b4404 
with the patch reverted):
   ```
   [info] OpenJDK 64-Bit Server VM 1.8.0_292-8u292-b10-0ubuntu1~18.04-b10 on Linux 5.4.0-1045-aws
   [info] Intel(R) Xeon(R) Platinum 8375C CPU @ 2.90GHz
   [info] Json files in the per-line mode:          Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
   [info] ------------------------------------------------------------------------------------------------------------------------
   [info] Text read                                           463            476          15         10.8          92.6       1.0X
   [info] Schema inferring                                   2126           2166          48          2.4         425.1       0.2X
   [info] Parsing without charset                            3195           3201           4          1.6         638.9       0.1X
   [info] Parsing with UTF-8                                 4129           4140           8          1.2         825.8       0.1X
   ```
   
   With the patch (latest master):
   ```
   [info] OpenJDK 64-Bit Server VM 1.8.0_292-8u292-b10-0ubuntu1~18.04-b10 on Linux 5.4.0-1045-aws
   [info] Intel(R) Xeon(R) Platinum 8375C CPU @ 2.90GHz
   [info] Json files in the per-line mode:          Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
   [info] ------------------------------------------------------------------------------------------------------------------------
   [info] Text read                                           459            467           7         10.9          91.8       1.0X
   [info] Schema inferring                                   2159           2198          45          2.3         431.7       0.2X
   [info] Parsing without charset                            3106           3119          12          1.6         621.2       0.1X
   [info] Parsing with UTF-8                                 4071           4090          10          1.2         814.2       0.1X
   ```
   
   The results are approximately the same with and without the patch. However, the benchmark results tend to fluctuate quite a lot between runs. For example, here is a second run with the patch, with no code changes in between:
   ```
   [info] OpenJDK 64-Bit Server VM 1.8.0_292-8u292-b10-0ubuntu1~18.04-b10 on Linux 5.4.0-1045-aws
   [info] Intel(R) Xeon(R) Platinum 8375C CPU @ 2.90GHz
   [info] Json files in the per-line mode:          Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
   [info] ------------------------------------------------------------------------------------------------------------------------
   [info] Text read                                           458            469           9         10.9          91.5       1.0X
   [info] Schema inferring                                   2147           2184          48          2.3         429.4       0.2X
   [info] Parsing without charset                            3294           3308          10          1.5         658.8       0.1X
   [info] Parsing with UTF-8                                 4437           4444           8          1.1         887.4       0.1X
   ```
   
   I think this is fine: the differences are just noise in the benchmark, and there is no apparent regression caused by the patch.
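
As a rough check of the noise claim, the Best Time values from the three tables above can be compared directly. The following is a minimal Python sketch, not part of the benchmark suite: the names `BEST_MS`, `pct_change`, `patch_effect`, and `rerun_noise` are illustrative, and only the numbers are taken from the tables.

```python
# Best Time (ms) per case, copied from the three benchmark tables in this comment.
CASES = [
    "Text read",
    "Schema inferring",
    "Parsing without charset",
    "Parsing with UTF-8",
]
BEST_MS = {
    "without_patch": [463, 2126, 3195, 4129],
    "with_patch_run1": [459, 2159, 3106, 4071],
    "with_patch_run2": [458, 2147, 3294, 4437],
}


def pct_change(base, other):
    """Percentage change of `other` relative to `base`, one value per case."""
    return [100.0 * (o - b) / b for b, o in zip(base, other)]


# Difference attributable to the patch (first patched run vs. unpatched run).
patch_effect = pct_change(BEST_MS["without_patch"], BEST_MS["with_patch_run1"])
# Difference between two runs of the same patched code (pure run-to-run variation).
rerun_noise = pct_change(BEST_MS["with_patch_run1"], BEST_MS["with_patch_run2"])

for name, effect, noise in zip(CASES, patch_effect, rerun_noise):
    print(f"{name:<26} patch vs. no patch: {effect:+5.1f}%   rerun, same code: {noise:+5.1f}%")
```

With these numbers, the patched and unpatched runs differ by less than 3% in every case, while the two runs of identical patched code differ by roughly 6% and 9% on the two parsing cases. The run-to-run variation is therefore larger than any difference between the patched and unpatched builds, which is consistent with treating the earlier result as noise.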


