Re: [I] Performance issue on TPCH comparing with sparksql [incubator-wayang]

via GitHub Wed, 27 Mar 2024 14:00:30 -0700


wangxiaoying commented on issue #423:
URL: 
https://github.com/apache/incubator-wayang/issues/423#issuecomment-2023979455


   Thank you @zkaoudi for the quick response!
   
   > In the meantime, can you confirm that the operations executed in postgres 
and in Spark with SparkSQL are the same when executed in Wayang?
   
   Yes. I checked the log on the postgres side. Here is the log of the fetch 
queries when using wayang:
   ```
   2024-03-27 20:33:44.571 UTC [13283] LOG:  execute <unnamed>: SELECT l_orderkey, l_extendedprice, l_discount FROM LINEITEM WHERE l_shipDate > date('1995-03-15')
   2024-03-27 20:34:10.255 UTC [13284] LOG:  execute <unnamed>: BEGIN
   2024-03-27 20:34:10.256 UTC [13284] LOG:  execute <unnamed>: SET extra_float_digits = 3
   2024-03-27 20:34:10.256 UTC [13284] LOG:  execute <unnamed>: SET application_name = 'PostgreSQL JDBC Driver'
   2024-03-27 20:34:10.257 UTC [13284] LOG:  execute <unnamed>: COMMIT
   2024-03-27 20:34:10.258 UTC [13284] LOG:  execute <unnamed>: SELECT c_custkey FROM CUSTOMER WHERE c_mktsegment LIKE 'BUILDING%'
   2024-03-27 20:34:10.774 UTC [13285] LOG:  execute <unnamed>: BEGIN
   2024-03-27 20:34:10.775 UTC [13285] LOG:  execute <unnamed>: SET extra_float_digits = 3
   2024-03-27 20:34:10.775 UTC [13285] LOG:  execute <unnamed>: SET application_name = 'PostgreSQL JDBC Driver'
   2024-03-27 20:34:10.776 UTC [13285] LOG:  execute <unnamed>: COMMIT
   2024-03-27 20:34:10.810 UTC [13285] LOG:  execute <unnamed>: SELECT o_orderkey, o_custkey, o_orderdate, o_shippriority FROM ORDERS WHERE o_orderdate < date('1995-03-15')
   ```
   
   And this is the log when using sparksql:
   ```
   2024-03-27 20:37:25.668 UTC [13302] LOG:  execute <unnamed>: SELECT "c_custkey","c_mktsegment" FROM public.customer  WHERE ("c_custkey" IS NOT NULL)
   2024-03-27 20:37:25.701 UTC [13300] LOG:  execute <unnamed>: SELECT "o_orderkey","o_custkey","o_orderdate","o_shippriority" FROM public.orders  WHERE ("o_orderdate" IS NOT NULL) AND ("o_orderdate" < '1995-03-15') AND ("o_custkey" IS NOT NULL) AND ("o_orderkey" IS NOT NULL)
   2024-03-27 20:37:25.701 UTC [13301] LOG:  execute <unnamed>: SELECT "l_orderkey","l_extendedprice","l_discount" FROM public.lineitem  WHERE ("l_shipdate" IS NOT NULL) AND ("l_shipdate" > '1995-03-15') AND ("l_orderkey" IS NOT NULL)
   ```
   
   In general, postgres does similar work under the two setups. SparkSQL 
generates additional `IS NOT NULL` filters, but they do not filter out any 
data since the TPC-H dataset does not contain NULL values. In addition, it 
does not push down the `LIKE 'BUILDING%'` predicate as wayang does, which may 
cause more data to be transferred for spark (although the customer table is 
not that big compared to lineitem).
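   
   To make the difference concrete, here is a small in-memory sketch of the two 
fetch strategies. It does not talk to postgres or use any Wayang or Spark API; 
`PushdownSketch`, `Customer`, and the five rows are invented for illustration 
only. It shows that an `IS NOT NULL` filter on NULL-free data keeps every row, 
while evaluating the `LIKE 'BUILDING%'` predicate at the source reduces the 
number of rows transferred.
   
   ```java
   import java.util.List;
   import java.util.stream.Collectors;

   class PushdownSketch {
       // Hypothetical stand-in for a CUSTOMER row (only the two logged columns).
       static final class Customer {
           final Integer custKey;
           final String mktSegment;
           Customer(Integer custKey, String mktSegment) {
               this.custKey = custKey;
               this.mktSegment = mktSegment;
           }
       }

       // Toy data invented for this sketch; it is not TPC-H output.
       static List<Customer> table() {
           return List.of(
               new Customer(1, "BUILDING"),
               new Customer(2, "AUTOMOBILE"),
               new Customer(3, "BUILDING"),
               new Customer(4, "MACHINERY"),
               new Customer(5, "HOUSEHOLD"));
       }

       // Mimics the SparkSQL fetch: only "c_custkey IS NOT NULL" runs at the source.
       static List<Customer> fetchWithNotNullOnly() {
           return table().stream()
               .filter(c -> c.custKey != null)
               .collect(Collectors.toList());
       }

       // Mimics the wayang fetch: LIKE 'BUILDING%' is evaluated at the source.
       static List<Customer> fetchWithPushdown() {
           return table().stream()
               .filter(c -> c.mktSegment.startsWith("BUILDING"))
               .collect(Collectors.toList());
       }

       // The segment filter applied later, inside the engine, on fetched rows.
       static List<Customer> filterInEngine(List<Customer> rows) {
           return rows.stream()
               .filter(c -> c.mktSegment.startsWith("BUILDING"))
               .collect(Collectors.toList());
       }

       public static void main(String[] args) {
           System.out.println(fetchWithNotNullOnly().size());                 // prints 5
           System.out.println(fetchWithPushdown().size());                    // prints 2
           System.out.println(filterInEngine(fetchWithNotNullOnly()).size()); // prints 2
       }
   }
   ```
   
   Both paths end with the same two rows; only the number of rows crossing the 
database boundary differs.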
   
   **P.S.**
   Another thing I found when checking the log is that spark issues the three 
SQL queries in parallel, while wayang issues them one by one. I tried to 
enable parallelism in wayang by setting 
`wayang.core.optimizer.enumeration.parallel-tasks = true`, but it throws an 
exception:
   ```
   Exception in thread "Thread-0" java.util.ConcurrentModificationException
        at java.base/java.util.HashMap.computeIfAbsent(HashMap.java:1134)
        at org.apache.wayang.core.platform.CrossPlatformExecutor.getOrCreateExecutorFor(CrossPlatformExecutor.java:391)
        at org.apache.wayang.core.platform.CrossPlatformExecutor$ParallelExecutionThread.run(CrossPlatformExecutor.java:1104)
        at java.base/java.lang.Thread.run(Thread.java:834)
   ```
   I think it is due to a race on the `executors` member inside 
`CrossPlatformExecutor.java`.
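   
   A minimal sketch of the kind of race suspected here, not the actual Wayang 
code: the stack trace shows a plain `HashMap`, which is not safe when several 
threads call `computeIfAbsent` at once, whereas 
`ConcurrentHashMap.computeIfAbsent` applies the factory atomically, at most 
once per key. `ExecutorCacheSketch`, `FakeExecutor`, and the platform names are 
hypothetical stand-ins, and whether swapping the map type is the right fix for 
`CrossPlatformExecutor` is an assumption that would need checking against the 
real class.
   
   ```java
   import java.util.Map;
   import java.util.concurrent.ConcurrentHashMap;
   import java.util.concurrent.CountDownLatch;
   import java.util.concurrent.atomic.AtomicInteger;

   class ExecutorCacheSketch {
       // Hypothetical stand-in for a per-platform executor.
       static final class FakeExecutor {
           final String platform;
           FakeExecutor(String platform) { this.platform = platform; }
       }

       // Thread-safe map: computeIfAbsent is atomic per key.
       private final Map<String, FakeExecutor> executors = new ConcurrentHashMap<>();
       // Counts how many executors were actually constructed.
       private final AtomicInteger created = new AtomicInteger();

       FakeExecutor getOrCreateExecutorFor(String platform) {
           return executors.computeIfAbsent(platform, p -> {
               created.incrementAndGet();
               return new FakeExecutor(p);
           });
       }

       // Starts `threads` workers that all hit the cache at the same moment
       // and returns how many executors were constructed in total.
       static int runConcurrently(int threads) {
           ExecutorCacheSketch cache = new ExecutorCacheSketch();
           CountDownLatch start = new CountDownLatch(1);
           Thread[] workers = new Thread[threads];
           for (int i = 0; i < threads; i++) {
               final String platform = (i % 2 == 0) ? "postgres" : "spark";
               workers[i] = new Thread(() -> {
                   try {
                       start.await(); // line all workers up before releasing them
                   } catch (InterruptedException e) {
                       Thread.currentThread().interrupt();
                       return;
                   }
                   cache.getOrCreateExecutorFor(platform);
               });
               workers[i].start();
           }
           start.countDown();
           for (Thread t : workers) {
               try {
                   t.join();
               } catch (InterruptedException e) {
                   Thread.currentThread().interrupt();
                   throw new IllegalStateException("interrupted while joining", e);
               }
           }
           return cache.created.get();
       }

       public static void main(String[] args) {
           // Two distinct platforms, so exactly two executors are created.
           System.out.println(runConcurrently(8)); // prints 2
       }
   }
   ```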
   
   This can be one of the reasons for the performance difference, but I think 
the difference in the later execution inside the spark platform is more 
significant for the query as a whole.


