wangxiaoying commented on issue #423:
URL:
https://github.com/apache/incubator-wayang/issues/423#issuecomment-2023979455
Thank you @zkaoudi for the quick response!
> In the meantime, can you confirm that the operations executed in postgres
> and in Spark with SparkSQL are the same when executed in Wayang?

Yes. I checked the log on the postgres side. Here are the data-fetching
queries logged when using wayang:
```
2024-03-27 20:33:44.571 UTC [13283] LOG: execute <unnamed>: SELECT
l_orderkey, l_extendedprice, l_discount FROM LINEITEM WHERE l_shipDate >
date('1995-03-15')
2024-03-27 20:34:10.255 UTC [13284] LOG: execute <unnamed>: BEGIN
2024-03-27 20:34:10.256 UTC [13284] LOG: execute <unnamed>: SET
extra_float_digits = 3
2024-03-27 20:34:10.256 UTC [13284] LOG: execute <unnamed>: SET
application_name = 'PostgreSQL JDBC Driver'
2024-03-27 20:34:10.257 UTC [13284] LOG: execute <unnamed>: COMMIT
2024-03-27 20:34:10.258 UTC [13284] LOG: execute <unnamed>: SELECT
c_custkey FROM CUSTOMER WHERE c_mktsegment LIKE 'BUILDING%'
2024-03-27 20:34:10.774 UTC [13285] LOG: execute <unnamed>: BEGIN
2024-03-27 20:34:10.775 UTC [13285] LOG: execute <unnamed>: SET
extra_float_digits = 3
2024-03-27 20:34:10.775 UTC [13285] LOG: execute <unnamed>: SET
application_name = 'PostgreSQL JDBC Driver'
2024-03-27 20:34:10.776 UTC [13285] LOG: execute <unnamed>: COMMIT
2024-03-27 20:34:10.810 UTC [13285] LOG: execute <unnamed>: SELECT
o_orderkey, o_custkey, o_orderdate, o_shippriority FROM ORDERS WHERE
o_orderdate < date('1995-03-15')
```
And this is the log when using sparksql:
```
2024-03-27 20:37:25.668 UTC [13302] LOG: execute <unnamed>: SELECT
"c_custkey","c_mktsegment" FROM public.customer WHERE ("c_custkey" IS NOT NULL)
2024-03-27 20:37:25.701 UTC [13300] LOG: execute <unnamed>: SELECT
"o_orderkey","o_custkey","o_orderdate","o_shippriority" FROM public.orders
WHERE ("o_orderdate" IS NOT NULL) AND ("o_orderdate" < '1995-03-15') AND
("o_custkey" IS NOT NULL) AND ("o_orderkey" IS NOT NULL)
2024-03-27 20:37:25.701 UTC [13301] LOG: execute <unnamed>: SELECT
"l_orderkey","l_extendedprice","l_discount" FROM public.lineitem WHERE
("l_shipdate" IS NOT NULL) AND ("l_shipdate" > '1995-03-15') AND ("l_orderkey"
IS NOT NULL)
```
In general, postgres does similar work under the two setups. sparksql
generates additional `IS NOT NULL` filters, but they don't actually filter
out any data since the TPC-H dataset does not contain NULL values. In
addition, sparksql doesn't push down the `LIKE 'BUILDING%'` predicate as
wayang does, which may cause more data to be transferred for spark (although
the customer table is not that big compared to lineitem).
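To make the transfer difference concrete, here is a minimal sketch on a toy in-memory stand-in for the customer table. The class `PushdownSketch`, its methods, and the five rows are made up for illustration; nothing here is wayang or spark code. It only shows that both strategies end with the same rows, while the number of rows leaving the "database" differs.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class PushdownSketch {

    /** Toy stand-in for a CUSTOMER row; the values used below are made up. */
    static final class Customer {
        final Integer custKey;   // nullable on purpose, like a SQL column
        final String mktSegment;

        Customer(Integer custKey, String mktSegment) {
            this.custKey = custKey;
            this.mktSegment = mktSegment;
        }
    }

    /** Hypothetical five-row table with no NULL keys, as in TPC-H. */
    static List<Customer> toyTable() {
        return Arrays.asList(
                new Customer(1, "BUILDING"),
                new Customer(2, "AUTOMOBILE"),
                new Customer(3, "BUILDING"),
                new Customer(4, "MACHINERY"),
                new Customer(5, "HOUSEHOLD"));
    }

    /** Equivalent of: WHERE c_mktsegment LIKE 'BUILDING%'. */
    static List<Customer> filterSegment(List<Customer> rows) {
        List<Customer> out = new ArrayList<>();
        for (Customer c : rows) {
            if (c.mktSegment.startsWith("BUILDING")) {
                out.add(c);
            }
        }
        return out;
    }

    /** Equivalent of: WHERE c_custkey IS NOT NULL. */
    static List<Customer> filterNotNullKey(List<Customer> rows) {
        List<Customer> out = new ArrayList<>();
        for (Customer c : rows) {
            if (c.custKey != null) {
                out.add(c);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<Customer> table = toyTable();
        // Predicate evaluated in the database: only matching rows are transferred.
        List<Customer> pushedDown = filterSegment(table);
        // Only IS NOT NULL evaluated in the database: every row is transferred,
        // and the segment filter runs afterwards on the engine side.
        List<Customer> transferred = filterNotNullKey(table);
        List<Customer> filteredLater = filterSegment(transferred);
        System.out.println("transferred with pushdown:    " + pushedDown.size());
        System.out.println("transferred without pushdown: " + transferred.size());
        System.out.println("final rows in both cases:     " + filteredLater.size());
    }
}
```

On real data the gap depends on the selectivity of the predicate and the table size, so this says nothing about how much time the missing pushdown actually costs here.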
**P.S.**
Another thing I found when checking the log is that spark issues the
three sql queries in parallel, while wayang issues them one by one. I tried to
enable parallelism in wayang by setting
`wayang.core.optimizer.enumeration.parallel-tasks = true`, but it gives me
an exception:
```
Exception in thread "Thread-0" java.util.ConcurrentModificationException
	at java.base/java.util.HashMap.computeIfAbsent(HashMap.java:1134)
	at org.apache.wayang.core.platform.CrossPlatformExecutor.getOrCreateExecutorFor(CrossPlatformExecutor.java:391)
	at org.apache.wayang.core.platform.CrossPlatformExecutor$ParallelExecutionThread.run(CrossPlatformExecutor.java:1104)
	at java.base/java.lang.Thread.run(Thread.java:834)
```
I think it is due to a race on the `executors` member inside
`CrossPlatformExecutor.java`.
This can be one of the reasons for the performance difference, but I think
the later execution difference inside the spark platform is more significant
in terms of the whole query.
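For the exception above, here is a minimal sketch of the kind of change I have in mind, assuming the cause really is several `ParallelExecutionThread`s calling `computeIfAbsent` on a plain `HashMap` at the same time. `ExecutorRegistrySketch` and `FakeExecutor` are hypothetical stand-ins, not the actual wayang classes, and the real `executors` field may have different key and value types.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.AtomicInteger;

final class ExecutorRegistrySketch {

    /** Hypothetical stand-in for a platform executor; not wayang's real type. */
    static final class FakeExecutor {
        final String platform;

        FakeExecutor(String platform) {
            this.platform = platform;
        }
    }

    // ConcurrentHashMap.computeIfAbsent is atomic per key, so the factory
    // below runs at most once per platform even under concurrent calls.
    private final Map<String, FakeExecutor> executors = new ConcurrentHashMap<>();
    private final AtomicInteger created = new AtomicInteger();

    FakeExecutor getOrCreateExecutorFor(String platform) {
        return executors.computeIfAbsent(platform, p -> {
            created.incrementAndGet();
            return new FakeExecutor(p);
        });
    }

    int createdCount() {
        return created.get();
    }

    /** Starts the given number of threads that all ask for the same two executors. */
    static int runDemo(int threads) {
        ExecutorRegistrySketch registry = new ExecutorRegistrySketch();
        CountDownLatch start = new CountDownLatch(1);
        Thread[] workers = new Thread[threads];
        for (int i = 0; i < threads; i++) {
            workers[i] = new Thread(() -> {
                try {
                    start.await(); // release all threads at once to provoke contention
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    return;
                }
                registry.getOrCreateExecutorFor("postgres");
                registry.getOrCreateExecutorFor("spark");
            });
            workers[i].start();
        }
        start.countDown();
        try {
            for (Thread worker : workers) {
                worker.join();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return registry.createdCount();
    }

    public static void main(String[] args) {
        System.out.println("executors created: " + runDemo(8));
    }
}
```

One caveat: the mapping function passed to `ConcurrentHashMap.computeIfAbsent` must not modify the same map itself, so if executor creation re-enters `getOrCreateExecutorFor`, a different approach (for example a synchronized method) would be needed.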