Github user tejasapatil commented on the issue:
https://github.com/apache/spark/pull/16909
@hvanhovell : Re performance:
I ran an SMB join query over two real-world tables (2 trillion rows / 40 TB and 6 million rows / 120 GB). Note that this dataset does not have skew, so no spill happened. I saw a 2-4% improvement in CPU time over the version without this PR. This did not add up, as I was expecting some regression.
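For reference, here is a minimal sketch of the kind of bucketed, sorted setup that makes the planner pick a sort-merge bucket join rather than a broadcast join (table names, column names, bucket counts, and data are made up for illustration):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("smb-join-sketch").getOrCreate()
import spark.implicits._

// Force a sort-merge join by disabling broadcast joins.
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", -1)

// Toy stand-ins for the two real-world tables.
val largeDf = (1 to 100000).map(i => (i % 1000, s"l$i")).toDF("join_key", "payload")
val smallDf = (1 to 1000).map(i => (i % 1000, s"s$i")).toDF("join_key", "payload")

// Bucketing and sorting both sides on the join key lets the join read
// pre-sorted buckets directly; buffering rows for duplicate keys is where
// the spillable array in this PR comes into play.
largeDf.write.bucketBy(8, "join_key").sortBy("join_key").saveAsTable("large_table")
smallDf.write.bucketBy(8, "join_key").sortBy("join_key").saveAsTable("small_table")

val joined = spark.table("large_table").join(spark.table("small_table"), "join_key")
joined.explain()  // should show a SortMergeJoin
```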
I think allocating the array with an initial capacity of 128 (instead of starting with the default size of 16) is the sole reason for the perf gain: https://github.com/tejasapatil/spark/blob/SPARK-13450_smb_buffer_oom/sql/core/src/main/scala/org/apache/spark/sql/execution/ExternalAppendOnlyUnsafeRowArray.scala#L43. I could remove that and rerun, but the change will effectively be deployed in this form, and I wanted to see its effect over a large workload.
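To illustrate why pre-sizing alone can show up in CPU time, here is a sketch with a plain `ArrayBuffer` (not the actual `ExternalAppendOnlyUnsafeRowArray` code): growing from the default capacity doubles the backing array and copies its contents on each resize, which pre-sizing to 128 skips entirely for small groups:

```scala
import scala.collection.mutable.ArrayBuffer

// Default-sized buffer: starts at 16 slots, so filling 128 elements
// triggers grow-and-copy cycles at 16, 32, and 64 elements.
val defaultSized = new ArrayBuffer[Long]()

// Pre-sized buffer: starts at 128 slots, so the same fill does no resizing.
val preSized = new ArrayBuffer[Long](128)

(1L to 128L).foreach { i =>
  defaultSized += i
  preSized += i
}
```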
For micro-benchmarking, I made sure the test cases allocate array buffers of the same size, to keep this from interfering with the results. The numbers there look sane, as there is some regression from using the spillable array.
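As a toy illustration of that setup (not the actual benchmark code in the PR): both cases pre-size their buffers identically so allocation growth cancels out and only the per-append overhead differs:

```scala
import scala.collection.mutable.ArrayBuffer

// Crude timing helper; a real benchmark would warm up the JIT and average
// over many iterations.
def time(label: String)(block: => Unit): Unit = {
  val start = System.nanoTime()
  block
  println(f"$label: ${(System.nanoTime() - start) / 1e6}%.1f ms")
}

val n = 1000000
time("plain buffer, pre-sized") {
  val buf = new ArrayBuffer[Long](n)  // same initial capacity in both cases
  var i = 0L
  while (i < n) { buf += i; i += 1 }
}
// The second case would append the same n rows into the spillable array
// (ExternalAppendOnlyUnsafeRowArray) created with the same initial size.
```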
I see test cases in `DataFrameJoinSuite` failing on Jenkins, but they run fine for me in IntelliJ. Have you seen that before? Just wanted to know if it's a known problem before I invest more time on it.