Github user tejasapatil commented on the issue:
https://github.com/apache/spark/pull/16909
@hvanhovell : Re performance:
I ran an SMB join query over two real-world tables (2 trillion rows / 40 TB and 6 million rows / 120 GB). Note that this dataset does not have skew, so no spill happened. I saw a 2-4% improvement in CPU time over the version without this PR. This did not add up, as I was expecting some regression.
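For reference, here is a minimal sketch of the kind of bucketed, sorted setup that makes the planner pick a sort-merge bucket join rather than a broadcast join (table names, column names, bucket counts, and data are made up for illustration):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("smb-join-sketch").getOrCreate()
import spark.implicits._

// Force a sort-merge join by disabling broadcast joins.
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", -1)

// Toy stand-ins for the two real-world tables.
val largeDf = (1 to 100000).map(i => (i % 1000, s"l$i")).toDF("join_key", "payload")
val smallDf = (1 to 1000).map(i => (i % 1000, s"s$i")).toDF("join_key", "payload")

// Bucketing and sorting both sides on the join key lets the join read
// pre-sorted buckets directly; buffering rows for duplicate keys is where
// the spillable array in this PR comes into play.
largeDf.write.bucketBy(8, "join_key").sortBy("join_key").saveAsTable("large_table")
smallDf.write.bucketBy(8, "join_key").sortBy("join_key").saveAsTable("small_table")

val joined = spark.table("large_table").join(spark.table("small_table"), "join_key")
joined.explain()  // should show a SortMergeJoin
```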
I think allocating the array with an initial capacity of 128 (instead of starting with the default size of 16) is the sole reason for the perf gain: https://github.com/tejasapatil/spark/blob/SPARK-13450_smb_buffer_oom/sql/core/src/main/scala/org/apache/spark/sql/execution/ExternalAppendOnlyUnsafeRowArray.scala#L43. I could remove that and rerun, but the change will effectively be deployed in this form, and I wanted to see its effect over a large workload.
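To illustrate why pre-sizing alone can show up in CPU time, here is a sketch with a plain `ArrayBuffer` (not the actual `ExternalAppendOnlyUnsafeRowArray` code): growing from the default capacity doubles the backing array and copies its contents on each resize, which pre-sizing to 128 skips entirely for small groups:

```scala
import scala.collection.mutable.ArrayBuffer

// Default-sized buffer: starts at 16 slots, so filling 128 elements
// triggers grow-and-copy cycles at 16, 32, and 64 elements.
val defaultSized = new ArrayBuffer[Long]()

// Pre-sized buffer: starts at 128 slots, so the same fill does no resizing.
val preSized = new ArrayBuffer[Long](128)

(1L to 128L).foreach { i =>
  defaultSized += i
  preSized += i
}
```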
For micro-benchmarking, I made sure the test cases allocate array buffers of the same size, to keep this from interfering with the results. The numbers there look sane, as there is some regression from using the spillable array.
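As a toy illustration of that setup (not the actual benchmark code in the PR): both cases pre-size their buffers identically so allocation growth cancels out and only the per-append overhead differs:

```scala
import scala.collection.mutable.ArrayBuffer

// Crude timing helper; a real benchmark would warm up the JIT and average
// over many iterations.
def time(label: String)(block: => Unit): Unit = {
  val start = System.nanoTime()
  block
  println(f"$label: ${(System.nanoTime() - start) / 1e6}%.1f ms")
}

val n = 1000000
time("plain buffer, pre-sized") {
  val buf = new ArrayBuffer[Long](n)  // same initial capacity in both cases
  var i = 0L
  while (i < n) { buf += i; i += 1 }
}
// The second case would append the same n rows into the spillable array
// (ExternalAppendOnlyUnsafeRowArray) created with the same initial size.
```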
I see test cases in `DataFrameJoinSuite` failing on Jenkins, but they run fine for me in IntelliJ. Have you seen that before? Just wanted to know if it's a known problem before I invest more time on it.