EnricoMi commented on PR #36150:
URL: https://github.com/apache/spark/pull/36150#issuecomment-1108469324

   I have run benchmarks to compare the proposed `Expand` implementation to `explode`, `stack` and `flatMap`.
   
   First, the DataFrame is constructed with N rows and M columns, then `.cache().count()` is called to materialize it in memory (see "caching" below). A subsequent `.count()` proves that the DataFrame is cached, and also serves as a reference for the cost of counting the N rows across the respective partitions.
   
   Then I run the melt alternatives and count the rows to materialize the melted result: `.melt(…).count()`.
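
   As a sanity check on the expected output size, melting N rows with M value columns always yields N · M rows; the per-row expansion can be sketched in plain Scala (illustrative only, not Spark code):

   ```scala
   // Illustrative sketch (plain Scala, no Spark): melting one input row with
   // M value columns produces M (id, variable, value) output rows, so a
   // DataFrame of N rows melts into N * M rows.
   def meltRow(id: Long, values: Seq[(String, Double)]): Seq[(Long, String, Double)] =
     values.map { case (name, v) => (id, name, v) }

   val melted = meltRow(1L, Seq("a" -> 1.0, "b" -> 2.0, "c" -> 3.0))
   // melted == Seq((1L, "a", 1.0), (1L, "b", 2.0), (1L, "c", 3.0))
   ```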
   
   All partitions have 100 million rows before melting. Everything runs as `local[2]` with `-Xmx24G`.
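   
   For reference, the three baseline melt formulations can be sketched roughly like this (a sketch under assumptions, not the exact benchmark code: it assumes a single `id` column and value columns of a common `Double` type):
   
   ```scala
   import org.apache.spark.sql.functions._
   
   // Assumption: "id" is the only non-value column; all value columns are Double.
   val valueCols = df.columns.filter(_ != "id")
   
   // 1. explode over an array of (variable, value) structs
   val viaExplode = df
     .select(col("id"), explode(array(valueCols.map(c =>
       struct(lit(c).as("variable"), col(c).as("value"))): _*)).as("kv"))
     .select(col("id"), col("kv.variable"), col("kv.value"))
   
   // 2. stack: a generated SQL expression emitting M rows per input row
   val stackExpr = valueCols.map(c => s"'$c', $c").mkString(", ")
   val viaStack = df.selectExpr("id",
     s"stack(${valueCols.length}, $stackExpr) as (variable, value)")
   
   // 3. flatMap: typed row-wise expansion (needs spark.implicits._ for the encoder)
   import spark.implicits._
   val viaFlatMap = df.flatMap { row =>
     valueCols.map(c => (row.getAs[Long]("id"), c, row.getAs[Double](c)))
   }.toDF("id", "variable", "value")
   ```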
   
   Here are my numbers, all measurements are in milliseconds:
   
   **Number of Rows: 10 000 000**
   
   | Number of Columns |     10 |     30 |     50 |      70 |      90 |     110 |     1 000 |
   |:------------------|-------:|-------:|-------:|--------:|--------:|--------:|----------:|
   | caching           | 20 387 | 42 260 | 72 858 | 112 506 | 127 460 | 177 984 | 1 855 548 |
   | count             |    146 |     95 |     75 |      74 |      75 |      83 |       657 |
   | expand            |  1 450 |  2 159 |  3 520 |  10 071 |  13 366 |  24 816 |   745 043 |
   | explode           |  1 405 |  1 655 |  2 775 |   9 835 |  10 515 | 106 724 |   828 351 |
   | stack             |  4 464 | 10 369 | 16 859 |  50 667 |  64 555 |  79 302 |   650 160 |
   | flatMap           |  8 707 | 15 463 | 24 404 |  39 295 |  43 314 |  51 184 |   587 622 |
   
   
   **Number of Rows: 100 000 000**
   
   | Number of Columns |     10 |      30 |      50 |        70 |        90 |       110 |
   |:------------------|-------:|--------:|--------:|----------:|----------:|----------:|
   | caching           | 95 800 | 330 990 | 628 537 | 1 020 876 | 1 289 393 | 1 731 191 |
   | count             |    319 |     304 |     325 |       302 |       319 |       381 |
   | expand            |  4 591 |  14 410 |  26 815 |   101 935 |   122 635 |   223 855 |
   | explode           |  4 260 |  11 336 |  21 475 |    81 064 |   102 045 | 1 074 247 |
   | stack             | 12 359 |  76 112 | 134 372 |   419 610 |   591 428 |   717 853 |
   | flatMap           | 37 449 | 118 105 | 209 193 |   303 408 |   417 899 |   488 466 |
   
   The `explode` approach is the most efficient below 100 columns, but degrades badly beyond that. The `stack` approach degrades above 50 columns (possibly because code generation only applies below 50 columns). The `flatMap` approach is the slowest below 70 columns, but it scales best: it overtakes `stack` from 70 columns on and is the fastest of all at 1 000 columns. Overall, `expand` is comparable to `explode`, but more stable as the number of columns grows.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

