RussellSpitzer commented on pull request #3983: URL: https://github.com/apache/iceberg/pull/3983#issuecomment-1072508346
After changing the default size of all types to 8 bytes and randomizing the shuffle input, I get some different perf results. It seems like the pattern of the data is more important to the sort times than the amount of data in the sort field?

Commit: 91a76010869e68633d598d718fb7929f6b531a71

```
Benchmark                                         Mode  Cnt    Score    Error  Units
IcebergSortCompactionBenchmark.sortFourColumns      ss   10   86.675 ±  8.177   s/op
IcebergSortCompactionBenchmark.sortInt              ss   10   74.262 ±  6.732   s/op
IcebergSortCompactionBenchmark.sortInt2             ss   10   72.515 ±  8.014   s/op
IcebergSortCompactionBenchmark.sortInt3             ss   10   76.228 ±  4.590   s/op
IcebergSortCompactionBenchmark.sortInt4             ss   10   74.730 ±  4.267   s/op
IcebergSortCompactionBenchmark.sortSixColumns       ss   10   75.933 ±  5.042   s/op
IcebergSortCompactionBenchmark.sortString           ss   10   77.488 ±  6.804   s/op
IcebergSortCompactionBenchmark.zSortFourColumns     ss   10  277.954 ± 73.324   s/op
IcebergSortCompactionBenchmark.zSortInt             ss   10  327.105 ± 17.098   s/op
IcebergSortCompactionBenchmark.zSortInt2            ss   10  328.217 ± 13.099   s/op
IcebergSortCompactionBenchmark.zSortInt3            ss   10  342.404 ± 15.660   s/op
IcebergSortCompactionBenchmark.zSortInt4            ss   10  344.997 ± 16.342   s/op
IcebergSortCompactionBenchmark.zSortSixColumns      ss   10  295.686 ± 62.866   s/op
IcebergSortCompactionBenchmark.zSortString          ss   10  333.303 ± 15.966   s/op
```

What is odd to me here is that the sort time for Strings is now basically the same as for integers: all of our zorderings take about the same amount of time, and so do all of our sorts without zorder.

What is more interesting to me is that for ZOrdering, adding columns basically just increases the ZORDER output byte size and has no effect on the comparison time. For Strings maybe this makes sense, but for zSortInt, zSortInt2, zSortInt3 and zSortInt4 I would have expected things to take different amounts of time. Perhaps with a totally random layout of data, the significant bits to compare on average always appear in the same location for ZOrder, regardless of the number of interleaved columns?
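One way to sanity-check that last guess outside of Spark is a small standalone sketch like the one below. It is not the code from this PR: `ZOrderCompareSketch`, `interleaveBits` and the other helpers are hypothetical names written just for illustration, and the bit interleaving is a simplified stand-in for the real implementation. It assumes 8-byte columns filled with uniformly random bytes and only reports where a lexicographic byte comparison of two z-values would stop. It says nothing about shuffle or serialization cost.

```java
import java.util.Random;

// Illustrative only: a simplified bit interleave, not the implementation from this PR.
class ZOrderCompareSketch {

  // Interleaves equal-width columns bit by bit, most significant bit first:
  // bit 0 of column 0, bit 0 of column 1, ..., then bit 1 of column 0, and so on.
  static byte[] interleaveBits(byte[][] columns) {
    int width = columns[0].length;
    int numCols = columns.length;
    byte[] out = new byte[width * numCols];
    int outBit = 0;
    for (int bit = 0; bit < width * 8; bit++) {
      for (int c = 0; c < numCols; c++) {
        int value = (columns[c][bit / 8] >> (7 - (bit % 8))) & 1;
        out[outBit / 8] |= (byte) (value << (7 - (outBit % 8)));
        outBit++;
      }
    }
    return out;
  }

  // Index of the first byte where a and b differ, i.e. where a
  // lexicographic byte comparison can stop. Returns a.length if equal.
  static int firstDifferingByte(byte[] a, byte[] b) {
    for (int i = 0; i < a.length; i++) {
      if (a[i] != b[i]) {
        return i;
      }
    }
    return a.length;
  }

  // Builds a z-value from numCols random 8-byte columns (assumed column width).
  static byte[] randomZValue(Random random, int numCols) {
    byte[][] columns = new byte[numCols][8];
    for (byte[] column : columns) {
      random.nextBytes(column);
    }
    return interleaveBits(columns);
  }

  // Average stopping position over many random pairs of z-values.
  static double averageFirstDifferingByte(int numCols, int trials, long seed) {
    Random random = new Random(seed);
    long total = 0;
    for (int t = 0; t < trials; t++) {
      byte[] left = randomZValue(random, numCols);
      byte[] right = randomZValue(random, numCols);
      total += firstDifferingByte(left, right);
    }
    return (double) total / trials;
  }

  public static void main(String[] args) {
    for (int numCols = 1; numCols <= 4; numCols++) {
      System.out.printf(
          "columns=%d zBytes=%d avgFirstDifferingByte=%.4f%n",
          numCols, numCols * 8, averageFirstDifferingByte(numCols, 100_000, 42L));
    }
  }
}
```

With uniformly random input, two z-values already differ in their first byte about 255 times out of 256, so the average stopping position should stay close to zero whether the z-value is 8 or 32 bytes long. That would be consistent with the comparison cost not growing with the number of interleaved columns, although it does not explain on its own why the larger output has no visible cost elsewhere.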
