steveloughran commented on issue #15628: URL: https://github.com/apache/iceberg/issues/15628#issuecomment-4097111735
@RussellSpitzer Nesting is creating more hashmaps, which allocate map space.

I've been doing the Parquet version of this test suite too, using this file as the foundation: https://github.com/apache/parquet-java/pull/3452

There I've noticed that the timer granularity on ARM CPUs is nowhere near as good as the x86 `rdtsc` opcode (ignoring core/socket variations there), so the very small operations need to be repeated multiple times within a single benchmark method. I'll apply that here too. Essentially: microbenchmarks need to take milliseconds to minimise the effect of clock granularity. JMH does at least pin the benchmark to a core, so cross-socket and cross-core issues (older Intel silicon) don't surface. And by only working on the same struct, the data should all be in L1, so data-cache issues _shouldn't_ be a problem, and nor should NUMA. *Modern multisocket servers are the new datacentre.*

What I'm doing now:

## parquet-benchmark module

Row write with and without shredding. Assesses the cost of write complexity and makes sure there is no problem lurking right at the bottom.

## iceberg-spark module: table-level benchmark

I'm creating two tables with the same content, then running queries over them. I need to produce an uneven spread of numbers so that Iceberg can use the shredded metadata to filter files; without that filtering, the shredded table is coming off slower right now, presumably because of extra work underneath. Spark Parquet vectorization currently doesn't work with projection on shredded variants, so I'm turning vectorization off on both tables for consistency.
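For reference, a sketch of the kind of toggles involved in turning vectorization off on both tables. The property names are my recollection of the Spark and Iceberg configuration docs, not taken from this PR, so check them against the versions actually in use:

```properties
# Spark session config: disable the vectorized Parquet reader globally
spark.sql.parquet.enableVectorizedReader=false

# Iceberg table property: disable vectorized Parquet reads per table
read.parquet.vectorization.enabled=false
```

Setting both keeps the two tables on the same (non-vectorized) read path, so the benchmark comparison isn't skewed by one table falling back to row-by-row reads while the other stays vectorized.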
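On the clock-granularity point above: a minimal, self-contained sketch (plain Java, not the actual JMH benchmark; class and method names are mine) showing why a sub-microsecond operation has to be repeated many times within a single measured method:

```java
public class ClockGranularitySketch {

    /** Smallest observable step of System.nanoTime() on this machine. */
    static long granularityNanos() {
        long t0 = System.nanoTime();
        long t1;
        // spin until the clock actually advances
        do {
            t1 = System.nanoTime();
        } while (t1 == t0);
        return t1 - t0;
    }

    /** Time `reps` repetitions of a tiny operation as one measurement. */
    static long timeRepeatedNanos(int reps) {
        long start = System.nanoTime();
        long acc = 0;
        for (int i = 0; i < reps; i++) {
            acc += Integer.rotateLeft(i, 7); // stand-in for the tiny op under test
        }
        long elapsed = System.nanoTime() - start;
        // consume acc so the loop cannot be optimised away entirely
        if (acc == Long.MIN_VALUE) {
            System.out.println("unreachable");
        }
        return elapsed;
    }

    public static void main(String[] args) {
        // A single tiny op can complete inside one clock tick and read as 0 ns;
        // repeating it a million times pushes the measurement well above the
        // clock's granularity, which is the point being made above.
        System.out.println("clock granularity ~" + granularityNanos() + " ns");
        System.out.println("1M reps took " + timeRepeatedNanos(1_000_000) + " ns");
    }
}
```

JMH's `@OperationsPerInvocation` serves the same purpose inside a real benchmark method, dividing the measured time back down to a per-operation cost.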
