steveloughran commented on issue #15628:
URL: https://github.com/apache/iceberg/issues/15628#issuecomment-4097111735

   @RussellSpitzer Nesting creates more hashmaps, each of which allocates map space. 
   
   I've been doing the parquet version of this test suite too, using this file 
as the foundation.
   
   https://github.com/apache/parquet-java/pull/3452
   
   There I've noticed that timer granularity on ARM CPUs is nowhere near as good as 
the x86 `rdtsc` opcode (ignoring core/socket variations there), so very small 
operations need to be repeated many times within a single benchmark method. 
I'll apply that here too. Essentially: microbenchmarks need to take 
milliseconds to minimise the effect of clock granularity. JMH does at least pin the 
benchmark to a core, so cross-socket and cross-core issues (older Intel silicon) don't 
surface. And since each benchmark only works on the same struct, the data should all 
stay in L1, so data cache misses _shouldn't_ be an issue, and hence neither is NUMA. 
*Modern multisocket servers are the new datacentre*.
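
   For reference, here's a minimal sketch of that repetition trick (names are illustrative, not from either PR): loop the tiny operation inside one `@Benchmark` method so the invocation takes long enough to swamp timer granularity, and declare the count with `@OperationsPerInvocation` so JMH still reports the per-operation cost.

   ```java
   import java.util.concurrent.TimeUnit;

   import org.openjdk.jmh.annotations.Benchmark;
   import org.openjdk.jmh.annotations.BenchmarkMode;
   import org.openjdk.jmh.annotations.Mode;
   import org.openjdk.jmh.annotations.OperationsPerInvocation;
   import org.openjdk.jmh.annotations.OutputTimeUnit;
   import org.openjdk.jmh.annotations.Scope;
   import org.openjdk.jmh.annotations.State;
   import org.openjdk.jmh.infra.Blackhole;

   @State(Scope.Benchmark)
   public class TinyOpBenchmark {

     // enough repeats that one invocation is far longer than a clock tick
     private static final int REPEATS = 10_000;

     private int value = 42;

     @Benchmark
     @BenchmarkMode(Mode.AverageTime)
     @OutputTimeUnit(TimeUnit.NANOSECONDS)
     @OperationsPerInvocation(REPEATS)
     public void tinyOpRepeated(Blackhole bh) {
       for (int i = 0; i < REPEATS; i++) {
         // hypothetical stand-in for the sub-nanosecond operation under test;
         // Blackhole stops the JIT from eliding the loop body
         bh.consume(Integer.rotateLeft(value, i & 31));
       }
     }
   }
   ```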
   
   What I'm doing now:
   
   ## parquet-benchmark module: row write with/without shredding
   
   Assesses the cost of write complexity and makes sure there is no problem 
lurking right at the bottom. A minimal sketch of the benchmark shape follows.
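
   Something like this, hedged: `writeRow()` below is a placeholder, not the actual parquet-java variant write path, since that API is what's under construction in the PR. The `@Param` lets both write paths run under identical JMH conditions.

   ```java
   import java.util.concurrent.TimeUnit;

   import org.openjdk.jmh.annotations.Benchmark;
   import org.openjdk.jmh.annotations.BenchmarkMode;
   import org.openjdk.jmh.annotations.Mode;
   import org.openjdk.jmh.annotations.OperationsPerInvocation;
   import org.openjdk.jmh.annotations.OutputTimeUnit;
   import org.openjdk.jmh.annotations.Param;
   import org.openjdk.jmh.annotations.Setup;
   import org.openjdk.jmh.annotations.Scope;
   import org.openjdk.jmh.annotations.State;
   import org.openjdk.jmh.infra.Blackhole;

   @State(Scope.Benchmark)
   @BenchmarkMode(Mode.AverageTime)
   @OutputTimeUnit(TimeUnit.MICROSECONDS)
   public class VariantRowWriteBench {

     private static final int ROWS_PER_INVOCATION = 1_000;

     // run the shredded and unshredded paths as separate benchmark runs
     @Param({"shredded", "unshredded"})
     public String mode;

     private boolean shredded;

     @Setup
     public void setup() {
       shredded = "shredded".equals(mode);
     }

     @Benchmark
     @OperationsPerInvocation(ROWS_PER_INVOCATION)
     public void writeRows(Blackhole bh) {
       for (int i = 0; i < ROWS_PER_INVOCATION; i++) {
         bh.consume(writeRow(i, shredded));
       }
     }

     // placeholder: the real benchmark would build a variant value and push
     // it through the parquet row writer with shredding on or off
     private long writeRow(int row, boolean shredded) {
       return shredded ? row * 31L : row;
     }
   }
   ```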
   
   ## iceberg spark module: table-level benchmark
   
   I'm creating two tables with the same content, then running queries over them. I 
need to produce an uneven spread of values so that Iceberg can use the shredded 
metadata to filter files; without that filtering, the shredded table is currently 
coming off slower, presumably from extra work underneath. Spark Parquet 
vectorization currently doesn't work with projection on shredded variants, so 
I'm turning vectorization off on both tables for consistency. 
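
   A sketch of that setup, with assumptions flagged: I believe `read.parquet.vectorization.enabled` is the Iceberg table property for the vectorization switch, and the table/column names (`bench.plain_tbl`, `bench.shredded_tbl`, `bench.source`, `filter_key`) are hypothetical.

   ```java
   import org.apache.spark.sql.SparkSession;

   public class VariantTableSetup {
     public static void main(String[] args) {
       SparkSession spark = SparkSession.builder()
           .appName("variant-shredding-bench-setup")
           .getOrCreate();

       // hypothetical table names; both tables get identical content
       for (String table : new String[] {"bench.plain_tbl", "bench.shredded_tbl"}) {
         // disable Parquet vectorized reads on both tables so the
         // comparison is scan path for scan path
         spark.sql("ALTER TABLE " + table
             + " SET TBLPROPERTIES ('read.parquet.vectorization.enabled'='false')");

         // write the data globally sorted by the filter key so each data
         // file covers a narrow, distinct value range: that uneven spread
         // is what lets column min/max stats prune whole files
         spark.sql("INSERT INTO " + table
             + " SELECT * FROM bench.source ORDER BY filter_key");
       }
     }
   }
   ```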

