tustvold opened a new issue, #4812: URL: https://github.com/apache/arrow-rs/issues/4812
**Is your feature request related to a problem or challenge? Please describe what you are trying to do.**

Currently the row format pads variable-length payloads to 32-byte blocks. This is performant and easy to reason about, but is very inefficient for small strings.

**Describe the solution you'd like**

Instead of every block having the same size, I propose that the first few blocks have a smaller size. In particular:

- 0th block - 4 bytes
- 1st block - 8 bytes
- 2nd block - 16 bytes
- Remaining blocks - 32 bytes

This would drastically reduce the space amplification for small strings, reducing memory usage and potentially yielding faster comparisons.

**Describe alternatives you've considered**

**Additional context**

#4811 proposes removing the dictionary interning, which would likely make this optimisation more important.
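To make the space trade-off concrete, here is a minimal sketch comparing the padded payload size under uniform 32-byte blocks with the proposed 4/8/16/32 schedule. The function names `padded_len_uniform` and `padded_len_proposed` are hypothetical and not part of arrow-rs. The sketch counts payload bytes only and assumes no per-block marker bytes, so it illustrates the arithmetic rather than the actual encoder.

```rust
/// Sizes of the first three blocks as proposed in this issue.
const SMALL_BLOCKS: [usize; 3] = [4, 8, 16];
/// Size of every remaining block, and of every block in the uniform scheme.
const BLOCK_SIZE: usize = 32;

/// Padded payload length when every block is 32 bytes.
/// Hypothetical helper: counts payload only, no marker bytes.
pub fn padded_len_uniform(len: usize) -> usize {
    // Round up to the next multiple of BLOCK_SIZE.
    (len + BLOCK_SIZE - 1) / BLOCK_SIZE * BLOCK_SIZE
}

/// Padded payload length when the first blocks are 4, 8 and 16 bytes
/// and all later blocks are 32 bytes.
/// Hypothetical helper: counts payload only, no marker bytes.
pub fn padded_len_proposed(len: usize) -> usize {
    let mut remaining = len;
    let mut padded = 0;
    for &block in SMALL_BLOCKS.iter() {
        if remaining == 0 {
            // The payload fit entirely in the blocks written so far.
            return padded;
        }
        padded += block;
        remaining = remaining.saturating_sub(block);
    }
    // Whatever is left is padded to whole 32-byte blocks.
    padded + (remaining + BLOCK_SIZE - 1) / BLOCK_SIZE * BLOCK_SIZE
}
```

Under these assumptions a 1-byte string pads to 4 bytes instead of 32, and a 13-byte string to 28 instead of 32. The saving is not uniform: in this sketch, lengths 29 to 32 pad to 60 bytes rather than 32, because the three small blocks hold only 28 bytes in total.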
