pingtimeout opened a new pull request, #107: URL: https://github.com/apache/polaris-tools/pull/107
This PR adds a component to the Polaris benchmarks that replaces the short, highly predictable names used for namespaces (e.g. `NS_1`, `NS_2`, ...), tables (e.g. `T_1`, `T_2`, ...) and views (e.g. `V_1`, `V_2`, ...) with the MD5 hash of each name. The catalog name is never mangled. This behaviour is disabled by default.

When it is enabled, every entity has a 32-character name composed of `[a-z0-9]` characters. The names look random but are deterministic, like every name computed by the benchmarks.

The mangled names are far less compression-friendly, which was one of the core goals of this initiative. With the default naming pattern, the names are short and all share the same prefix, so they compress very well and may yield a misleadingly low estimate (a lower bound) during database size capacity planning. The mangled names are intended to give an upper bound for that exercise, given that in typical catalogs most names are shorter than 32 characters and use dictionary words.

The names mangler was also used to discover #3243, so once this PR is merged there will be a trivial repro of that issue.

Here is what it looks like for a simple tree:

<img width="748" height="609" alt="Capture d’écran 2025-12-18 à 10 08 43" src="https://github.com/user-attachments/assets/2d636a5c-bd33-4409-848a-8ee775c0e7cb" />
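To illustrate the idea, here is a minimal Python sketch of MD5-based name mangling and of its effect on compressibility. It is not the code from this PR: the helpers `mangle` and `compressed_ratio` are hypothetical names, and hashing the bare UTF-8 name and using zlib as the compressor are assumptions made for the example.

```python
import hashlib
import zlib


def mangle(name: str, enabled: bool = True) -> str:
    """Return the MD5 hex digest of `name`, or `name` unchanged when disabled.

    Hypothetical helper: assumes the bare name is hashed as UTF-8.
    """
    if not enabled:
        return name
    return hashlib.md5(name.encode("utf-8")).hexdigest()


def compressed_ratio(names: list[str]) -> float:
    """Compressed size divided by raw size for the newline-joined names."""
    raw = "\n".join(names).encode("utf-8")
    return len(zlib.compress(raw)) / len(raw)


# Default-style names share a prefix and differ only in a counter.
plain = [f"NS_{i}" for i in range(1, 1001)]
# Mangled names are 32 hex characters each and deterministic.
mangled = [mangle(n) for n in plain]

if __name__ == "__main__":
    print(plain[0], "->", mangled[0])
    print("plain ratio:  ", round(compressed_ratio(plain), 3))
    print("mangled ratio:", round(compressed_ratio(mangled), 3))
```

Under these assumptions, the mangled list should compress noticeably worse than the plain one, which is the property the capacity-planning upper bound relies on.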
