pingtimeout opened a new pull request, #107:
URL: https://github.com/apache/polaris-tools/pull/107

   This PR adds a component to the Polaris benchmarks that replaces the short, 
very predictable names of namespaces (e.g. `NS_1`, `NS_2`, ...), tables 
(e.g. `T_1`, `T_2`, ...) and views (e.g. `V_1`, `V_2`, ...) with the MD5 hash 
of each name.  The catalog name is never mangled.
   
   This behaviour is disabled by default.
   
   When it is enabled, all entities have a 32-character name, composed of 
`[a-z0-9]` characters.  The names look random but they are deterministic, like 
every name computed by the benchmarks.
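
   As an illustration of the idea (not the code from this PR), here is a 
minimal Python sketch.  It assumes the mangled name is the lowercase hex MD5 
digest of the plain name alone; `mangle`, `entity_name` and the 
`mangling_enabled` flag are hypothetical names, and the actual component may 
hash a different input or be configured differently.

```python
import hashlib


def mangle(name: str) -> str:
    """Return the 32-character lowercase hex MD5 digest of `name`.

    Assumption: only the plain entity name is hashed (no parent path, no salt).
    """
    return hashlib.md5(name.encode("utf-8")).hexdigest()


def entity_name(name: str, mangling_enabled: bool = False) -> str:
    """Hypothetical toggle: names are left untouched unless mangling is enabled."""
    return mangle(name) if mangling_enabled else name


if __name__ == "__main__":
    # The same input always yields the same mangled name.
    for plain in ("NS_1", "T_1", "V_1"):
        print(plain, "->", entity_name(plain, mangling_enabled=True))
```

   Because the digest depends only on its input, two benchmark runs over the 
same dataset produce the same mangled names.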
   
   The mangled names are a lot less compression-friendly, which was one of the 
core goals of this initiative.  With the default naming pattern, the names are 
both short and all start with the same prefix.  So they compress very nicely 
and may give a misleadingly low estimate (a lower bound) when doing database 
size capacity planning.  With the mangled names, the intent is to give an upper 
bound for that capacity planning exercise, considering that in typical 
catalogs, most names will be shorter than 32 characters and will use dictionary 
words.
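
   The compression argument can be checked with a rough Python sketch.  It 
uses `zlib` as a stand-in for whatever compression a database actually applies, 
and 10,000 `T_<n>` names as an arbitrary sample, so the figures it prints only 
indicate the direction of the effect and are not a sizing estimate.

```python
import hashlib
import zlib


def compressed_size(names):
    """Size in bytes of the zlib-compressed, newline-joined names."""
    raw = "\n".join(names).encode("utf-8")
    return len(zlib.compress(raw, 9))


# Default-style names share a prefix and differ only by a counter.
plain_names = [f"T_{i}" for i in range(1, 10_001)]
# Assumption: a mangled name is the hex MD5 digest of the plain name.
mangled_names = [hashlib.md5(n.encode("utf-8")).hexdigest() for n in plain_names]

if __name__ == "__main__":
    for label, names in (("plain", plain_names), ("mangled", mangled_names)):
        raw = len("\n".join(names).encode("utf-8"))
        print(f"{label}: raw={raw} bytes, compressed={compressed_size(names)} bytes")
```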
   
   The name mangler was also used to discover #3243.  So once this PR is 
merged, there will be a trivial repro of that issue.
   
   Here is what it looks like for a simple tree:
   
   <img width="748" height="609" alt="Screenshot 2025-12-18 at 10 08 43" 
src="https://github.com/user-attachments/assets/2d636a5c-bd33-4409-848a-8ee775c0e7cb" />


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

