Re: [PR] Specialize ASCII case for substr() [datafusion]

via GitHub Sat, 14 Sep 2024 00:51:25 -0700


2010YOUY01 commented on PR #12444:
URL: https://github.com/apache/datafusion/pull/12444#issuecomment-2350899067


   > I am in general somewhat lukewarm on adding optimizations that make some 
queries faster and some slower (as it then becomes a tradeoff, and different 
users might have different tradeoffs).
   > 
   > It would be great to figure out how to avoid this tradeoff (I left one 
suggestion)
   
   I think this regression is fixable in the long term by making the ASCII 
check more efficient (the current check is especially inefficient for 
`StringView`), but it's a good idea to be more conservative and skip ASCII 
validation for small prefixes for now.
   I applied this suggestion and benchmarked again, and I see no noticeable 
ASCII check overhead:
   
   Result:
   `substr_before` is current main, which already has the `StringView` 
optimization to avoid copying
   `substr_after` is this PR with the additional ASCII fast path
   ```
   group                                                                              substr_after                           substr_before
   -----                                                                              ------------                           -------------
   LONGER THAN 12/substr_large_string [size=1024, count=64, strlen=128]               1.00     74.1±1.13µs        ? ?/sec    2.65    196.4±1.32µs        ? ?/sec
   LONGER THAN 12/substr_large_string [size=4096, count=64, strlen=128]               1.00    290.6±1.16µs        ? ?/sec    2.68   779.1±17.07µs        ? ?/sec
   LONGER THAN 12/substr_string [size=1024, count=64, strlen=128]                     1.00     72.9±0.25µs        ? ?/sec    2.91   212.2±13.48µs        ? ?/sec
   LONGER THAN 12/substr_string [size=4096, count=64, strlen=128]                     1.00    285.0±1.72µs        ? ?/sec    2.99   852.6±67.06µs        ? ?/sec
   LONGER THAN 12/substr_string_view [size=1024, count=64, strlen=128]                1.00     29.7±0.17µs        ? ?/sec    5.61   166.5±24.98µs        ? ?/sec
   LONGER THAN 12/substr_string_view [size=4096, count=64, strlen=128]                1.00    117.8±0.92µs        ? ?/sec    5.29   623.4±29.53µs        ? ?/sec
   SHORTER THAN 12/substr_large_string [size=1024, strlen=12]                         1.00     59.0±0.67µs        ? ?/sec    1.15     67.8±1.30µs        ? ?/sec
   SHORTER THAN 12/substr_large_string [size=4096, strlen=12]                         1.00    228.5±2.10µs        ? ?/sec    1.26   289.0±25.86µs        ? ?/sec
   SHORTER THAN 12/substr_string [size=1024, strlen=12]                               1.00     55.3±0.46µs        ? ?/sec    1.06     58.5±3.18µs        ? ?/sec
   SHORTER THAN 12/substr_string [size=4096, strlen=12]                               1.00    214.8±1.59µs        ? ?/sec    1.04    222.4±4.55µs        ? ?/sec
   SHORTER THAN 12/substr_string_view [size=1024, strlen=12]                          1.00     18.2±0.09µs        ? ?/sec    1.27     23.0±0.49µs        ? ?/sec
   SHORTER THAN 12/substr_string_view [size=4096, strlen=12]                          1.00     73.5±1.79µs        ? ?/sec    1.44   105.8±11.82µs        ? ?/sec
   SRC_LEN > 12, SUB_LEN < 12/substr_large_string [size=1024, count=6, strlen=128]    1.00     75.9±0.40µs        ? ?/sec    1.04     78.8±3.79µs        ? ?/sec
   SRC_LEN > 12, SUB_LEN < 12/substr_large_string [size=4096, count=6, strlen=128]    1.00    297.4±2.70µs        ? ?/sec    1.01    299.3±8.54µs        ? ?/sec
   SRC_LEN > 12, SUB_LEN < 12/substr_string [size=1024, count=6, strlen=128]          1.00     77.8±0.24µs        ? ?/sec    1.07    83.4±10.36µs        ? ?/sec
   SRC_LEN > 12, SUB_LEN < 12/substr_string [size=4096, count=6, strlen=128]          1.04    300.9±1.48µs        ? ?/sec    1.00    289.1±3.56µs        ? ?/sec
   SRC_LEN > 12, SUB_LEN < 12/substr_string_view [size=1024, count=6, strlen=128]     1.06     33.3±0.63µs        ? ?/sec    1.00     31.5±0.15µs        ? ?/sec
   SRC_LEN > 12, SUB_LEN < 12/substr_string_view [size=4096, count=6, strlen=128]     1.00    129.8±2.23µs        ? ?/sec    1.01   130.8±13.20µs        ? ?/sec
   ```
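
For readers who want the shape of the fast path, here is a minimal sketch. It is not the code in this PR: `substr_sketch` and the `SHORT_PREFIX_LIMIT` cutoff are hypothetical names and values chosen for illustration, and it works on a plain `&str` rather than Arrow arrays. The idea is that for ASCII input, character offsets equal byte offsets, so the substring can be sliced directly. For short prefixes the ASCII scan is skipped, on the assumption that walking a few characters is cheaper than scanning the whole string.

```rust
/// Hypothetical cutoff (an assumption, not a value from the PR): if the
/// requested range ends within this many characters, skip the ASCII scan.
const SHORT_PREFIX_LIMIT: usize = 12;

/// Simplified SQL-style substr: 1-based `start`, `count` characters.
/// Hypothetical helper for illustration only.
fn substr_sketch(s: &str, start: usize, count: usize) -> &str {
    let skip = start.saturating_sub(1);
    let last_char = skip.saturating_add(count);

    // ASCII fast path: one byte per character, so slice by byte offset.
    // Only taken when the range is long enough to justify scanning `s`.
    if last_char > SHORT_PREFIX_LIMIT && s.is_ascii() {
        let begin = skip.min(s.len());
        let end = last_char.min(s.len());
        return &s[begin..end];
    }

    // General path: walk UTF-8 character boundaries to find byte offsets.
    let begin = s.char_indices().nth(skip).map_or(s.len(), |(i, _)| i);
    let end = s[begin..]
        .char_indices()
        .nth(count)
        .map_or(s.len(), |(i, _)| begin + i);
    &s[begin..end]
}
```

With this sketch, `substr_sketch("hello world", 1, 5)` takes the general path because the range is short, while a 64-character range over a 128-byte ASCII string takes the byte-slicing path.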
   
   > The other thing I keep thinking is how can we avoid this 'is_ascii' check 
at runtime (so things get faster regardless). Maybe it is time to consider 
starting to propagate the is_ascii flag on the arrays themselves
   > 
   > The parquet reader, for example, knows when it has only ascii data
   
   I think it's a good idea. 
   I'm curious (and it would also help justify the extra complexity): is your 
(InfluxDB) real workload dominated by string data? I saw somewhere that 
Databricks and Tableau said more than 50% of their production workload is 
string data, much of it used as a substitute for UDTs, plus uncleaned raw 
data. For such cases it should be worth the effort.
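
To make the `is_ascii` flag idea concrete, here is a rough sketch under heavy assumptions. `AsciiTaggedColumn` is a hypothetical type, not an Arrow or DataFusion API; a real design would attach the flag to the array (or have a reader such as the parquet reader supply it) instead of recomputing it. The point is only that the scan happens once, at construction, and every later kernel call can branch on the cached flag.

```rust
/// Hypothetical string column that remembers whether all values are ASCII.
struct AsciiTaggedColumn {
    values: Vec<String>,
    is_ascii: bool,
}

impl AsciiTaggedColumn {
    /// Scan once at construction. A reader that already knows the data is
    /// ASCII could set the flag without scanning (assumption).
    fn new(values: Vec<String>) -> Self {
        let is_ascii = values.iter().all(|v| v.is_ascii());
        Self { values, is_ascii }
    }

    fn is_ascii(&self) -> bool {
        self.is_ascii
    }

    /// First `n` characters of row `row`, with no per-call ASCII scan.
    fn prefix(&self, row: usize, n: usize) -> &str {
        let v = self.values[row].as_str();
        if self.is_ascii {
            // Character offsets equal byte offsets for ASCII data.
            &v[..n.min(v.len())]
        } else {
            // Fall back to walking UTF-8 character boundaries.
            let end = v.char_indices().nth(n).map_or(v.len(), |(i, _)| i);
            &v[..end]
        }
    }
}
```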


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
