ahsanabbas123 opened a new issue, #48761:
URL: https://github.com/apache/arrow/issues/48761

   ### Describe the bug, including details regarding any error messages, 
version, and platform.
   
   When serialising pyarrow expressions using substrait, the size of the 
serialised buffer increases exponentially with the number of binary operation 
clauses in the expression due to extension URIs being repeated again for each 
nested expression.  
   
   Testing with pyarrow 21.0.0: 
   ```
   
================================================================================
   Testing with 1 OR condition(s)
   
================================================================================
   Expression:  (int_col == 1)
   Substrait size: 201 bytes
   Total extension URIs: 1
   
   
================================================================================
   Testing with 2 OR condition(s)
   
================================================================================
   Expression:  ((int_col == 1) or (int_col == 2))
   Substrait size: 634 bytes
   Total extension URIs: 5
   
   
================================================================================
   Testing with 4 OR condition(s)
   
================================================================================
   Expression:  ((((int_col == 1) or (int_col == 2)) or (int_col == 3)) or 
(int_col == 4))
   Substrait size: 2967 bytes
   Total extension URIs: 29
   
   
================================================================================
   Testing with 6 OR condition(s)
   
================================================================================
   Expression:  ((((((int_col == 1) or (int_col == 2)) or (int_col == 3)) or 
(int_col == 4)) or (int_col == 5)) or (int_col == 6))
   Substrait size: 12017 bytes
   Total extension URIs: 125
   
   
================================================================================
   Testing with 8 OR condition(s)
   
================================================================================
   Expression:  ((((((((int_col == 1) or (int_col == 2)) or (int_col == 3)) or 
(int_col == 4)) or (int_col == 5)) or (int_col == 6)) or (int_col == 7)) or 
(int_col == 8))
   Substrait size: 48306 bytes
   Total extension URIs: 509
   
   
================================================================================
   Testing with 10 OR condition(s)
   
================================================================================
   Expression:  ((((((((((int_col == 1) or (int_col == 2)) or (int_col == 3)) 
or (int_col == 4)) or (int_col == 5)) or (int_col == 6)) or (int_col == 7)) or 
(int_col == 8)) or (int_col == 9)) or (int_col == 10))
   Substrait size: 193172 bytes
   Total extension URIs: 2045
   
   
================================================================================
   Testing with 12 OR condition(s)
   
================================================================================
   Expression:  ((((((((((((int_col == 1) or (int_col == 2)) or (int_col == 3)) 
or (int_col == 4)) or (int_col == 5)) or (int_col == 6)) or (int_col == 7)) or 
(int_col == 8)) or (int_col == 9)) or (int_col == 10)) or (int_col == 11)) or 
(int_col == 12))
   Substrait size: 772342 bytes
   Total extension URIs: 8189
   
   
================================================================================
   Testing with 14 OR condition(s)
   
================================================================================
   Expression:  ((((((((((((((int_col == 1) or (int_col == 2)) or (int_col == 
3)) or (int_col == 4)) or (int_col == 5)) or (int_col == 6)) or (int_col == 7)) 
or (int_col == 8)) or (int_col == 9)) or (int_col == 10)) or (int_col == 11)) 
or (int_col == 12)) or (int_col == 13)) or (int_col == 14))
   Substrait size: 3105111 bytes
   Total extension URIs: 32765
   
   
================================================================================
   Summary:
   
================================================================================
    OR Conditions  |  Size (bytes)   |    Size (KB)    | Extension URIs 
   ----------------------------------------------------------------------
          1        |       201       |      0.20       |        1       
          2        |       634       |      0.62       |        5       
          4        |      2,967      |      2.90       |       29       
          6        |     12,017      |      11.74      |       125      
          8        |     48,306      |      47.17      |       509      
         10        |     193,172     |     188.64      |      2045      
         12        |     772,342     |     754.24      |      8189      
         14        |    3,105,111    |     3032.33     |      32765     
   ```
    
   Serialization was done by simply: `buf = pas.serialize_expressions([expr], 
["result"], schema)`
   
   Unfortunately, it might not be possible to just use `is_in` in the 
expression so that might not be a possible workaround. 
   
   Would be nice to know if I am somehow misusing the API which is causing 
this. Thanks! 
   
   cc: @westonpace 
   
   ### Component(s)
   
   C++


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to