Gabor Kaszab created IMPALA-10919:
-------------------------------------

             Summary: Fine tune row batch handling to support arrays with 
millions of items
                 Key: IMPALA-10919
                 URL: https://issues.apache.org/jira/browse/IMPALA-10919
             Project: IMPALA
          Issue Type: Improvement
          Components: Backend
            Reporter: Gabor Kaszab


Currently, during row batch serialization there is an upper limit on the 
serialized row batch size of int32::max():
https://github.com/apache/impala/blob/b67c0906f596ca336d0ea0e8cbc618a20ac0e563/be/src/runtime/row-batch.cc#L348

As a result, if we query an array of millions of items in the select list, it 
won't fit into a serialized row batch even if it stores only integers. With 
the current implementation, taking into account that the default number of 
rows in a row batch is 1024, the limit allows approximately 524k integers in 
each row (524,288 integers * 4 bytes * 1024 rows = 2^31 ≈ int32::max()).

There is a workaround: reduce the number of rows in a row batch with the 
BATCH_SIZE query option. However, that seems like overkill, as it reduces the 
batch size for all the nodes in the query, even though after the arrays are 
unnested it is most likely safe to use the default batch size again.
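As a sketch of the workaround (BATCH_SIZE is an existing Impala query option; 
the table and column names are hypothetical, and 128 is just an illustrative 
value):

```sql
-- Shrink every row batch in the query so that rows carrying a huge array
-- fit under the int32::max() serialization cap. Note this affects all plan
-- nodes, not only the ones that still carry the unflattened array.
SET BATCH_SIZE=128;
SELECT item FROM my_table t, t.int_arr;
```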

Just an idea, but it would be nice to dynamically adjust the row batch size 
when big arrays are queried so that they fit into a serialized row batch, and 
then fall back to the default size once the unnesting has happened (and there 
are no longer arrays in the row batch).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)