walterddr opened a new issue, #11921:
URL: https://github.com/apache/pinot/issues/11921
When we send data over the mailboxes, we estimate the data size and cut the inbound messages into chunks. However:
```
// Use estimated row size, this estimate is not accurate and is used
// to estimate numRowsPerChunk only.
int estimatedRowSizeInBytes =
    block.getDataSchema().getColumnNames().length * MEDIAN_COLUMN_SIZE_BYTES;
int numRowsPerChunk = maxBlockSize / estimatedRowSizeInBytes;
while (currentRow < totalNumRows) {
  List<Object[]> chunk = allRows.subList(currentRow,
      Math.min(currentRow + numRowsPerChunk, allRows.size()));
  // ...
}
```
This is not an accurate estimate when there are high-cardinality string/bytes columns whose values can be very large.
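To make the mismatch concrete, here is a hypothetical illustration; the constants below are assumptions for the sketch, not Pinot's actual values. With a fixed 8-byte per-column estimate, two columns give an estimated row size of 16 bytes, so a 4 MB block budget yields 262144 rows per chunk; if each row actually carries a ~1 KB string, a single chunk serializes to ~256 MB, far over budget:

```java
// Hypothetical demo of how a fixed per-column estimate can blow past the
// block budget. MEDIAN_COLUMN_SIZE_BYTES and MAX_BLOCK_SIZE are assumed
// values for illustration, not Pinot's real constants.
public class ChunkEstimateDemo {
  static final int MEDIAN_COLUMN_SIZE_BYTES = 8;      // assumed constant
  static final int MAX_BLOCK_SIZE = 4 * 1024 * 1024;  // assumed 4 MB budget

  // Mirrors the estimation logic in the snippet above: rows per chunk is
  // derived purely from the column count, ignoring actual value sizes.
  static int numRowsPerChunk(int numColumns) {
    int estimatedRowSizeInBytes = numColumns * MEDIAN_COLUMN_SIZE_BYTES;
    return MAX_BLOCK_SIZE / estimatedRowSizeInBytes;
  }

  public static void main(String[] args) {
    int rowsPerChunk = numRowsPerChunk(2);              // 262144
    long actualChunkBytes = (long) rowsPerChunk * 1024; // each row really ~1 KB
    System.out.println(rowsPerChunk + " rows -> " + actualChunkBytes + " bytes");
    System.out.println("over budget: " + (actualChunkBytes > MAX_BLOCK_SIZE));
  }
}
```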
A simple solution is to use the first row to estimate the row size when variable-length columns are present, but:
- there's no easy way to tell the cardinality
- it is expensive to compute the size of an `Object[]` row, which requires looping through every value.
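A minimal sketch of that first-row sampling idea (the helper names here are hypothetical, not Pinot APIs). It sizes variable-length values by inspecting the first row and falls back to the fixed median estimate for everything else; as the bullets above note, one sampled row may still misrepresent a high-cardinality column, so this is only a partial fix:

```java
import java.util.List;

// Hypothetical sampler sketch; MEDIAN_COLUMN_SIZE_BYTES and both method
// names are assumptions for illustration, not part of Pinot's code.
public class RowSizeSampler {
  static final int MEDIAN_COLUMN_SIZE_BYTES = 8; // assumed constant

  // Estimating one row's size means looping over every value - exactly the
  // per-row cost the issue calls out as expensive.
  static int estimateRowSizeInBytes(Object[] row) {
    int size = 0;
    for (Object value : row) {
      if (value instanceof String) {
        size += ((String) value).length(); // approximate; ignores encoding overhead
      } else if (value instanceof byte[]) {
        size += ((byte[]) value).length;
      } else {
        size += MEDIAN_COLUMN_SIZE_BYTES;  // fixed-width columns keep the old estimate
      }
    }
    return size;
  }

  static int numRowsPerChunk(List<Object[]> allRows, int maxBlockSize) {
    if (allRows.isEmpty()) {
      return 1;
    }
    // Sample only the first row to bound the cost; later rows may differ.
    int estimated = Math.max(1, estimateRowSizeInBytes(allRows.get(0)));
    return Math.max(1, maxBlockSize / estimated);
  }
}
```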
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]