walterddr opened a new issue, #11921:
URL: https://github.com/apache/pinot/issues/11921
When we send data over the mailboxes, we estimate the data size and cut the inbound messages into chunks. However:
```
// Use estimated row size, this estimate is not accurate and is used
// to estimate numRowsPerChunk only.
int estimatedRowSizeInBytes =
    block.getDataSchema().getColumnNames().length * MEDIAN_COLUMN_SIZE_BYTES;
int numRowsPerChunk = maxBlockSize / estimatedRowSizeInBytes;
while (currentRow < totalNumRows) {
  List<Object[]> chunk = allRows.subList(currentRow,
      Math.min(currentRow + numRowsPerChunk, allRows.size()));
  // ...
}
```
This is not an accurate estimate when there are high-cardinality string/bytes columns whose values can be very large.
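To make the mismatch concrete, here is a hypothetical illustration; the constants below are assumptions for the sketch, not Pinot's actual values. With a fixed 8-byte per-column estimate, two columns give an estimated row size of 16 bytes, so a 4 MB block budget yields 262144 rows per chunk; if each row actually carries a ~1 KB string, a single chunk serializes to ~256 MB, far over budget:

```java
// Hypothetical demo of how a fixed per-column estimate can blow past the
// block budget. MEDIAN_COLUMN_SIZE_BYTES and MAX_BLOCK_SIZE are assumed
// values for illustration, not Pinot's real constants.
public class ChunkEstimateDemo {
  static final int MEDIAN_COLUMN_SIZE_BYTES = 8;      // assumed constant
  static final int MAX_BLOCK_SIZE = 4 * 1024 * 1024;  // assumed 4 MB budget

  // Mirrors the estimation logic in the snippet above: rows per chunk is
  // derived purely from the column count, ignoring actual value sizes.
  static int numRowsPerChunk(int numColumns) {
    int estimatedRowSizeInBytes = numColumns * MEDIAN_COLUMN_SIZE_BYTES;
    return MAX_BLOCK_SIZE / estimatedRowSizeInBytes;
  }

  public static void main(String[] args) {
    int rowsPerChunk = numRowsPerChunk(2);              // 262144
    long actualChunkBytes = (long) rowsPerChunk * 1024; // each row really ~1 KB
    System.out.println(rowsPerChunk + " rows -> " + actualChunkBytes + " bytes");
    System.out.println("over budget: " + (actualChunkBytes > MAX_BLOCK_SIZE));
  }
}
```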
A simple solution is to use the first row to estimate the row size when variable-length columns are present, but:
- there's no easy way to tell the cardinality
- it is expensive to compute the size of an `Object[]` row, which requires looping through every value.
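A minimal sketch of that first-row sampling idea (the helper names here are hypothetical, not Pinot APIs). It sizes variable-length values by inspecting the first row and falls back to the fixed median estimate for everything else; as the bullets above note, one sampled row may still misrepresent a high-cardinality column, so this is only a partial fix:

```java
import java.util.List;

// Hypothetical sampler sketch; MEDIAN_COLUMN_SIZE_BYTES and both method
// names are assumptions for illustration, not part of Pinot's code.
public class RowSizeSampler {
  static final int MEDIAN_COLUMN_SIZE_BYTES = 8; // assumed constant

  // Estimating one row's size means looping over every value - exactly the
  // per-row cost the issue calls out as expensive.
  static int estimateRowSizeInBytes(Object[] row) {
    int size = 0;
    for (Object value : row) {
      if (value instanceof String) {
        size += ((String) value).length(); // approximate; ignores encoding overhead
      } else if (value instanceof byte[]) {
        size += ((byte[]) value).length;
      } else {
        size += MEDIAN_COLUMN_SIZE_BYTES;  // fixed-width columns keep the old estimate
      }
    }
    return size;
  }

  static int numRowsPerChunk(List<Object[]> allRows, int maxBlockSize) {
    if (allRows.isEmpty()) {
      return 1;
    }
    // Sample only the first row to bound the cost; later rows may differ.
    int estimated = Math.max(1, estimateRowSizeInBytes(allRows.get(0)));
    return Math.max(1, maxBlockSize / estimated);
  }
}
```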
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]