Csaba Ringhofer has posted comments on this change. ( http://gerrit.cloudera.org:8080/15051 )
Change subject: IMPALA-9226: Improve string allocations of the ORC scanner ...................................................................... Patch Set 3: (1 comment) http://gerrit.cloudera.org:8080/#/c/15051/3/be/src/exec/hdfs-orc-scanner.cc File be/src/exec/hdfs-orc-scanner.cc: http://gerrit.cloudera.org:8080/#/c/15051/3/be/src/exec/hdfs-orc-scanner.cc@683 PS3, Line 683: dst_batch->tuple_data_pool() As we have discussed offline, this is no longer correct. Using the current batch's mempool means that the memory can be reclaimed by the consumer of this row batch (the caller of this node's GetNext()) after the current GetNext() call returned. So it cannot be used in future batches. Before your optimization, only those string were allocated here, which were used in the current row batch, so there was no problem. Now if batch_one starts the read rows from the Orc vector, and batch_two finishes it, the blob will belong to batch_one, so it can be already freed/reused when batch_two starts to use it. The solution is to give the scanner its own mempool, use that to initialize the blobs, and attach the blobs to the last row batch that reads rows from the given vector. For Parquet this is mempool is https://github.com/apache/impala/blob/4fa6b5260d9e28dee63a87de3bea1434706a9d05/be/src/exec/parquet/parquet-column-chunk-reader.h#L140 Parquet dictionaries use a different pool: https://github.com/apache/impala/blob/4fa6b5260d9e28dee63a87de3bea1434706a9d05/be/src/exec/parquet/hdfs-parquet-scanner.h#L430 Doing this in Orc will be much simpler, as the library hides compression/decoding of data, but the way buffers are attached to row batches should be the same. -- To view, visit http://gerrit.cloudera.org:8080/15051 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: If2d975946fb6f4104d8dc98895285b3a0c6bef7f Gerrit-Change-Number: 15051 Gerrit-PatchSet: 3 Gerrit-Owner: Norbert Luksa <[email protected]> Gerrit-Reviewer: Csaba Ringhofer <[email protected]> Gerrit-Reviewer: Daniel Becker <[email protected]> Gerrit-Reviewer: Gabor Kaszab <[email protected]> Gerrit-Reviewer: Impala Public Jenkins <[email protected]> Gerrit-Reviewer: Norbert Luksa <[email protected]> Gerrit-Reviewer: Zoltan Borok-Nagy <[email protected]> Gerrit-Comment-Date: Tue, 28 Jan 2020 16:24:07 +0000 Gerrit-HasComments: Yes
