Zoltán Borók-Nagy created IMPALA-12388:
------------------------------------------
Summary: Strip file/pos information from tuples once they are not
needed
Key: IMPALA-12388
URL: https://issues.apache.org/jira/browse/IMPALA-12388
Project: IMPALA
Issue Type: Bug
Components: Backend, Frontend
Reporter: Zoltán Borók-Nagy
When Impala processes Iceberg V2 tables that have position delete files, it
needs to add extra slots to the input tuples (required by the ANTI JOIN between
data files and delete files):
* STRING file path
* BIGINT position
This makes the row-size larger by 20 bytes. Please note that these 20 bytes are
only the increase in the tuple memory (12-byte STRING slot plus 8-byte BIGINT
slot); the file path actually points to a potentially large string (100-200
bytes) stored in a heap buffer.
In the plan fragments of the SCANs we only create one string object per file for
the file path (and set it in the template tuple), so the situation is not that
bad. But once we send the rows over the network, the STRINGs are duplicated
per record, which can add substantial network and serialization overhead.
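
A rough back-of-the-envelope estimate of the overhead described above. The slot sizes come from this description; the 150-byte path length and the row count are illustrative assumptions, and the "wire" size is simplified to slot bytes plus string payload per row, not Impala's actual row batch serialization format.

```python
STRING_SLOT_BYTES = 12  # STRING slot size in the tuple, as stated above
BIGINT_SLOT_BYTES = 8   # BIGINT slot size in the tuple, as stated above


def scan_overhead_bytes(num_rows, path_len):
    """Extra memory inside the scan fragment for one data file:
    fixed-size slots per row, plus a single shared path string."""
    return num_rows * (STRING_SLOT_BYTES + BIGINT_SLOT_BYTES) + path_len


def wire_overhead_bytes(num_rows, path_len):
    """Extra bytes once rows are serialized (simplified model):
    the path payload is copied for every row."""
    return num_rows * (STRING_SLOT_BYTES + BIGINT_SLOT_BYTES + path_len)


rows, path_len = 1_000_000, 150  # assumed values, for illustration only
print("scan fragment:", scan_overhead_bytes(rows, path_len))
print("serialized:   ", wire_overhead_bytes(rows, path_len))
```

Under these assumptions the per-row string copies dominate the serialized overhead, which is the case this issue targets.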
One way to resolve this is to re-materialize the tuples after the Iceberg V2
scan is done, and only store the interesting slots. This mechanism also saves
us the 20 bytes per tuple overhead, but the re-materialization cost can be high.
Another, easier solution is to just NULL-out the file path and position slots
once they are not needed anymore.
Of course, if the user SELECTs the virtual column INPUT__FILE__NAME, we
cannot re-materialize / NULL out.
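
The two options can be contrasted with a minimal sketch that models a tuple as a Python dict. The slot names `file_path` and `pos`, and the `user_selects_file_name` flag, are hypothetical names for illustration; this is not Impala's tuple layout or API.

```python
FILE_PATH, POS = "file_path", "pos"  # hypothetical slot names


def rematerialize(row, wanted_slots):
    """Option 1: copy only the interesting slots into a new, narrower tuple.
    Saves the 20 bytes per tuple, but costs a copy per row."""
    return {slot: row[slot] for slot in wanted_slots}


def null_out(row, user_selects_file_name=False):
    """Option 2: set the helper slots to NULL in place.
    The tuple keeps its width, but the path payload is no longer carried."""
    if user_selects_file_name:
        # INPUT__FILE__NAME is part of the query result, so keep the slots.
        return row
    row[FILE_PATH] = None
    row[POS] = None
    return row


row = {"id": 1, FILE_PATH: "/warehouse/t/data-0001.parquet", POS: 7}
print(rematerialize(row, ["id"]))
print(null_out(dict(row)))
```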
Given the following plan:
{noformat}
          UNION ALL
         /         \
        /           \
   SCAN              V2 ANTI JOIN
   data files        /          \
   without          /            \
   deletes       SCAN            SCAN
                 data files      delete files
                 with deletes
{noformat}
In the "SCAN data files without deletes" we shouldn't even fill the file path
/ position slots. Skipping them also saves some computational cost.
In our V2 ANTI JOIN operator (IcebergDeleteNode) we can NULL out the file path
/ pos slots once the data records are processed.
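
The proposed behaviour for the plan above could be sketched as follows. This is only a toy model with rows as dicts and hypothetical function names; it is not the IcebergDeleteNode implementation.

```python
def scan_without_deletes(rows):
    """Data files that have no delete files: never fill file path / position."""
    for row in rows:
        yield {**row, "file_path": None, "pos": None}


def anti_join_and_strip(data_rows, delete_rows):
    """Drop rows matched by a position delete, then NULL out the helper
    slots, since nothing downstream needs them anymore."""
    deleted = {(d["file_path"], d["pos"]) for d in delete_rows}
    for row in data_rows:
        if (row["file_path"], row["pos"]) in deleted:
            continue  # row was deleted by a position delete record
        row["file_path"] = None
        row["pos"] = None
        yield row


def plan(clean_rows, dirty_rows, delete_rows):
    """UNION ALL of the two branches shown in the plan."""
    yield from scan_without_deletes(clean_rows)
    yield from anti_join_and_strip(dirty_rows, delete_rows)


clean = [{"id": 1}]
dirty = [{"id": 2, "file_path": "f1", "pos": 0},
         {"id": 3, "file_path": "f1", "pos": 1}]
deletes = [{"file_path": "f1", "pos": 1}]
for out_row in plan(clean, dirty, deletes):
    print(out_row)
```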
--
This message was sent by Atlassian Jira
(v8.20.10#820010)