Zoltán Borók-Nagy created IMPALA-12388:
------------------------------------------

             Summary: Strip file/pos information from tuples once they are not 
needed
                 Key: IMPALA-12388
                 URL: https://issues.apache.org/jira/browse/IMPALA-12388
             Project: IMPALA
          Issue Type: Bug
          Components: Backend, Frontend
            Reporter: Zoltán Borók-Nagy


When Impala processes Iceberg V2 tables that have position delete files, it 
needs to add extra slots to the input tuples (required by the ANTI JOIN between 
data files and delete files):
 * STRING file path
 * BIGINT position

This makes each row larger by 20 bytes. Note that these 20 bytes are only the 
increase in tuple memory (12-byte STRING slot plus 8-byte BIGINT slot); the 
file path itself points to a potentially large string (100-200 bytes) stored 
in a heap buffer.

In the plan fragments of the SCANs we only create one string object per file 
for the file path (and set it in the template tuple), so the situation is not 
that bad. But once we send the rows over the network, the STRINGs are 
duplicated per record, which can add substantial network and serialization 
overhead.
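
To get a feel for the numbers above, here is a rough back-of-the-envelope 
model. It is only a sketch: the slot sizes come from this description, while 
the function names and the assumption that a serialized row carries its own 
copy of the path are simplifications that ignore the real wire format and any 
compression.

```python
STRING_SLOT_BYTES = 12  # fixed-size STRING slot, as stated in the description
BIGINT_SLOT_BYTES = 8   # fixed-size BIGINT slot, as stated in the description


def extra_bytes_per_row(path_len: int, serialized: bool) -> int:
    """Extra bytes one row carries because of the file path / position slots.

    Assumption: in the scan fragment the path string is shared per file, so a
    row only pays for the two slots; once serialized, every row also carries
    its own copy of the path.
    """
    fixed = STRING_SLOT_BYTES + BIGINT_SLOT_BYTES
    return fixed + (path_len if serialized else 0)


def extra_bytes_total(num_rows: int, path_len: int, serialized: bool) -> int:
    """Total extra bytes for num_rows rows that all come from one file."""
    total = num_rows * extra_bytes_per_row(path_len, serialized)
    if not serialized:
        total += path_len  # the single shared copy of the path
    return total
```

With a 150-byte path, this model puts the per-row overhead at 20 bytes inside 
the scan fragment and 170 bytes after serialization.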

One way to resolve this is to re-materialize the tuples after the Iceberg V2 
scan is done, and only store the interesting slots. This mechanism also saves 
us the 20 bytes per tuple overhead, but the re-materialization cost can be high.

Another, easier solution is to just NULL-out the file path and position slots 
once they are not needed anymore.

Of course, if the user SELECTs the virtual column INPUT__FILE__NAME, we 
cannot re-materialize / NULL out.
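
The two options can be contrasted with a small sketch. This is a toy model in 
Python, not Impala backend code: rows are plain dicts, the slot name 
"file_position" and both function names are hypothetical, and the 
"still_needed" set stands in for whatever the planner knows about columns 
that are referenced later (for example a selected INPUT__FILE__NAME).

```python
from typing import Any, Dict, Set

# Hypothetical slot names; only INPUT__FILE__NAME appears in the description.
FILE_PATH_SLOT = "INPUT__FILE__NAME"
POSITION_SLOT = "file_position"
VIRTUAL_SLOTS = (FILE_PATH_SLOT, POSITION_SLOT)

Row = Dict[str, Any]


def null_out(row: Row, still_needed: Set[str]) -> Row:
    """Keep the row layout, but set unneeded file/pos slots to None (NULL)."""
    return {
        name: None if name in VIRTUAL_SLOTS and name not in still_needed else value
        for name, value in row.items()
    }


def rematerialize(row: Row, still_needed: Set[str]) -> Row:
    """Copy only the interesting slots into a new, narrower row."""
    return {
        name: value
        for name, value in row.items()
        if name not in VIRTUAL_SLOTS or name in still_needed
    }
```

In this model null_out() keeps the slots (so the 20 bytes of tuple memory 
remain) but drops the reference to the path string, while rematerialize() 
removes the slots entirely at the price of copying every row.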

Given the following plan:
{noformat}
    UNION ALL
    /        \
   /          \
SCAN          V2 ANTI JOIN
data files       /      \
without         /        \
deletes     SCAN         SCAN
            data files   delete files
            with deletes
{noformat}
In the "SCAN data files without deletes" node we shouldn't even fill the file 
path / position slots, which also saves some computational cost.

In our V2 ANTI JOIN operator (IcebergDeleteNode) we can NULL out the file path 
/ pos slots once the data records are processed.
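
The intended behavior for the plan above could look roughly like this. It is 
an illustrative Python model only, not the IcebergDeleteNode implementation: 
DataRow, the function names, and the use of an in-memory set of 
(file path, position) pairs for the delete files are all assumptions made for 
the sketch.

```python
from typing import Iterable, Iterator, NamedTuple, Optional, Set, Tuple


class DataRow(NamedTuple):
    payload: str
    file_path: Optional[str]  # models the STRING file path slot
    pos: Optional[int]        # models the BIGINT position slot


def scan_without_deletes(payloads: Iterable[str]) -> Iterator[DataRow]:
    """Scan of data files without deletes: never fills the file/pos slots."""
    for payload in payloads:
        yield DataRow(payload, None, None)


def anti_join_and_strip(
    rows: Iterable[DataRow], deleted: Set[Tuple[str, int]]
) -> Iterator[DataRow]:
    """Drop rows matched by a position delete, then NULL out file/pos."""
    for row in rows:
        if (row.file_path, row.pos) in deleted:
            continue  # row was deleted by a position delete record
        # The slots were only needed for the lookup above.
        yield row._replace(file_path=None, pos=None)


def union_all(*inputs: Iterable[DataRow]) -> Iterator[DataRow]:
    """Concatenate the outputs of both branches of the plan."""
    for rows in inputs:
        yield from rows
```

In this sketch every row leaving the UNION ALL has NULL file path and 
position slots, so nothing downstream pays for per-row copies of the path.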



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
