[
https://issues.apache.org/jira/browse/IMPALA-12388?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Gabor Kaszab closed IMPALA-12388.
---------------------------------
Fix Version/s: Not Applicable
Resolution: Won't Fix
I explored some possible implementations for this. The simplest one
unconditionally sets the relevant null indicators to true for the position
delete related slots. This adds the least overhead on top of the existing
logic in terms of performance.
I then ran perf verifications on both TPCDS and TPCH, but for some queries
this causes an actual perf degradation. In the worst case (a select-only
query) it results in a 5% increase in runtime. There were some queries where
I observed improvements of around 2-3%, but the overall results weren't
convincing enough for me to proceed.
Closing this as won't fix, as the initial results aren't good enough to proceed.
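To illustrate the "set the null indicators" approach described above, here is a minimal sketch. It uses a toy row layout, not Impala's actual tuple classes: `ToyRow`, the bit constants, and `SerializedStringBytes` are hypothetical names invented for this example. It also assumes that a serializer skips the variable-length data of a NULL string slot.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string_view>

// Toy fixed-layout row (hypothetical, NOT Impala's real tuple representation).
struct ToyRow {
  uint8_t null_bits = 0;       // one null-indicator bit per nullable slot
  std::string_view file_path;  // position-delete slot: data file path
  int64_t file_pos = 0;        // position-delete slot: row position in file
  int64_t payload = 0;         // stands in for a column the query selected
};

constexpr uint8_t kFilePathNullBit = 1u << 0;
constexpr uint8_t kFilePosNullBit = 1u << 1;

// Marks both position-delete slots as NULL. Only the indicator bits change;
// the slot memory itself is left untouched, which keeps the operation cheap.
inline void NullOutPositionDeleteSlots(ToyRow* row) {
  assert(row != nullptr);
  row->null_bits = static_cast<uint8_t>(
      row->null_bits | kFilePathNullBit | kFilePosNullBit);
}

inline bool IsNull(const ToyRow& row, uint8_t bit) {
  return (row.null_bits & bit) != 0;
}

// Assumed serializer behaviour: string bytes are copied only for non-NULL slots.
inline std::size_t SerializedStringBytes(const ToyRow& row) {
  return IsNull(row, kFilePathNullBit) ? 0 : row.file_path.size();
}
```

In this model, calling `NullOutPositionDeleteSlots` after the anti join means the file path bytes are no longer copied per row, while the fixed-size slots still occupy space in the tuple.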
> Strip file/pos information from tuples once they are not needed
> ---------------------------------------------------------------
>
> Key: IMPALA-12388
> URL: https://issues.apache.org/jira/browse/IMPALA-12388
> Project: IMPALA
> Issue Type: Bug
> Components: Backend, Frontend
> Reporter: Zoltán Borók-Nagy
> Assignee: Gabor Kaszab
> Priority: Major
> Labels: Performance, impala-iceberg, performance
> Fix For: Not Applicable
>
>
> When Impala processes Iceberg V2 tables that have position delete files, it
> needs to add extra slots to the input tuples (required by the ANTI JOIN
> between data files and delete files):
> * STRING file path
> * BIGINT position
> This makes the row-size larger by 20 bytes. Please note that these 20 bytes
> are only the increase in the tuple memory (12 byte STRING slot plus 8 byte
> BIGINT slot). The file path actually points to a potentially large string
> (100-200 bytes) stored in a heap buffer.
> In the plan fragments of the SCANs we only create a string object per file
> for the file path (and set it in the template tuple), so the situation is not
> that bad, but once we send the rows over the network the STRINGs are getting
> duplicated per record, which can add substantial network and serialization
> overhead.
> One way to resolve this is to re-materialize the tuples after the Iceberg V2
> scan is done, and only store the interesting slots. This mechanism also saves
> us the 20 bytes per tuple overhead, but the re-materialization cost can be
> high.
> Another, easier solution is to just NULL-out the file path and position slots
> once they are not needed anymore.
> Of course, if the user SELECTs the virtual columns {{INPUT_FILE_NAME /
> FILE_POSITION}} we cannot re-materialize / NULL out.
> Given the following plan:
> {noformat}
>               UNION ALL
>              /         \
>             /           \
>      SCAN              V2 ANTI JOIN
>      data files        /           \
>      without          /             \
>      deletes       SCAN            SCAN
>                    data files      delete files
>                    with deletes
> {noformat}
> In the "SCAN data files without deletes" we shouldn't even fill the file
> path / position slots. The latter also saves some computational cost.
> In our V2 ANTI JOIN operator (IcebergDeleteNode) we can NULL out the file
> path / pos slots once the data records are processed.
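The size figures quoted in the description can be turned into a rough back-of-the-envelope estimate. The sketch below is a simplified model under stated assumptions: it takes the 12 byte STRING slot and 8 byte BIGINT slot from the description, treats the path as shared per file on the scan side and copied per row after an exchange, and ignores compression and framing. The function names are hypothetical.

```cpp
#include <cstdint>

// Slot sizes as stated in the issue description.
constexpr int64_t kStringSlotBytes = 12;
constexpr int64_t kBigintSlotBytes = 8;
constexpr int64_t kExtraTupleBytes = kStringSlotBytes + kBigintSlotBytes;  // 20

// Scan side: every row carries the two extra slots, but the path string is
// stored once per file (assumed: all `rows` come from a single file).
constexpr int64_t ScanSideExtraBytes(int64_t rows, int64_t path_len) {
  return rows * kExtraTupleBytes + path_len;
}

// After an exchange: the path string is assumed to be duplicated per row.
constexpr int64_t ExchangeSideExtraBytes(int64_t rows, int64_t path_len) {
  return rows * (kExtraTupleBytes + path_len);
}
```

For example, with one million rows and an assumed 150 byte path (the middle of the 100-200 byte range mentioned above), the model gives roughly 20 MB of extra data on the scan side versus 170 MB after the exchange.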