deniskuzZ commented on PR #5282: URL: https://github.com/apache/hive/pull/5282#issuecomment-2155498550
> > @zhangbutao, does this mean we always read the complete schema from Iceberg and then apply the projection on the Hive side? That should reduce the number of bytes read, right?
>
> @deniskuzZ Not completely right. I think the original intention of the `project()` & `select()` state stored on `scan` is to reduce the number of bytes read. But **on the Hive side**, I found that we don't take full advantage of the column projection property stored on `scan`. After some debugging, I think the main use of `project()` & `select()` on the scan is here, just to compute some scan metrics, **which we don't use in Hive**: https://github.com/apache/iceberg/blob/c7de6cb345995cb47312edbef6edae2f17fb8aba/core/src/main/java/org/apache/iceberg/SnapshotScan.java#L143-L157
>
> In fact, on the Hive side, we propagate the projected columns through `conf`, not through the `scan` class, and we read them back from `conf` wherever we need them. A code snippet like this:
>
> https://github.com/apache/hive/blob/98d9d22398370f817fe64449368671b978fff096/iceberg/iceberg-handler/src/main/java/org/apache/iceberg/mr/mapreduce/IcebergInputFormat.java#L342
>
> The final place in Hive that uses the projected columns is here (Parquet Iceberg table):
>
> https://github.com/apache/hive/blob/98d9d22398370f817fe64449368671b978fff096/iceberg/iceberg-handler/src/main/java/org/apache/iceberg/mr/mapreduce/IcebergInputFormat.java#L510
>
> In short, with or without my change, the column projection is still applied, because on the Hive side we don't rely on the `project()` & `select()` stored on `scan`; we just use `conf` to propagate the projected columns. But we need to make the code correct and return the new scan, so that we can make use of the `project()` & `select()` stored on `scan` in the future.

Since it's not used now, could we add a TODO? Otherwise that would be just dead code.

-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: gitbox-unsubscr...@hive.apache.org
For queries about this service, please contact Infrastructure at: us...@infra.apache.org
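
For readers following the thread, here is a minimal sketch of the pattern under discussion: a scan whose refinement methods return a new object, so discarding the return value silently drops the projection. The `Scan` class and the `buggy`/`fixed` helpers are hypothetical stand-ins written for illustration. They are not the Iceberg `TableScan` API or the code in `IcebergInputFormat`, and the TODO wording is only a guess at what the requested comment might say.

```java
import java.util.List;

public class ScanRefinementSketch {

    // Hypothetical stand-in for an immutable scan: select() returns a NEW object
    // and leaves the receiver untouched.
    static final class Scan {
        private final List<String> columns;

        Scan(List<String> columns) {
            this.columns = List.copyOf(columns);
        }

        Scan select(List<String> selected) {
            return new Scan(selected);
        }

        List<String> columns() {
            return columns;
        }
    }

    // Pattern being fixed: the refined scan is created and then thrown away.
    static Scan buggy(Scan scan, List<String> projected) {
        scan.select(projected); // return value discarded, so the call has no effect
        return scan;
    }

    // Corrected pattern: return the refined scan.
    static Scan fixed(Scan scan, List<String> projected) {
        // TODO (illustrative wording): the projection stored on the scan is not
        // consumed by the reader yet, because projected columns travel through conf.
        return scan.select(projected);
    }

    public static void main(String[] args) {
        Scan full = new Scan(List.of("id", "name", "ts"));
        System.out.println(buggy(full, List.of("id")).columns()); // [id, name, ts]
        System.out.println(fixed(full, List.of("id")).columns()); // [id]
    }
}
```

In this toy version the difference is visible in the returned column list. In the thread's scenario the read path gets its columns from `conf`, which is why the fix does not change behavior today and why a TODO explaining the currently unused state was requested.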