zhangbutao commented on PR #5282:
URL: https://github.com/apache/hive/pull/5282#issuecomment-2154202895

   > @zhangbutao, does this mean we always read the complete schema from 
Iceberg and then apply the projection on the Hive side? That should reduce the 
number of bytes read, right?
   
   @deniskuzZ Not entirely. I think the original intention of storing 
`project()` & `select()` on the `scan` is to reduce the number of bytes read.
   But **on the Hive side**, I found that we don't take full advantage of the 
`column projection` property stored on the `scan`. After some debugging, I think 
the main use of `project()` & `select()` on the scan is here, where they only 
feed some scan metrics, **which we don't use in Hive**:
   
https://github.com/apache/iceberg/blob/c7de6cb345995cb47312edbef6edae2f17fb8aba/core/src/main/java/org/apache/iceberg/SnapshotScan.java#L143-L157
   
   In fact, on the Hive side we propagate the `projected column` through `conf` 
rather than through the `scan` class, and we read the `projected column` back 
from conf wherever we need it, for example:
   
https://github.com/apache/hive/blob/98d9d22398370f817fe64449368671b978fff096/iceberg/iceberg-handler/src/main/java/org/apache/iceberg/mr/mapreduce/IcebergInputFormat.java#L342
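
To illustrate the idea of carrying the projection in conf, here is a minimal sketch. It is not the actual Hive code: a plain `Map` stands in for the job configuration, and the class name, the key `sketch.selected.columns`, and both helper methods are hypothetical.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical sketch: a Map stands in for the job configuration.
final class ConfProjectionSketch {
  // Hypothetical key, not the real Hive/Iceberg property name.
  static final String SELECTED_COLUMNS = "sketch.selected.columns";

  // Planner side: record the columns the query needs as a comma-separated value.
  static void setSelectedColumns(Map<String, String> conf, List<String> columns) {
    conf.put(SELECTED_COLUMNS, String.join(",", columns));
  }

  // Reader side: recover the projection from conf.
  // A missing or empty value means "read every column of the table".
  static List<String> readColumns(Map<String, String> conf, List<String> tableColumns) {
    String value = conf.get(SELECTED_COLUMNS);
    if (value == null || value.isEmpty()) {
      return tableColumns;
    }
    List<String> selected = Arrays.asList(value.split(","));
    // Keep the table's column order and drop everything that was not selected.
    return tableColumns.stream()
        .filter(selected::contains)
        .collect(Collectors.toList());
  }
}
```

In this sketch the reader never looks at the scan object, which is why the projection keeps working regardless of what is stored on the `scan`.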
   
   The final place in Hive that uses the `projected column` is here (for 
Parquet Iceberg tables):
   
https://github.com/apache/hive/blob/98d9d22398370f817fe64449368671b978fff096/iceberg/iceberg-handler/src/main/java/org/apache/iceberg/mr/mapreduce/IcebergInputFormat.java#L510
   
   
   In short, with or without my change, the `projected column` optimization 
still works, because on Hive we don't rely on the `project()` & `select()` 
stored on the `scan`; we only use conf to propagate the `projected column`.
   But we need to make the code correct and return the new scan, so that in the 
future we can make use of the `project()` & `select()` stored on the `scan`.
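
Why returning the new scan matters can be shown with a small sketch. It assumes only that refinement calls such as `select()` return a new scan instead of modifying the one they are called on. `ImmutableScanSketch` and its methods are hypothetical stand-ins, not Iceberg's real classes.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical stand-in for an immutable scan; not Iceberg's real scan class.
final class ImmutableScanSketch {
  private final List<String> selected;

  ImmutableScanSketch(List<String> selected) {
    this.selected = Collections.unmodifiableList(new ArrayList<>(selected));
  }

  // Refinement returns a NEW scan and leaves this one untouched.
  ImmutableScanSketch select(List<String> columns) {
    return new ImmutableScanSketch(columns);
  }

  List<String> selectedColumns() {
    return selected;
  }

  // Incorrect pattern: the result of select() is dropped, so the caller
  // gets back the original scan without the projection.
  static ImmutableScanSketch applyProjectionDiscarded(
      ImmutableScanSketch scan, List<String> columns) {
    scan.select(columns);
    return scan;
  }

  // Correct pattern: return the refined scan so the projection is kept.
  static ImmutableScanSketch applyProjectionKept(
      ImmutableScanSketch scan, List<String> columns) {
    return scan.select(columns);
  }
}
```

With the first helper the projection is silently lost on the scan object; with the second it is available to any later code that reads it from the scan.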


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: gitbox-unsubscr...@hive.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

