github-matthias-kunter commented on issue #10828: URL: https://github.com/apache/iceberg/issues/10828#issuecomment-2636810078
@RussellSpitzer We are also seeing a massive increase in Spark input data size after switching from raw Parquet ingestion to Iceberg table ingestion. It happens only in those jobs/processes that read non-primitive columns (arrays, nested fields). If we leave those columns out in experiments, Iceberg table reads are extremely efficient in terms of data read, usually by some orders of magnitude compared to raw Parquet. Since the source data is stored in S3, this is a significant cost factor.

As I understood from the conversation above, Iceberg ultimately reads the Parquet files managed by an Iceberg table using copied Spark code, but that code does not yet contain the optimizations for columnar reads of nested fields introduced in Spark 3.0. So, is there any plan in the near future to update those parts of the Iceberg code?
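For reference, a minimal sketch of the kind of experiment we ran. The catalog name `demo`, the table `demo.db.events`, the column names, and the S3 paths are all hypothetical placeholders; the `noop` sink is just Spark's built-in benchmarking data source used to force a full scan without writing anything. The input-size difference shows up in the Spark UI as soon as a nested column is selected:

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical catalog/table/paths, purely to illustrate the comparison.
val spark = SparkSession.builder()
  .appName("iceberg-nested-read-comparison")
  .config("spark.sql.catalog.demo", "org.apache.iceberg.spark.SparkCatalog")
  .config("spark.sql.catalog.demo.type", "hadoop")
  .config("spark.sql.catalog.demo.warehouse", "s3://bucket/warehouse")
  .getOrCreate()

// 1) Primitive columns only: this is the case where Iceberg reads are very
//    efficient for us (orders of magnitude less input than raw Parquet).
spark.sql("SELECT id, ts FROM demo.db.events")
  .write.format("noop").mode("overwrite").save()

// 2) Same query plus a nested column (array/struct): this is where the
//    "input size" metric in the Spark UI blows up.
spark.sql("SELECT id, ts, payload.items FROM demo.db.events")
  .write.format("noop").mode("overwrite").save()

// Baseline: the same projection against the raw Parquet files.
spark.read.parquet("s3://bucket/raw/events")
  .select("id", "ts", "payload.items")
  .write.format("noop").mode("overwrite").save()
```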
