asheeshgarg commented on issue #1078: URL: https://github.com/apache/iceberg/issues/1078#issuecomment-638198029
Thanks Ryan.

> Iceberg uses columnar formats that help, but doesn't automatically convert to a sparse representation if that's what you're referring to.

Yes. Here is what is currently happening: I have 32K records of 40K columns, and in this case 80% of the columns have a null value. If I persist this Spark dataframe as CSV, the data comes out to roughly 4 GB and it takes 4 minutes to persist to an S3 bucket. If I persist the same dataframe as Parquet via Iceberg, the data comes out to roughly 19 MB but it takes 8 minutes to persist to the S3 bucket. I tried the uncompressed codec for Parquet but the timing didn't improve. I suspect the time is mostly spent in Parquet's internal data structures, such as dictionary encoding and other columnar-store optimizations.

What I was asking is whether there is a way to drop these 80% of columns during storage, since they are null, which would significantly reduce the write time using some sparse-storage technique. On read, we could then return a default value for the dropped columns. Any suggestion would be really helpful.
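The drop-on-write / restore-on-read idea above can be sketched in plain Python (hypothetical helper names; this illustrates the general technique outside of Spark or Iceberg, which do not provide it out of the box):

```python
def drop_null_columns(rows):
    """Return rows without all-null columns, plus the set of dropped column names."""
    columns = set().union(*(r.keys() for r in rows))
    # A column is droppable only if it is null in EVERY row.
    all_null = {c for c in columns if all(r.get(c) is None for r in rows)}
    dense = [{k: v for k, v in r.items() if k not in all_null} for r in rows]
    return dense, all_null

def restore_columns(rows, dropped, default=None):
    """Re-add dropped columns with a default value when reading back."""
    return [{**r, **{c: default for c in dropped}} for r in rows]

rows = [
    {"a": 1, "b": None, "c": None},
    {"a": 2, "b": 5,    "c": None},
]
dense, dropped = drop_null_columns(rows)
# "c" is dropped (null in every row); "b" is kept because one row has a value.
restored = restore_columns(dense, dropped)
```

The main cost in Spark would be the scan needed to determine which columns are null in every row before writing, so this only pays off when the write itself dominates.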
