asheeshgarg commented on issue #1078: URL: https://github.com/apache/iceberg/issues/1078#issuecomment-638198029
Thanks Ryan.

> Iceberg uses columnar formats that help, but doesn't automatically convert to a sparse representation if that's what you're referring to.

Yes. Here is what is currently happening: I have 32K records of 40K columns, and in this case 80% of the columns have a null value. If I persist this Spark dataframe as CSV, the data comes out to roughly 4 GB and it takes 4 minutes to persist to an S3 bucket. If I persist the same dataframe as Parquet via Iceberg, the data comes out to roughly 19 MB but it takes 8 minutes to persist to the S3 bucket. I tried the uncompressed codec for Parquet but the timing didn't improve. I suspect the time is mostly spent in Parquet's internal data structures, such as dictionary encoding and other columnar-store optimizations.

What I was asking is whether there is a way to drop these 80% of columns during storage, since they are null, which would significantly reduce the write time using some sparse-storage technique. On read, we could then return a default value for the dropped columns. Any suggestion would be really helpful.
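The drop-on-write / restore-on-read idea above can be sketched in plain Python (hypothetical helper names; this illustrates the general technique outside of Spark or Iceberg, which do not provide it out of the box):

```python
def drop_null_columns(rows):
    """Return rows without all-null columns, plus the set of dropped column names."""
    columns = set().union(*(r.keys() for r in rows))
    # A column is droppable only if it is null in EVERY row.
    all_null = {c for c in columns if all(r.get(c) is None for r in rows)}
    dense = [{k: v for k, v in r.items() if k not in all_null} for r in rows]
    return dense, all_null

def restore_columns(rows, dropped, default=None):
    """Re-add dropped columns with a default value when reading back."""
    return [{**r, **{c: default for c in dropped}} for r in rows]

rows = [
    {"a": 1, "b": None, "c": None},
    {"a": 2, "b": 5,    "c": None},
]
dense, dropped = drop_null_columns(rows)
# "c" is dropped (null in every row); "b" is kept because one row has a value.
restored = restore_columns(dense, dropped)
```

The main cost in Spark would be the scan needed to determine which columns are null in every row before writing, so this only pays off when the write itself dominates.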
