marcin-krystianc commented on issue #5770: URL: https://github.com/apache/arrow-rs/issues/5770#issuecomment-2117032859
> 10K columns by 10 row groups by 1M rows is 100B values (400GB with int32). I don't think anyone has data like that (this is presumptuous, I am probably wrong).
>
> My experience has been either:
>
> * The files they make are much smaller (and thus not enough or undersized row groups), e.g. financial data where the ticker is a column.
> * The columns are very sparse (and thus a need for better sparse encoding), e.g. feature stores.

Hi, we use parquet files with 100 row groups and 50k columns (and this is after the dataset has been split into many individual parquet files). What is worse, our use case is reading individual row groups and only a subset of columns. That makes the cost of reading the entire metadata footer even higher than the cost of reading the actual data (because we read the entire footer but then only a tiny subset of the actual data). To deal with the problem we've implemented a tool that stores index information in a separate file, which allows reading only the necessary subset of the metadata (https://github.com/G-Research/PalletJack).
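For context, the access pattern is roughly what the following sketch does with the `parquet` crate (the file name `wide.parquet` and the column indices are placeholders, not our actual schema): even though only one row group and three columns are requested, `ParquetRecordBatchReaderBuilder::try_new` still parses the footer metadata for every column chunk in the file, which is where the cost we describe comes from.

```rust
use std::fs::File;

use parquet::arrow::arrow_reader::ParquetRecordBatchReaderBuilder;
use parquet::arrow::ProjectionMask;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Opening the builder decodes the entire Thrift footer up front;
    // with ~50k columns and 100 row groups this dominates the read.
    let file = File::open("wide.parquet")?;
    let builder = ParquetRecordBatchReaderBuilder::try_new(file)?;

    // Project only a handful of leaf columns (indices are illustrative).
    let mask = ProjectionMask::leaves(builder.parquet_schema(), [0, 1, 2]);

    // Read a single row group with that projection.
    let reader = builder
        .with_projection(mask)
        .with_row_groups(vec![0])
        .build()?;

    for batch in reader {
        let batch = batch?;
        println!("rows read: {}", batch.num_rows());
    }
    Ok(())
}
```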
