jp0317 commented on PR #36192: URL: https://github.com/apache/arrow/pull/36192#issuecomment-1609860015
> I don't see anything wrong with this. I do think it might be hard to use. How is a user going to know which row groups or columns they are supposed to prebuffer? Maybe this is for a situation where a user needs to read the same parquet file over and over again and they can highly tune this operation?

Thanks for your review. I think users can learn the column chunk sizes by reading the metadata and then make their own choices. Repeated reading and tuning can be one use case, but I think this PR is more likely to be used for fine-grained reading at the individual column chunk level.

> Also, you may want to look into clearing the pre-buffer cache after a row group is read. That could help start releasing memory as the file is being read.

Yes. Can clearing the pre-buffer cache be a separate improvement?

> Would this support a use case allowing us to use prebuffering and still read one row group at a time like:
>
> PreBuffer row group 0
> Read row group 0
> PreBuffer row group 1
> Read row group 1
> ...

Yes. In addition, I think it can be: Get column chunk sizes -> Determine prebuffer target chunks -> Call prebuffer -> Read row group -> ...
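
To make the "Get column chunk sizes -> Determine prebuffer target chunks -> Call prebuffer -> Read row group" flow concrete, here is a minimal Python sketch of the selection step only. It is not code from the PR. The function names, the `max_chunk_bytes` threshold, and the size-based policy are all hypothetical, and the chunk sizes are passed in as a plain dict rather than read from a real file. In practice the sizes would come from the Parquet file metadata, and the resulting plan would drive the actual prebuffer and read calls.

```python
from typing import Dict, List, Tuple

# Hypothetical key for one column chunk: (row_group_index, column_index).
ChunkKey = Tuple[int, int]


def choose_prebuffer_targets(
    chunk_sizes: Dict[ChunkKey, int],
    wanted_columns: List[int],
    row_group: int,
    max_chunk_bytes: int,
) -> List[int]:
    """Pick the wanted columns of one row group that are worth prebuffering.

    Assumed policy (illustrative only): prebuffer a chunk when its size is
    known and does not exceed max_chunk_bytes.
    """
    targets = []
    for col in wanted_columns:
        size = chunk_sizes.get((row_group, col))
        if size is not None and size <= max_chunk_bytes:
            targets.append(col)
    return targets


def plan_reads(
    chunk_sizes: Dict[ChunkKey, int],
    wanted_columns: List[int],
    num_row_groups: int,
    max_chunk_bytes: int,
) -> List[Tuple[int, List[int]]]:
    """Build a per-row-group plan: (row_group, columns to prebuffer before reading it)."""
    return [
        (rg, choose_prebuffer_targets(chunk_sizes, wanted_columns, rg, max_chunk_bytes))
        for rg in range(num_row_groups)
    ]


# Made-up sizes in bytes for a file with 2 row groups and 2 columns.
example_sizes = {(0, 0): 100, (0, 1): 5000, (1, 0): 200, (1, 1): 50}
example_plan = plan_reads(example_sizes, wanted_columns=[0, 1],
                          num_row_groups=2, max_chunk_bytes=1000)
```

A caller would then walk `example_plan` one row group at a time, issuing the prebuffer call for the listed columns before reading that row group, which matches the one-row-group-at-a-time pattern asked about above.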
