[GitHub] [arrow] jp0317 commented on pull request #36192: PARQUET-2316: [C++] Allow partial PreBuffer in the parquet FileReader

via GitHub Tue, 27 Jun 2023 09:31:19 -0700


jp0317 commented on PR #36192:
URL: https://github.com/apache/arrow/pull/36192#issuecomment-1609860015


   > I don't see anything wrong with this. I do think it might be hard to use. 
How is a user going to know which row groups or columns they are supposed to 
prebuffer? Maybe this is for a situation where a user needs to read the same 
parquet file over and over again and they can highly tune this operation?
   
   Thanks for your review. I think users can learn each column chunk's size by 
reading the metadata and make their own choices. Repeated reading and tuning is 
one use case, but I think this PR is more useful for fine-grained reading 
at the individual column chunk level. 
   
   > Also, you may want to look into clearing the pre-buffer cache after a row 
group is read. That could help start releasing memory as the file is being read.
   
   Yes. Could clearing the pre-buffer cache be a separate improvement?
   
   > Would this support a use case allowing us to use prebuffering and still 
read one row group at a time like:
   >
   > PreBuffer row group 0 -> Read row group 0 -> PreBuffer row group 1 -> 
Read row group 1 -> ...
   
   Yes. In addition, I think the flow can be: Get column chunk sizes -> 
Determine target chunks to prebuffer -> Call PreBuffer -> Read row group -> ...
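   
   As a rough illustration of the "determine target chunks" step, here is a 
minimal, self-contained sketch. `ChunkInfo` and `SelectColumnsToPreBuffer` are 
hypothetical names, not part of the Parquet C++ API, and the byte-budget policy 
is only one possible choice. In a real reader the sizes would presumably come 
from the file metadata (for example `ColumnChunkMetaData::total_compressed_size()`), 
and the selected column indices would then be handed to `PreBuffer` for that 
row group.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical stand-in for the per-chunk size read from the file metadata.
struct ChunkInfo {
  int row_group;
  int column;
  int64_t compressed_bytes;
};

// Illustrative policy (an assumption, not the PR's behavior): walk the chunks
// of one row group in order and select every chunk that still fits in the
// remaining byte budget. Chunks that do not fit are skipped, not prebuffered.
std::vector<int> SelectColumnsToPreBuffer(const std::vector<ChunkInfo>& chunks,
                                          int row_group, int64_t byte_budget) {
  std::vector<int> selected;
  int64_t used = 0;
  for (const ChunkInfo& chunk : chunks) {
    if (chunk.row_group != row_group) continue;
    if (used + chunk.compressed_bytes > byte_budget) continue;
    used += chunk.compressed_bytes;
    selected.push_back(chunk.column);
  }
  return selected;
}
```

   Calling this once per row group, just before reading that row group, gives 
the interleaved "prebuffer, then read" pattern described above while keeping 
the amount of buffered data bounded by the chosen budget.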
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
