Re: [I] Performance reading S3 based files won't match localfilesystem even with large prebuffering. [arrow]

via GitHub Fri, 02 Feb 2024 11:25:03 -0800


mderoy commented on issue #39899:
URL: https://github.com/apache/arrow/issues/39899#issuecomment-1924538285


   > Firstly I think bool_reader->ReadBatch is a bit dangerous for nullable 
values
   
   I assumed that if values_read == 0, then I had processed a null 
value for that batch, but I will look into the rep-level and def-level 
concepts you mention. I have not really tested with nulls yet. I am not 
dealing with any complex types like struct/list/map in my parser, mostly the 
simple primitive types.
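   
   A minimal sketch of how definition levels line up with nulls for a flat 
nullable column. It assumes the buffers have the shape ReadBatch fills in: one 
definition level per row and a densely packed values array holding only the 
non-null entries. `Cell` and `ExpandNulls` are hypothetical names for 
illustration, not part of the Parquet API, and the sketch simulates the buffers 
rather than calling the reader.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical row representation: either null or a bool value.
struct Cell {
  bool is_null;
  bool value;  // only meaningful when is_null is false
};

// Hypothetical helper, not part of the Parquet API. Pairs the densely packed
// non-null values of one batch with the per-row definition levels.
// For a flat (non-nested) nullable column max_def_level is 1, so a level of 1
// means "value present" and a level of 0 means "null".
std::vector<Cell> ExpandNulls(const std::vector<int16_t>& def_levels,
                              const std::vector<bool>& values,
                              int16_t max_def_level) {
  std::vector<Cell> rows;
  rows.reserve(def_levels.size());
  std::size_t next_value = 0;
  for (std::size_t i = 0; i < def_levels.size(); ++i) {
    if (def_levels[i] == max_def_level) {
      // A defined row consumes the next packed value.
      assert(next_value < values.size());
      const bool v = values[next_value++];
      rows.push_back(Cell{false, v});
    } else {
      // A null row consumes no value at all.
      rows.push_back(Cell{true, false});
    }
  }
  // Every packed value must have been matched to a defined row.
  assert(next_value == values.size());
  return rows;
}
```

   With levels {1, 0, 0, 1} and values {true, false}, the rows come out as 
true, null, null, false. A batch can therefore mix nulls and values, so 
values_read == 0 only identifies a null when the batch size is 1.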
   
   > 3s is so slow, would you mind telling me the IO pattern you're using? Actually 
the best pattern is to send all IO requests (if memory is enough), wait for them to 
finish, and then read the file (or split the request by row groups)
   
   I got the best performance (the same as with a local file) when I prebuffered all the 
row groups and columns I wanted to read and then called WhenBuffered. We have a 
good amount of memory available to us. Splitting the request by row groups 
would certainly help control memory, provided the writer of the file did not 
make them too large. In my use case I have many processes, each processing its own 
file, so I do not want to parallelize reading each column with an individual 
thread. I want one CPU thread to handle the parsing of that one file. (I know 
the prebuffering is done by background threads, but ideally this would be 
done serially as well.)
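   
   One way to split the request by row groups while bounding memory, as a 
rough sketch: plan batches of consecutive row groups that fit a byte budget, 
then prebuffer and wait on one batch at a time. `PlanRowGroupBatches` and the 
budget are hypothetical; the per-row-group sizes are assumed to come from the 
file metadata.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper, not part of Arrow. Greedily groups consecutive
// row-group indices so that each batch stays within budget_bytes.
// A row group larger than the budget still gets a batch of its own,
// because it cannot be split at this level.
std::vector<std::vector<int>> PlanRowGroupBatches(
    const std::vector<int64_t>& row_group_bytes, int64_t budget_bytes) {
  assert(budget_bytes > 0);
  std::vector<std::vector<int>> batches;
  std::vector<int> current;
  int64_t current_bytes = 0;
  for (std::size_t i = 0; i < row_group_bytes.size(); ++i) {
    const int64_t size = row_group_bytes[i];
    // Close the current batch if adding this row group would exceed the budget.
    if (!current.empty() && current_bytes + size > budget_bytes) {
      batches.push_back(current);
      current.clear();
      current_bytes = 0;
    }
    current.push_back(static_cast<int>(i));
    current_bytes += size;
  }
  if (!current.empty()) {
    batches.push_back(current);
  }
  return batches;
}
```

   Each planned batch could then be prebuffered, waited on with WhenBuffered, 
and parsed on the single CPU thread before the next batch is requested. This 
trades some overlap of IO and parsing for a predictable memory ceiling.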


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
