mderoy commented on issue #39899: URL: https://github.com/apache/arrow/issues/39899#issuecomment-1924538285
> Firstly I think bool_reader->ReadBatch is a bit dangerous for nullable values

I made the assumption that if `values_read == 0` then I've processed a null value for that batch, but I will look into the rep-level and def-level concepts you mention. I've not really tested with nulls yet. I'm not dealing with any complex types like struct/list/map in my parser, mostly the simple primitive types.

> 3s is so slow, would you mind tell the io pattern you're using? Actually the best pattern is send all io (if memory is enough) and waiting for them to finished, and read the file( or split the request by row-groups)

I got the best performance (the same as a local file) when I prebuffered all the row groups and columns I wanted to read and then called `WhenBuffered`. We have a good amount of memory available to us. Splitting the request by row groups would certainly help control memory, provided the writer of the file did not make them too large.

In my use case I have many processes, each processing its own file, so I do not want to parallelize reading each column with an individual thread. I want one CPU thread to handle the parsing of that one file. (I know the prebuffering is done by background threads, but ideally this would be done serially as well.)

-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
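On the nullable-value question in the comment above: as far as I understand the Parquet C++ reader, `ReadBatch` returns the number of levels read, reports only the non-null count in `values_read`, and packs the non-null values densely at the front of the output buffer. A batch can therefore mix nulls and values, and `values_read == 0` only covers the all-null case. The sketch below shows how definition levels could be used to place the packed values back into rows. It uses only the standard library; `ExpandNullable` is a hypothetical helper, not an Arrow function, and it assumes a flat (non-nested) column, where the maximum definition level is 1 for an optional column.

```cpp
#include <cstddef>
#include <cstdint>
#include <optional>
#include <stdexcept>
#include <vector>

// Hypothetical helper (not part of Arrow): expands the densely packed values
// produced by a ReadBatch-style call into one slot per row, using the
// definition levels to decide which rows are null.
// Assumption: a flat column, so a row is non-null exactly when its
// definition level equals max_def_level.
template <typename T>
std::vector<std::optional<T>> ExpandNullable(
    const std::vector<int16_t>& def_levels,  // one entry per row in the batch
    const std::vector<T>& packed_values,     // non-null values only, in row order
    int16_t max_def_level) {
  std::vector<std::optional<T>> rows;
  rows.reserve(def_levels.size());
  std::size_t next_value = 0;
  for (int16_t level : def_levels) {
    if (level == max_def_level) {
      if (next_value >= packed_values.size()) {
        throw std::runtime_error("fewer values than defined levels");
      }
      T value = packed_values[next_value++];
      rows.push_back(value);
    } else {
      rows.push_back(std::nullopt);  // level below max means the row is null
    }
  }
  if (next_value != packed_values.size()) {
    throw std::runtime_error("more values than defined levels");
  }
  return rows;
}
```

For example, definition levels `{1, 0, 1}` with packed values `{true, false}` expand to the rows `true`, null, `false`. Here the values-read count would be 2 even though the batch contains a null.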
