ozankabak commented on issue #3740:
URL: https://github.com/apache/arrow-rs/issues/3740#issuecomment-1444218489

   > I don't think this is avoidable, arrow is a columnar data format, it 
fundamentally assumes batching to amortise dispatch overheads. Row-based 
streaming would require a completely different architecture, likely using a JIT?
   
   @tustvold, I think there is some terminology-related confusion here w.r.t. 
batching. I am sure @metesynnada was not suggesting that we avoid batching 
entirely. I think what he envisions (though perhaps not conveyed clearly) is 
simply an API that operates with an async writer, so that non-IO operations 
can carry on while the actual write to the object store is taking place.
   
   The current API (i.e. the `put` function) is already `async` and, AFAICT, it 
performs the actual write in a separate thread. If that is indeed the case, it 
already does not block other non-IO operations. Given that we want to 
serialize synchronously for performance reasons, it doesn't really matter 
where we do it, so the API seems sufficient to me as is. I just had a 
discussion with @metesynnada on this; he seems to agree and can comment 
further if I'm missing something.
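   
   To make the pattern concrete, here is a rough sketch of "serialize 
synchronously, write on a separate thread". It uses only plain `std` threads 
and channels; `spawn_writer`, `serialize` and `run` are hypothetical names for 
illustration, and this is not how `object_store` is actually implemented.

```rust
use std::sync::mpsc;
use std::thread;

// Hypothetical stand-in for the object store sink; NOT the real `object_store` API.
// Returns a sender for serialized buffers and a handle yielding total bytes "written".
fn spawn_writer() -> (mpsc::Sender<Vec<u8>>, thread::JoinHandle<usize>) {
    let (tx, rx) = mpsc::channel::<Vec<u8>>();
    let handle = thread::spawn(move || {
        let mut written = 0;
        // The loop ends once every sender has been dropped.
        for buf in rx {
            // A real sink would perform the blocking write here.
            written += buf.len();
        }
        written
    });
    (tx, handle)
}

// Serialize a batch synchronously on the caller's thread (toy encoding).
fn serialize(batch: &[u32]) -> Vec<u8> {
    batch.iter().flat_map(|v| v.to_le_bytes()).collect()
}

fn run() -> (usize, u64) {
    let (tx, handle) = spawn_writer();
    let mut non_io_work = 0u64;
    for batch in [vec![1u32, 2, 3], vec![4, 5]] {
        // Hand the serialized bytes to the writer thread without waiting for the write.
        tx.send(serialize(&batch)).expect("writer thread is alive");
        // Non-IO work proceeds while the writer thread handles the buffer.
        non_io_work += batch.iter().map(|v| u64::from(*v)).sum::<u64>();
    }
    drop(tx); // close the channel so the writer loop terminates
    (handle.join().expect("writer thread panicked"), non_io_work)
}
```

   Here `run()` returns `(20, 15)`: five `u32` values serialized to 20 bytes on 
the writer thread, while the caller's own computation (the sum, 15) is never 
blocked on the write.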
   
   Since we are already analyzing this part of the code, one useful thing we 
can do is investigate whether it makes sense to avoid the new IO thread and 
use async primitives to do the actual writing within the same thread. I am not 
entirely sure what the advantages/disadvantages of doing that would be. 
@metesynnada can do some measurements to quantify this. Maybe you can share 
the reasoning behind the current choice?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.