Anthony Abate created ARROW-7511:
------------------------------------

             Summary: [C#] - Batch / Data Size Can't Exceed 2 gigs
                 Key: ARROW-7511
                 URL: https://issues.apache.org/jira/browse/ARROW-7511
             Project: Apache Arrow
          Issue Type: Bug
          Components: C#
    Affects Versions: 0.15.1
            Reporter: Anthony Abate


While the Arrow spec does not forbid batches larger than 2 gigs, the C# library 
cannot support this in its current form due to limits on managed memory, as it 
tries to put the whole batch into a single Span<byte>/Memory<byte>, whose 
length is an int and so tops out at 2,147,483,647 bytes.

It is possible to fix this by not trying to use Memory/Span/byte[] for the 
entire batch, and instead moving the memory mapping to the ArrowBuffers. This 
only moves the problem 'lower', as it would then still limit the data of a 
single column in a single batch to 2 gigs.

This seems like plenty of memory, but consider string columns: the data is 
just one giant string of all the values appended together, plus a buffer of 
offsets, and it can get very large quickly.
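
To make the string column point concrete, here is a rough sketch of the layout 
and the arithmetic. It is written in Python purely for brevity and is not the 
C# library's code; build_string_column and data_buffer_size are hypothetical 
helper names, and the row count and average value size are made-up numbers.

```python
# Illustrative sketch only: a variable-length string column is stored as one
# concatenated data buffer plus an offsets buffer marking where each value ends.
INT32_MAX = 2**31 - 1  # 2,147,483,647: the largest length an int can express


def build_string_column(values):
    """Return (offsets, data) for a list of strings, UTF-8 encoded."""
    offsets = [0]
    data = bytearray()
    for value in values:
        data += value.encode("utf-8")
        offsets.append(len(data))  # end of this value == start of the next
    return offsets, bytes(data)


def data_buffer_size(row_count, avg_bytes_per_value):
    """Estimate the size in bytes of the concatenated data buffer."""
    return row_count * avg_bytes_per_value


offsets, data = build_string_column(["foo", "ba", ""])
print(offsets, data)  # [0, 3, 5, 5] b'fooba'

# Assumed workload: 50 million rows averaging 50 bytes per string.
size = data_buffer_size(50_000_000, 50)
print(size, size > INT32_MAX)  # 2500000000 True
```

With those assumed numbers, the data buffer of a single string column in a 
single batch is already past what one Span<byte> can address.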

I think the unfortunate problem is that memory management in the C# managed 
world is always going to hit the 2 gig limit somewhere. (please correct me if I 
am wrong on this statement)

That ultimately means the C# library either has to reject files with certain 
characteristics (i.e. validation checks on opening), or the spec needs to put 
upper limits on certain internal Arrow constructs (i.e. the Arrow buffer) to 
eliminate the need for more than 2 gigs of contiguous memory for the smallest 
Arrow object.
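
As a sketch of what the 'validation checks on opening' option might look like 
(again in Python for brevity, not the C# library's API): 
find_oversized_buffers is a hypothetical helper, and the buffer names and 
lengths are invented, assuming the lengths can be read from the batch metadata 
before any memory is mapped.

```python
# Illustrative sketch only: reject a batch up front if any single buffer is
# longer than the largest length a single int-indexed span can address.
INT32_MAX = 2**31 - 1  # 2,147,483,647


def find_oversized_buffers(buffer_lengths, limit=INT32_MAX):
    """Return the names of buffers whose byte length exceeds `limit`."""
    return [name for name, length in buffer_lengths.items() if length > limit]


# Assumed lengths for one string column of 50 million rows.
buffers = {
    "validity": 6_250_000,    # 1 bit per row
    "offsets": 200_000_004,   # (rows + 1) 32-bit offsets
    "data": 2_500_000_000,    # concatenated string bytes
}

oversized = find_oversized_buffers(buffers)
if oversized:
    print("rejecting batch; buffers over limit:", oversized)
```

A check like this only makes the failure explicit; it does not let the library 
read such a file.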

However, if the spec was indeed designed for the smallest buffer object to be 
larger than 2 gigs, or for the entire memory buffer of Arrow to be contiguous, 
one has to wonder if at some point it might just make sense for the C# library 
to use the C++ library as its memory manager, as replicating very large blocks 
of memory is more work than it's worth.

In any case, this issue is more about 'deferring' the 2 gig size problem by 
moving it down to the buffer objects. This might require some rewrite of the 
batch data structures.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
