[
https://issues.apache.org/jira/browse/ARROW-7511?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Anthony Abate updated ARROW-7511:
---------------------------------
Description:
While the Arrow spec does not forbid batches larger than 2 GB, the C# library
cannot support them in its current form due to limits on managed memory: it
tries to put the whole batch into a single Span<byte>/Memory<byte>.
It is possible to fix this by not using Memory/Span/byte[] for the entire
batch and instead moving the memory mapping down to the ArrowBuffers. This
only moves the problem 'lower', as it would then still limit the data for a
single column in a batch to 2 GB.
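To make the proposed direction concrete, here is a minimal sketch (hypothetical `ReadBuffers` helper, not the Arrow C# API) of reading each buffer into its own allocation instead of slicing one batch-wide Memory<byte>, so the 2 GB ceiling applies per buffer rather than per batch:

```csharp
using System;
using System.Collections.Generic;
using System.IO;

class PerBufferReadSketch
{
    // bufferLengths would come from the file's metadata. Each buffer gets
    // its own byte[], so no single allocation has to span the whole batch.
    public static List<byte[]> ReadBuffers(Stream source, long[] bufferLengths)
    {
        var buffers = new List<byte[]>();
        foreach (long length in bufferLengths)
        {
            // The limit is now per buffer instead of per batch, but it is
            // still there: a managed array cannot index past int.MaxValue.
            if (length > int.MaxValue)
                throw new NotSupportedException(
                    "A single buffer still cannot exceed 2 GB of managed memory.");

            var buffer = new byte[length];
            int read = source.Read(buffer, 0, (int)length); // sketch: assumes a full read
            buffers.Add(buffer);
        }
        return buffers;
    }
}
```

This defers, rather than removes, the limit: any individual buffer over 2 GB still fails, which is exactly the point made below about column data.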
That seems like plenty of memory... but consider string columns: the data is
one giant string of all the values appended together, indexed by offsets, and
it can get very large quickly.
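To illustrate why string columns hit the limit first, here is a minimal sketch (illustrative only, not the Arrow C# implementation) of the offsets-plus-data layout, where the whole column shares a single concatenated buffer:

```csharp
using System;
using System.IO;
using System.Text;

class StringColumnSketch
{
    static void Main()
    {
        string[] values = { "foo", "bar", "baz" };

        // offsets[i]..offsets[i+1] delimits value i in the shared data buffer.
        int[] offsets = new int[values.Length + 1];
        var data = new MemoryStream();
        for (int i = 0; i < values.Length; i++)
        {
            byte[] bytes = Encoding.UTF8.GetBytes(values[i]);
            data.Write(bytes, 0, bytes.Length);
            offsets[i + 1] = offsets[i] + bytes.Length;
        }

        // Because every value lives in one buffer addressed by int offsets,
        // the column's total UTF-8 size is bounded by int.MaxValue (~2 GB).
        Console.WriteLine(offsets[values.Length]); // total bytes: 9
    }
}
```

The key point is that the per-value strings can be tiny while the shared data buffer still approaches 2 GB.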
I think the unfortunate problem is that memory management in the managed C#
world is always going to hit the 2 GB limit somewhere. (Please correct me if I
am wrong on this, but I thought I read somewhere that Memory<T>/Span<T> are
limited to int lengths, and that changing them to long would require major
framework rewrites - but I may be conflating that with arrays.)
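For what it's worth, the int-based length is visible directly in the API surface; a small sketch (facts I believe are correct for .NET, stated here as my understanding):

```csharp
using System;

class SpanLimitSketch
{
    static void Main()
    {
        // Span<T>.Length is declared as int; there is no long-based
        // length or indexer, so a single contiguous span tops out near
        // int.MaxValue elements (~2 GB for a Span<byte>).
        Span<byte> span = new byte[16];
        int length = span.Length; // int, not long

        // A request beyond int.MaxValue elements cannot even be expressed
        // through the array/Span APIs without splitting into chunks.
        // (gcAllowVeryLargeObjects lifts the 2 GB total-object-size cap for
        // arrays, but element counts remain int-bound.)
        Console.WriteLine(length);       // 16
        Console.WriteLine(int.MaxValue); // 2147483647
    }
}
```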
That ultimately means the C# library either has to reject files with certain
characteristics (i.e. validation checks on opening), or the spec needs to put
upper limits on certain internal Arrow constructs (i.e. the Arrow buffer) to
eliminate the need for more than 2 GB of contiguous memory for the smallest
Arrow object.
However, if the spec was indeed designed to allow the smallest buffer object
to be larger than 2 GB, or for the entire Arrow memory buffer to be
contiguous, one has to wonder whether at some point it might just make sense
for the C# library to use the C++ library as its memory manager, since
replicating very large blocks of memory is more work than it's worth.
In any case, this issue is mostly about 'deferring' the 2 GB size problem by
moving it down to the buffer objects... This might require some rewriting of
the batch data structures.
> [C#] - Batch / Data Size Can't Exceed 2 gigs
> --------------------------------------------
>
> Key: ARROW-7511
> URL: https://issues.apache.org/jira/browse/ARROW-7511
> Project: Apache Arrow
> Issue Type: Bug
> Components: C#
> Affects Versions: 0.15.1
> Reporter: Anthony Abate
> Priority: Major
>
--
This message was sent by Atlassian Jira
(v8.3.4#803005)