[
https://issues.apache.org/jira/browse/ARROW-7511?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Anthony Abate updated ARROW-7511:
---------------------------------
Description:
While the Arrow spec does not forbid batches larger than 2 GB, the C# library
cannot support them in its current form due to limits on managed memory: it
tries to put the whole batch into a single Span<byte>/Memory<byte>.
It is possible to fix this by not using Memory/Span/byte[] for the entire
batch and instead moving the memory mapping down to the ArrowBuffers. This
only moves the problem 'lower', as it would then still limit the data for a
single column in a batch to 2 GB.
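To make the proposed direction concrete, here is a minimal sketch (hypothetical `ReadBuffers` helper, not the Arrow C# API) of reading each buffer into its own allocation instead of slicing one batch-wide Memory<byte>, so the 2 GB ceiling applies per buffer rather than per batch:

```csharp
using System;
using System.Collections.Generic;
using System.IO;

class PerBufferReadSketch
{
    // bufferLengths would come from the file's metadata. Each buffer gets
    // its own byte[], so no single allocation has to span the whole batch.
    public static List<byte[]> ReadBuffers(Stream source, long[] bufferLengths)
    {
        var buffers = new List<byte[]>();
        foreach (long length in bufferLengths)
        {
            // The limit is now per buffer instead of per batch, but it is
            // still there: a managed array cannot index past int.MaxValue.
            if (length > int.MaxValue)
                throw new NotSupportedException(
                    "A single buffer still cannot exceed 2 GB of managed memory.");

            var buffer = new byte[length];
            int read = source.Read(buffer, 0, (int)length); // sketch: assumes a full read
            buffers.Add(buffer);
        }
        return buffers;
    }
}
```

This defers, rather than removes, the limit: any individual buffer over 2 GB still fails, which is exactly the point made below about column data.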
That seems like plenty of memory... but consider string columns: the data is
one giant string of all the values appended together, indexed by offsets, and
it can get very large quickly.
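To illustrate why string columns hit the limit first, here is a minimal sketch (illustrative only, not the Arrow C# implementation) of the offsets-plus-data layout, where the whole column shares a single concatenated buffer:

```csharp
using System;
using System.IO;
using System.Text;

class StringColumnSketch
{
    static void Main()
    {
        string[] values = { "foo", "bar", "baz" };

        // offsets[i]..offsets[i+1] delimits value i in the shared data buffer.
        int[] offsets = new int[values.Length + 1];
        var data = new MemoryStream();
        for (int i = 0; i < values.Length; i++)
        {
            byte[] bytes = Encoding.UTF8.GetBytes(values[i]);
            data.Write(bytes, 0, bytes.Length);
            offsets[i + 1] = offsets[i] + bytes.Length;
        }

        // Because every value lives in one buffer addressed by int offsets,
        // the column's total UTF-8 size is bounded by int.MaxValue (~2 GB).
        Console.WriteLine(offsets[values.Length]); // total bytes: 9
    }
}
```

The key point is that the per-value strings can be tiny while the shared data buffer still approaches 2 GB.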
I think the unfortunate problem is that memory management in the managed C#
world is always going to hit the 2 GB limit somewhere. (Please correct me if I
am wrong on this, but I thought I read somewhere that Memory<T>/Span<T> are
limited to int lengths, and that changing them to long would require major
framework rewrites - but I may be conflating that with arrays.)
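For what it's worth, the int-based length is visible directly in the API surface; a small sketch (facts I believe are correct for .NET, stated here as my understanding):

```csharp
using System;

class SpanLimitSketch
{
    static void Main()
    {
        // Span<T>.Length is declared as int; there is no long-based
        // length or indexer, so a single contiguous span tops out near
        // int.MaxValue elements (~2 GB for a Span<byte>).
        Span<byte> span = new byte[16];
        int length = span.Length; // int, not long

        // A request beyond int.MaxValue elements cannot even be expressed
        // through the array/Span APIs without splitting into chunks.
        // (gcAllowVeryLargeObjects lifts the 2 GB total-object-size cap for
        // arrays, but element counts remain int-bound.)
        Console.WriteLine(length);       // 16
        Console.WriteLine(int.MaxValue); // 2147483647
    }
}
```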
That ultimately means the C# library either has to reject files with certain
characteristics (i.e. validation checks on opening), or the spec needs to put
upper limits on certain internal Arrow constructs (i.e. the Arrow buffer) to
eliminate the need for more than 2 GB of contiguous memory for the smallest
Arrow object.
However, if the spec was indeed designed to allow the smallest buffer object
to be larger than 2 GB, or for the entire Arrow memory buffer to be
contiguous, one has to wonder whether at some point it might just make sense
for the C# library to use the C++ library as its memory manager, since
replicating very large blocks of memory is more work than it's worth.
In any case, this issue is mostly about 'deferring' the 2 GB size problem by
moving it down to the buffer objects... This might require some rewriting of
the batch data structures.
> [C#] - Batch / Data Size Can't Exceed 2 gigs
> --------------------------------------------
>
> Key: ARROW-7511
> URL: https://issues.apache.org/jira/browse/ARROW-7511
> Project: Apache Arrow
> Issue Type: Bug
> Components: C#
> Affects Versions: 0.15.1
> Reporter: Anthony Abate
> Priority: Major
>
--
This message was sent by Atlassian Jira
(v8.3.4#803005)