felipecrv commented on issue #39682: URL: https://github.com/apache/arrow/issues/39682#issuecomment-1928634684
> Capacity error: array cannot contain more than 2147483646 bytes, have 2147489180

`log(2147483646 + 2, 2)` is 31, so this check comes from `BaseBinaryBuilder<T>`, which derives the limit from `offset_type` in its `memory_limit()` method. `FixedSizeBinaryBuilder` can produce the same message, but it always uses `int64_t` because these arrays don't need an offsets buffer.

The regular string builder is based on `BaseBinaryBuilder<T>` and uses 32-bit offsets, so all the concatenated strings must be addressable with 31 bits (~2GB). But the bug report says the column type is `LargeString`, which uses 64-bit offsets! It shouldn't be a problem.

Digging into the Parquet reader code, [I find...](https://github.com/apache/arrow/blob/main/cpp/src/parquet/arrow/reader_internal.cc#L491)

```cpp
// XXX: if a LargeBinary chunk is larger than 2GB, the MSBs of offsets
// will be lost because they are first created as int32 and then cast to int64.
```

...leading to a commit from November 2020:

```
commit b4a0751effe316aee1a0fd80fb1c444ecd6842c5
Author: Antoine Pitrou <...>
AuthorDate: Mon Nov 23 14:53:56 2020 +0100

    ARROW-10426: [C++] Allow writing large strings to Parquet

    Large strings are still read back as regular strings.

    Closes #8632 from pitrou/ARROW-10426-parquet-large-binary

    Authored-by: Antoine Pitrou <[email protected]>
```

This is from PR #8632, which fixed Issue #26405. That issue is about *writing* LargeString to Parquet; the PR doesn't fix reading them back. I suppose the value in this is that C++/R/Python scripts can produce files that the Java Parquet reader can read without problems.

Next steps:

- Change this issue to "[C++] Parquet reader is unable to read LargeString columns"
- Get help/tips from @pitrou @mapleFU on how to fix this
- Prepare a fix PR
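To make the arithmetic above concrete, here is a minimal standalone sketch. It is not Arrow's code: `MemoryLimitFor`, `NarrowToInt32ThenWiden` and `BuildOffsets` are hypothetical helpers, and the sketch assumes the limit is the offset type's maximum value minus one, which is consistent with the 2147483646 in the error message. It also emulates what happens when offsets are first stored in 32 bits and only then widened to 64 bits.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>
#include <vector>

// Hypothetical helper (not Arrow's API): the byte limit implied by an offset
// type, ASSUMING the limit is "max value of the offset type, minus one".
template <typename OffsetType>
constexpr int64_t MemoryLimitFor() {
  return static_cast<int64_t>(std::numeric_limits<OffsetType>::max()) - 1;
}

constexpr int64_t kTwoTo31 = int64_t{1} << 31;
constexpr int64_t kTwoTo32 = int64_t{1} << 32;

// With 32-bit offsets the limit is 2^31 - 2, the number in the error message.
static_assert(MemoryLimitFor<int32_t>() == 2147483646, "limit is 2^31 - 2");
static_assert(MemoryLimitFor<int32_t>() + 2 == kTwoTo31, "log2(limit + 2) == 31");
// With 64-bit offsets the reported 2147489180 bytes would fit easily.
static_assert(MemoryLimitFor<int64_t>() > 2147489180LL, "fits in 64-bit offsets");

// Emulates storing a non-negative 64-bit value in an int32 slot (two's
// complement wraparound) and then casting it back to int64.
constexpr int64_t NarrowToInt32ThenWiden(int64_t value) {
  return (value & (kTwoTo32 - 1)) >= kTwoTo31
             ? (value & (kTwoTo32 - 1)) - kTwoTo32
             : (value & (kTwoTo32 - 1));
}

// Hypothetical helper: builds an offsets array from value lengths. When
// `via_int32` is true, every offset passes through a 32-bit slot first.
inline std::vector<int64_t> BuildOffsets(const std::vector<int64_t>& lengths,
                                         bool via_int32) {
  std::vector<int64_t> offsets;
  offsets.reserve(lengths.size() + 1);
  int64_t total = 0;
  offsets.push_back(0);
  for (int64_t len : lengths) {
    assert(len >= 0);
    total += len;
    offsets.push_back(via_int32 ? NarrowToInt32ThenWiden(total) : total);
  }
  return offsets;
}
```

For example, with value lengths `{1073741824, 1073741824, 5532}` (2147489180 bytes in total, as in the report), `BuildOffsets(lengths, false)` ends at 2147489180, while `BuildOffsets(lengths, true)` ends at a negative offset because the high bits were already gone before the widening cast.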
