felipecrv commented on issue #39682:
URL: https://github.com/apache/arrow/issues/39682#issuecomment-1928634684

   > Capacity error: array cannot contain more than 2147483646 bytes, have 
2147489180
   
   `log(2147483646 + 2, 2)` is 31, so this is a check coming from 
`BaseBinaryBuilder<T>` which uses `sizeof<offset_type>` in the `memory_limit()` 
method. `FixedSizeBinaryBuilder` can produce the same message, but it always 
uses `int64_t` as these arrays don't need an offsets buffer.
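
   As a minimal sketch of where that number comes from, assume the limit is
   the largest value of the offset type minus one (this matches the figure in
   the error message, but it is an assumption, not a copy of the Arrow
   source). `MemoryLimitFor`, `FitsWithinLimit` and `DemoLimits` are
   hypothetical names used only for illustration:

   ```cpp
   #include <cassert>
   #include <cstdint>
   #include <limits>

   // Hypothetical helper: largest number of value bytes that can still be
   // addressed when offsets are stored as OffsetType (assumed: max() - 1).
   template <typename OffsetType>
   constexpr int64_t MemoryLimitFor() {
     return static_cast<int64_t>(std::numeric_limits<OffsetType>::max()) - 1;
   }

   // Hypothetical helper: true if total_bytes of string data stays within
   // the assumed limit for the given offset type.
   template <typename OffsetType>
   constexpr bool FitsWithinLimit(int64_t total_bytes) {
     return total_bytes <= MemoryLimitFor<OffsetType>();
   }

   // Usage: the sizes from the error message, checked against both widths.
   inline void DemoLimits() {
     assert(MemoryLimitFor<int32_t>() == 2147483646LL);  // limit in the error
     assert(!FitsWithinLimit<int32_t>(2147489180LL));    // 32-bit offsets fail
     assert(FitsWithinLimit<int64_t>(2147489180LL));     // 64-bit offsets fit
   }
   ```

   Under that assumption, only a builder with 32-bit offsets could have
   rejected 2147489180 bytes.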
   
   The regular string builder is based on `BaseBinaryBuilder<T>` and uses 
32-bit offsets, so all the concatenated strings must be addressable with 31 
bits (~2GB). But the bug report says the column type is `LargeString`, which 
uses 64-bit offsets, so this limit shouldn't apply.
   
   Digging into the Parquet reader code, [I 
find...](https://github.com/apache/arrow/blob/main/cpp/src/parquet/arrow/reader_internal.cc#L491)
 
   
   ```cpp
         // XXX: if a LargeBinary chunk is larger than 2GB, the MSBs of offsets
         // will be lost because they are first created as int32 and then cast
         // to int64.
   ```
   
   ...leading to a commit from November 2020:
   
   ```
   commit b4a0751effe316aee1a0fd80fb1c444ecd6842c5
   Author:     Antoine Pitrou <...>
   AuthorDate: Mon Nov 23 14:53:56 2020 +0100
   
       ARROW-10426: [C++] Allow writing large strings to Parquet
   
       Large strings are still read back as regular strings.
   
       Closes #8632 from pitrou/ARROW-10426-parquet-large-binary
   
       Authored-by: Antoine Pitrou <[email protected]>
   ```
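
   To make the `XXX` comment quoted above concrete, here is a small sketch of
   why widening to 64 bits afterwards cannot recover an offset that was first
   stored in 32 bits. It is not the reader's code: `NarrowThenWiden` is a
   hypothetical helper, and it uses an unsigned 32-bit intermediate so that
   the truncation is well-defined C++:

   ```cpp
   #include <cstdint>

   // Hypothetical helper: keep only the low 32 bits of an offset, then widen
   // the result back to 64 bits. The upper bits are gone after the first cast.
   inline int64_t NarrowThenWiden(int64_t true_offset) {
     const uint32_t narrow = static_cast<uint32_t>(true_offset);  // mod 2^32
     return static_cast<int64_t>(narrow);
   }
   ```

   For a small offset such as 1000 the round trip is lossless, but
   `NarrowThenWiden(5000000000LL)` yields 705032704 (5000000000 - 2^32), so
   the widened offset no longer points at the right byte.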
   
   This is from PR #8632, which fixed issue #26405 about *writing* 
LargeString into Parquet, but it doesn't fix reading them back. I suppose 
the value in this is that C++/R/Python scripts can produce files that the Java 
Parquet reader can read without problems.
   
   Next steps:
    - Change this issue to "[C++] Parquet reader is unable to read LargeString 
columns"
    - Get help/tips from @pitrou @mapleFU on how to fix this
    - Prepare a fix PR
   
   

