wklimowicz opened a new issue, #47169:
URL: https://github.com/apache/arrow/issues/47169

   ### Describe the bug, including details regarding any error messages, version, and platform.
   
   When writing large list columns to parquet, arrow errors out with:
   
   ```
   Error: Capacity error: List array cannot contain more than 2147483646 elements, have 1200
   ```
   
   Reproducible example below; the error occurs with both the CRAN `arrow` version (20.0.0.2) and the current git version (21.0.0.9000).
   
   ```r
   library(tibble)
   library(arrow)
   
   rows <- 2e6L
   elements_each <- 1200L
   
   tbl <- tibble(
     id = seq_len(rows),
     b = replicate(rows, list(seq_len(elements_each)), simplify = FALSE)
   )
   
   write_parquet(tbl, "big_list.parquet")
   ```
   
   Actual behaviour: `Error: Capacity error: List array cannot contain more than 2147483646 elements, have 1200`.
   
   Expected behaviour: automatic chunking behind the scenes, or a suggestion of how the user should chunk manually.
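   
   For context, the limit appears to come from the 32-bit offsets Arrow uses for the default `list` type: what overflows is the total number of child elements across the whole column, not the per-row count (which makes the `have 1200` in the message confusing). Quick base-R arithmetic:
   
   ```r
   rows <- 2e6
   elements_each <- 1200
   
   # Total child elements across the entire list column:
   total <- rows * elements_each    # 2.4e9
   
   # Limit quoted in the error message (int32 offset capacity):
   limit <- 2^31 - 2                # 2147483646
   
   total > limit                    # TRUE: the column as a whole exceeds the limit
   ```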
   
   I think this is a similar bug to #10776, but it happens when writing rather than reading. I'm looking for clarity on whether this can be chunked automatically, in the spirit of the [vignette](https://arrow.apache.org/docs/r/articles/data_objects.html):
   
   > An important thing to note is that “chunking” is not semantically meaningful. It is an implementation detail only: users should never treat the chunk as a meaningful unit.
   
   Alternatively, a workaround would be good: I've tried a few with `write_dataset`, but I don't understand the internals well enough. Two approaches that didn't work (same error):
   
   ```r
   # Approach 1:
   # Group by + write_dataset
   tbl |>
     dplyr::group_by(id = id %% 10L) |> # Create many groups by ID
     write_dataset("big_list")
   
   # Approach 2:
   # max_rows...
   tbl |>
     write_dataset(
       "big_list",
       max_rows_per_file = 5000L,
       max_rows_per_group = 5000L
     )
   ```
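   
   A sketch of a manual workaround (untested at full scale; it assumes each slice keeps `slice_rows * elements_each` below 2^31 - 2): split the tibble into row slices in plain R, write each slice to its own parquet file, and read them back together with `open_dataset()`. Shown here with tiny sizes:
   
   ```r
   library(arrow)
   library(tibble)
   
   # Tiny sizes for illustration; at full scale choose slice_rows so that
   # slice_rows * elements_each stays below 2^31 - 2.
   rows <- 100L
   elements_each <- 12L
   slice_rows <- 25L
   
   tbl <- tibble(
     id = seq_len(rows),
     b = replicate(rows, list(seq_len(elements_each)), simplify = FALSE)
   )
   
   dir.create("big_list", showWarnings = FALSE)
   starts <- seq(1L, rows, by = slice_rows)
   for (i in seq_along(starts)) {
     slice <- tbl[starts[i]:min(starts[i] + slice_rows - 1L, rows), ]
     write_parquet(slice, file.path("big_list", sprintf("part-%03d.parquet", i)))
   }
   
   # Read everything back as a single logical table:
   # open_dataset("big_list") |> dplyr::collect()
   ```
   
   This only sidesteps the conversion of the full column to one Arrow list array, so it doesn't answer whether arrow could chunk automatically.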
   
   <details>
   
   <summary> session_info() </summary>
   
   ```
   ─ Session info ──────────────────────
    setting  value
    version  R version 4.5.0 (2025-04-11)
    os       Fedora Linux 42 (Workstation Edition)
    system   x86_64, linux-gnu
    ui       X11
    language (EN)
    collate  en_GB.UTF-8
    ctype    en_GB.UTF-8
    tz       Europe/London
    date     2025-07-22
    pandoc   3.1.11.1 @ /usr/bin/pandoc
    quarto   99.9.9 @ /home/wojtek/.local/bin/quarto
   
   ─ Packages ───────────────────────────
    package     * version     date (UTC) lib source
    arrow       * 21.0.0.9000 2025-07-22 [1] local
    assertthat    0.2.1       2019-03-21 [1] CRAN (R 4.5.0)
    bit           4.6.0       2025-03-06 [1] CRAN (R 4.5.0)
    bit64         4.6.0-1     2025-01-16 [1] CRAN (R 4.5.0)
    cli           3.6.5       2025-04-23 [1] CRAN (R 4.5.0)
    glue          1.8.0       2024-09-30 [1] CRAN (R 4.5.0)
    lifecycle     1.0.4       2023-11-07 [1] CRAN (R 4.5.0)
    magrittr      2.0.3       2022-03-30 [1] CRAN (R 4.5.0)
    pillar        1.11.0      2025-07-04 [1] CRAN (R 4.5.0)
    pkgconfig     2.0.3       2019-09-22 [1] CRAN (R 4.5.0)
    purrr         1.1.0       2025-07-10 [1] CRAN (R 4.5.0)
    R6            2.6.1       2025-02-15 [1] CRAN (R 4.5.0)
    rlang         1.1.6       2025-04-11 [1] CRAN (R 4.5.0)
    sessioninfo   1.2.3       2025-02-05 [1] CRAN (R 4.5.0)
    tibble      * 3.3.0       2025-06-08 [1] CRAN (R 4.5.0)
    tidyselect    1.2.1       2024-03-11 [1] CRAN (R 4.5.0)
    vctrs         0.6.5       2023-12-01 [1] CRAN (R 4.5.0)
   
    [1] /home/wojtek/.local/share/R/x86_64-pc-linux-gnu-library/4.5
    [2] /opt/R/4.5.0/lib64/R/library
    * ── Packages attached to the search path. 
   ```
   </details>
   
   
   ### Component(s)
   
   C++, R

