wklimowicz opened a new issue, #47169: URL: https://github.com/apache/arrow/issues/47169
### Describe the bug, including details regarding any error messages, version, and platform.

When writing large list columns to parquet, arrow errors out with:

```
Error: Capacity error: List array cannot contain more than 2147483646 elements, have 1200
```

Reproducible example; the error occurs with both the CRAN `arrow` version (20.0.0.2) and the current git version (21.0.0.9000).

```r
library(tibble)
library(arrow)

rows <- 2e6L
elements_each <- 1200L

tbl <- tibble(
  id = seq_len(rows),
  b = replicate(rows, list(seq_len(elements_each)), simplify = FALSE)
)

write_parquet(tbl, "big_list.parquet")
```

Actual behaviour: `Error: Capacity error: List array cannot contain more than 2147483646 elements, have 1200`.

Expected behaviour: automatic chunking behind the scenes, or a suggestion of how the user should chunk manually.

I think this is a similar bug to #10776, but it happens when writing rather than reading. I'm looking for clarity on whether this can be chunked automatically, in the spirit of the [vignette](https://arrow.apache.org/docs/r/articles/data_objects.html):

> An important thing to note is that “chunking” is not semantically meaningful. It is an implementation detail only: users should never treat the chunk as a meaningful unit.

Alternatively, a workaround would be good. I've tried some with `write_dataset`, but I don't understand the internals well enough. Two things which didn't work (same error):

```r
# Approach 1:
# Group by + write_dataset
tbl |>
  dplyr::group_by(id = id %% 10L) |> # Create many groups by ID
  write_dataset("big_list")

# Approach 2:
# max_rows...
tbl |>
  write_dataset(
    "big_list",
    max_rows_per_file = 5000L,
    max_rows_per_group = 5000L
  )
```

<details>
<summary> session_info() </summary>

```
─ Session info ──────────────────────
 setting  value
 version  R version 4.5.0 (2025-04-11)
 os       Fedora Linux 42 (Workstation Edition)
 system   x86_64, linux-gnu
 ui       X11
 language (EN)
 collate  en_GB.UTF-8
 ctype    en_GB.UTF-8
 tz       Europe/London
 date     2025-07-22
 pandoc   3.1.11.1 @ /usr/bin/pandoc
 quarto   99.9.9 @ /home/wojtek/.local/bin/quarto

─ Packages ───────────────────────────
 package     * version     date (UTC) lib source
 arrow       * 21.0.0.9000 2025-07-22 [1] local
 assertthat    0.2.1       2019-03-21 [1] CRAN (R 4.5.0)
 bit           4.6.0       2025-03-06 [1] CRAN (R 4.5.0)
 bit64         4.6.0-1     2025-01-16 [1] CRAN (R 4.5.0)
 cli           3.6.5       2025-04-23 [1] CRAN (R 4.5.0)
 glue          1.8.0       2024-09-30 [1] CRAN (R 4.5.0)
 lifecycle     1.0.4       2023-11-07 [1] CRAN (R 4.5.0)
 magrittr      2.0.3       2022-03-30 [1] CRAN (R 4.5.0)
 pillar        1.11.0      2025-07-04 [1] CRAN (R 4.5.0)
 pkgconfig     2.0.3       2019-09-22 [1] CRAN (R 4.5.0)
 purrr         1.1.0       2025-07-10 [1] CRAN (R 4.5.0)
 R6            2.6.1       2025-02-15 [1] CRAN (R 4.5.0)
 rlang         1.1.6       2025-04-11 [1] CRAN (R 4.5.0)
 sessioninfo   1.2.3       2025-02-05 [1] CRAN (R 4.5.0)
 tibble      * 3.3.0       2025-06-08 [1] CRAN (R 4.5.0)
 tidyselect    1.2.1       2024-03-11 [1] CRAN (R 4.5.0)
 vctrs         0.6.5       2023-12-01 [1] CRAN (R 4.5.0)

 [1] /home/wojtek/.local/share/R/x86_64-pc-linux-gnu-library/4.5
 [2] /opt/R/4.5.0/lib64/R/library
 * ── Packages attached to the search path.
```

</details>

### Component(s)

C++, R
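For reference, the sizes in the reproducible example can be checked with plain arithmetic. This is a minimal sketch in base R, assuming (the error message does not say so explicitly) that the limit applies to the total number of list values held in one array rather than to any single row. `chunk_ids()` is a hypothetical helper introduced only for illustration; it is not part of `arrow`.

```r
# Sizes from the reproducible example, stored as doubles because
# 2e6L * 1200L would overflow R's 32-bit integer type.
rows <- 2e6
elements_each <- 1200

# Limit quoted in the error message.
max_list_elements <- 2147483646

# Total list values across all rows: 2.4e9, which exceeds the limit.
total_elements <- rows * elements_each
exceeds_limit <- total_elements > max_list_elements

# Largest row count whose list values stay under the limit (assumption:
# every row holds exactly `elements_each` values).
max_rows_per_chunk <- floor(max_list_elements / elements_each)
n_chunks <- ceiling(rows / max_rows_per_chunk)

# Hypothetical helper: label each row with the chunk it would fall into.
chunk_ids <- function(n_rows, chunk_rows) {
  ceiling(seq_len(n_rows) / chunk_rows)
}

# Small demonstration: five rows, at most two rows per chunk.
chunk_ids(5, 2)
```

Under these assumptions the example needs at least two row-wise chunks. Whether splitting the tibble along such chunk IDs in R before handing each piece to `write_parquet()` avoids the error has not been verified here.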