slobodan-ilic opened a new issue, #38513:
URL: https://github.com/apache/arrow/issues/38513
### Describe the bug, including details regarding any error messages, version, and platform.
This bug was discovered while using pyarrow to write real-life survey data (from our custom database, zz9, at Crunch.io) to Parquet files. The data can be created as expected and written to a Parquet file. However, reading the data back from the file into a table (with `pq.read_table`) raises an unexpected error. Furthermore, whether the error appears depends on the size of the data, but its wording suggests a nested-types problem rather than a size issue.

The encountered error is:
```
Traceback (most recent call last):
  File "/Users/slobodanilic/Documents/Crunch/repos/crunch-lake/testmap.py", line 61, in <module>
    loaded_map_array = pq.read_table("test.parquet").column(0)
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/slobodanilic/Documents/Crunch/repos/crunch-lake/venv/lib/python3.11/site-packages/pyarrow/parquet/core.py", line 3002, in read_table
    return dataset.read(columns=columns, use_threads=use_threads,
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/slobodanilic/Documents/Crunch/repos/crunch-lake/venv/lib/python3.11/site-packages/pyarrow/parquet/core.py", line 2630, in read
    table = self._dataset.to_table(
            ^^^^^^^^^^^^^^^^^^^^^^^
  File "pyarrow/_dataset.pyx", line 556, in pyarrow._dataset.Dataset.to_table
  File "pyarrow/_dataset.pyx", line 3638, in pyarrow._dataset.Scanner.to_table
  File "pyarrow/error.pxi", line 144, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 121, in pyarrow.lib.check_status
pyarrow.lib.ArrowNotImplementedError: Nested data conversions not implemented for chunked array outputs
```
The error can be reproduced with the code below. Depending on the number of rows, the error may or may not appear; running the code as is (with 200K rows) should produce it. Here's the code for convenience:
```python
"""
Test writing/reading a large map array in chunks.

This example demonstrates an issue when trying to encode real-life survey
data results into a map-array structure in pyarrow and saving it into a
parquet file. Reading it back raises an error:
`Nested data conversions not implemented for chunked array outputs`.
"""
from typing import List

import numpy as np
from numpy import ndarray
import pyarrow as pa
import pyarrow.parquet as pq

# Parameters
N_ROWS: int = 200000  # changing this to 100K will make the example work
N_COLS: int = 600
SPARSITY: float = 0.5
CHUNK_SIZE: int = 10000

# Calculate sparsity-affected column size
N_COLS_W_VALUES: int = int(N_COLS * SPARSITY)

# Generate "column" names (or keys in MapArray context)
subrefs: List[str] = [
    f"really_really_really_long_column_name_for_a_subreference_{i}"
    for i in range(N_COLS)
]

# Generate an index array for column names
all_subrefs_inds: ndarray = np.arange(N_COLS)

# Generate actual data (random indices) for each row/column combination
subvar_indexes: ndarray = np.array(
    [
        np.random.choice(all_subrefs_inds, size=N_COLS_W_VALUES, replace=False)
        for _ in range(N_ROWS)
    ]
).ravel()

# Generate random values between 1 and 10 for each row/column combination
values: ndarray = np.random.randint(1, 11, size=(N_ROWS, N_COLS_W_VALUES)).ravel()

# Generate offsets for each row
offsets: ndarray = np.linspace(0, N_ROWS * N_COLS_W_VALUES, N_ROWS + 1, dtype=int)

# Create DictionaryArray for keys and MapArray for the map structure
keys = pa.DictionaryArray.from_arrays(pa.array(subvar_indexes), subrefs)
map_array = pa.chunked_array(
    [
        pa.MapArray.from_arrays(offsets[i : i + CHUNK_SIZE + 1], keys, pa.array(values))
        for i in range(0, len(offsets) - 1, CHUNK_SIZE)
    ]
)

# Write and read the table
print("Writing table")
tbl = pa.Table.from_arrays([map_array], names=["map_array"])
pq.write_table(tbl, "test.parquet")

print("Reading table")
loaded_map_array = pq.read_table("test.parquet").column(0)
print("Successfully read the table from parquet and loaded into pyarrow.")
```
### Component(s)
Parquet, Python
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.