slobodan-ilic opened a new issue, #38513:
URL: https://github.com/apache/arrow/issues/38513
### Describe the bug, including details regarding any error messages, version, and platform.
This bug was discovered while using pyarrow to write real-life survey data (from our custom database, zz9, at Crunch.io) to Parquet files. The data can be created as expected and written to a Parquet file. However, reading the data back from the file into a table (with `pq.read_table`) raises an unexpected error. Furthermore, whether the error appears depends on the size of the data, but its wording suggests a nested-types problem rather than a size issue.

The encountered error is:
```
Traceback (most recent call last):
  File "/Users/slobodanilic/Documents/Crunch/repos/crunch-lake/testmap.py", line 61, in <module>
    loaded_map_array = pq.read_table("test.parquet").column(0)
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/slobodanilic/Documents/Crunch/repos/crunch-lake/venv/lib/python3.11/site-packages/pyarrow/parquet/core.py", line 3002, in read_table
    return dataset.read(columns=columns, use_threads=use_threads,
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/slobodanilic/Documents/Crunch/repos/crunch-lake/venv/lib/python3.11/site-packages/pyarrow/parquet/core.py", line 2630, in read
    table = self._dataset.to_table(
            ^^^^^^^^^^^^^^^^^^^^^^^
  File "pyarrow/_dataset.pyx", line 556, in pyarrow._dataset.Dataset.to_table
  File "pyarrow/_dataset.pyx", line 3638, in pyarrow._dataset.Scanner.to_table
  File "pyarrow/error.pxi", line 144, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 121, in pyarrow.lib.check_status
pyarrow.lib.ArrowNotImplementedError: Nested data conversions not implemented for chunked array outputs
```
The error can be reproduced with the code below. Depending on the number of rows, the error may or may not appear; running the code as is (with 200K rows) should produce it. Here's the code for convenience:
```python
"""
Test writing/reading a large map array in chunks.

This example demonstrates an issue when trying to encode real-life survey
data results into a map-array structure in pyarrow and saving it into a
parquet file. Reading it back raises an error:
`Nested data conversions not implemented for chunked array outputs`.
"""
from typing import List

import numpy as np
from numpy import ndarray
import pyarrow as pa
import pyarrow.parquet as pq

# Parameters
N_ROWS: int = 200000  # changing this to 100K will make the example work
N_COLS: int = 600
SPARSITY: float = 0.5
CHUNK_SIZE: int = 10000

# Calculate sparsity-affected column size
N_COLS_W_VALUES: int = int(N_COLS * SPARSITY)

# Generate "column" names (or keys in MapArray context)
subrefs: List[str] = [
    f"really_really_really_long_column_name_for_a_subreference_{i}"
    for i in range(N_COLS)
]

# Generate an index array for column names
all_subrefs_inds: ndarray = np.arange(N_COLS)

# Generate actual data (random indices) for each row/column combination
subvar_indexes: ndarray = np.array(
    [
        np.random.choice(all_subrefs_inds, size=N_COLS_W_VALUES, replace=False)
        for _ in range(N_ROWS)
    ]
).ravel()

# Generate random values between 1 and 10 for each row/column combination
values: ndarray = np.random.randint(1, 11, size=(N_ROWS, N_COLS_W_VALUES)).ravel()

# Generate offsets for each row
offsets: ndarray = np.linspace(0, N_ROWS * N_COLS_W_VALUES, N_ROWS + 1, dtype=int)

# Create DictionaryArray for keys and MapArray for the map structure
keys = pa.DictionaryArray.from_arrays(pa.array(subvar_indexes), subrefs)
map_array = pa.chunked_array(
    [
        pa.MapArray.from_arrays(offsets[i : i + CHUNK_SIZE + 1], keys, pa.array(values))
        for i in range(0, len(offsets) - 1, CHUNK_SIZE)
    ]
)

# Write and read the table
print("Writing table")
tbl = pa.Table.from_arrays([map_array], names=["map_array"])
pq.write_table(tbl, "test.parquet")

print("Reading table")
loaded_map_array = pq.read_table("test.parquet").column(0)
print("Successfully read the table from parquet and loaded into pyarrow.")
```
### Component(s)
Parquet, Python
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.