Aryik commented on issue #44164:
URL: https://github.com/apache/arrow/issues/44164#issuecomment-2364749362
After trying some variations, I cannot create a reproducible example, and I'm still struggling to explain the behavior I saw with our production data.
Here's my attempt at a repro, which passes without raising an exception:
```python
import string
import numpy as np
import pyarrow as pa
data = [
    "".join(np.random.choice(list(string.ascii_letters), n))
    for n in np.random.randint(10, 500, size=10_000)
]
nested_table = pa.Table.from_pydict(
    {"nested": [{"string_field_1": s, "string_field_2": s} for s in data * 100]}
)
unnested_table = pa.Table.from_pydict({"string_field": data * 500})
print(
    "Trying to combine chunks of unnested table with nbytes "
    f"{unnested_table['string_field'].nbytes}"
)
unnested_table["string_field"].combine_chunks()
print(
    "Trying to combine chunks of nested table with nbytes "
    f"{nested_table['nested'].nbytes}"
)
nested_table["nested"].combine_chunks()
```
In production, we have this function:
```python
from typing import Any, Dict, List, cast

import polars as pl
import pyarrow as pa

def dataframe_from_dicts(
    dicts: List[Dict[str, Any]],
    schema: HtSchemaDict | List[str],  # HtSchemaDict is an internal type alias
) -> pl.DataFrame:
    """Convert a list of dictionaries into a DataFrame with the specified schema.

    Args:
        dicts (List[dict]): The list of dictionaries to convert to a DataFrame.
        schema (HtSchemaDict or List[str]): The schema of the DataFrame.
            This supports either a dictionary of the schema, or a list of the
            schema keys if we don't know all the schema values and the values
            should be inferred.

    Returns:
        pl.DataFrame: The DataFrame.
    """
    schema_keys = list(schema.keys() if isinstance(schema, dict) else schema)
    # Normalize each row, providing a None value for any missing key so the
    # schema will fill in the missing values with nulls in our model
    # transformations.
    normalized_data = [
        {key: row.get(key, None) for key in schema_keys} for row in dicts
    ]
    # Convert to an Arrow Table.
    arrow_table = pa.Table.from_pydict(
        {key: [row[key] for row in normalized_data] for key in schema_keys}
    )
    # Provide chunks of the Arrow Table to Polars. If we have too much data in
    # a single chunk, we get these strange errors:
    #   pyarrow.lib.ArrowInvalid: offset overflow while concatenating arrays
    arrow_chunks = arrow_table.to_batches(max_chunksize=10000)
    # Create a Polars DataFrame from the Arrow record batches.
    return cast(
        pl.DataFrame,
        # Allow passing in an empty schema so that it can be inferred.
        pl.from_arrow(arrow_chunks, schema=schema),  # throws the offset error
    )
```
When I tried to call this on the nested data, it failed:
```python
dataframe_from_dicts(only_additional_properties, ["additional_properties"])
# ArrowInvalid: offset overflow while concatenating arrays
```
When I called it on the unnested data, it worked and returned a DataFrame.
This is the thing I can't wrap my head around: if the issue is that one of the
string columns is too large, how could it work here?
```python
dataframe_from_dicts(
    unnested_additional_properties, unnested_additional_properties[0].keys()
)
# -> returns a polars DataFrame
```
Here are the schemas I had before. Nested:
```
Arrow Table schema: additional_properties: struct<availableforsale: string, description: string, handle: string, onlinestoreurl: string, primaryimageoriginalsrc: string, primaryimagesrc: string, primaryimagetransformedsrc: string, title: string>
  child 0, availableforsale: string
  child 1, description: string
  child 2, handle: string
  child 3, onlinestoreurl: string
  child 4, primaryimageoriginalsrc: string
  child 5, primaryimagesrc: string
  child 6, primaryimagetransformedsrc: string
  child 7, title: string
```
Unnested:
```
Arrow Table schema: title: string
primaryimagesrc: string
onlinestoreurl: string
primaryimageoriginalsrc: string
handle: string
availableforsale: string
primaryimagetransformedsrc: string
description: string
```
Yes, you are correct that the column contains different strings in the
production dataset.