Aryik commented on issue #44164:
URL: https://github.com/apache/arrow/issues/44164#issuecomment-2364749362
After trying some variations, I cannot create a reproducible example, and I'm still struggling to explain the behavior I saw with our production data.
Here's my attempt at a repro, which passes without raising an exception:
```python
import string
import numpy as np
import pyarrow as pa
data = [
    "".join(np.random.choice(list(string.ascii_letters), n))
    for n in np.random.randint(10, 500, size=10_000)
]
nested_table = pa.Table.from_pydict(
    {"nested": [{"string_field_1": s, "string_field_2": s} for s in data * 100]}
)
unnested_table = pa.Table.from_pydict({"string_field": data * 500})
print(
    "Trying to combine chunks of unnested table with nbytes "
    f"{unnested_table['string_field'].nbytes}"
)
unnested_table["string_field"].combine_chunks()
print(
    "Trying to combine chunks of nested table with nbytes "
    f"{nested_table['nested'].nbytes}"
)
nested_table["nested"].combine_chunks()
```
In production, we have this function:
```python
from typing import Any, Dict, List, cast

import polars as pl
import pyarrow as pa

def dataframe_from_dicts(
    dicts: List[Dict[str, Any]],
    schema: HtSchemaDict | List[str],  # HtSchemaDict is an internal type alias
) -> pl.DataFrame:
    """Convert a list of dictionaries into a DataFrame with the specified schema.

    Args:
        dicts (List[dict]): The list of dictionaries to convert to a DataFrame.
        schema (HtSchemaDict or List[str]): The schema of the DataFrame.
            This supports either a dictionary of the schema, or a list of the
            schema keys if we don't know all the schema values and the values
            should be inferred.

    Returns:
        pl.DataFrame: The DataFrame.
    """
    schema_keys = list(schema.keys() if isinstance(schema, dict) else schema)
    # Normalize each row, providing a None value for any missing key so the
    # schema will fill in the missing values with nulls in our model
    # transformations.
    normalized_data = [
        {key: row.get(key, None) for key in schema_keys} for row in dicts
    ]
    # Convert to an Arrow Table.
    arrow_table = pa.Table.from_pydict(
        {key: [row[key] for row in normalized_data] for key in schema_keys}
    )
    # Provide chunks of the Arrow Table to Polars. If we have too much data in
    # a single chunk, we get these strange errors:
    #   pyarrow.lib.ArrowInvalid: offset overflow while concatenating arrays
    arrow_chunks = arrow_table.to_batches(max_chunksize=10000)
    # Create a Polars DataFrame from the Arrow record batches.
    return cast(
        pl.DataFrame,
        # Allow passing in an empty schema so that it can be inferred.
        pl.from_arrow(arrow_chunks, schema=schema),  # throws the offset error
    )
```
When I tried to call this on the nested data, it failed:
```python
dataframe_from_dicts(only_additional_properties, ["additional_properties"])
# ArrowInvalid: offset overflow while concatenating arrays
```
When I called it on the unnested data, it worked and returned a DataFrame.
This is the thing I can't wrap my head around: if the issue is that one of the
string columns is too large, how could it work here?
```python
dataframe_from_dicts(
    unnested_additional_properties, unnested_additional_properties[0].keys()
)
# -> returns a polars DataFrame
```
Here are the schemas I had before. Nested:
```
Arrow Table schema: additional_properties: struct<availableforsale: string, description: string, handle: string, onlinestoreurl: string, primaryimageoriginalsrc: string, primaryimagesrc: string, primaryimagetransformedsrc: string, title: string>
  child 0, availableforsale: string
  child 1, description: string
  child 2, handle: string
  child 3, onlinestoreurl: string
  child 4, primaryimageoriginalsrc: string
  child 5, primaryimagesrc: string
  child 6, primaryimagetransformedsrc: string
  child 7, title: string
```
Unnested:
```
Arrow Table schema: title: string
primaryimagesrc: string
onlinestoreurl: string
primaryimageoriginalsrc: string
handle: string
availableforsale: string
primaryimagetransformedsrc: string
description: string
```
Yes, you are correct that the column contains different strings in the
production dataset.