[
https://issues.apache.org/jira/browse/ARROW-17828?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17608768#comment-17608768
]
Ben Epstein commented on ARROW-17828:
-------------------------------------
I (personally) would prefer you convert to large string and emit a warning
("converting string to large_string as column XX is over the arrow string size
limit").
I'm using vaex for all my arrow work, so, building on your suggestion, here's how
I'm handling it (in case others find themselves here):
{code:java}
import pyarrow as pa
import vaex
import numpy as np
from vaex.dataframe import DataFrame

n = 50_000
x = str(np.random.randint(low=0, high=1000, size=(30_000,)).tolist())

# Create a df with a string column that is too large
df = vaex.from_arrays(
    id=list(range(n)),
    y=np.random.randint(low=0, high=1000, size=n)
)
df["text"] = vaex.vconstant(x, len(df))

# Byte limit for arrow strings.
# Because 1 character = 1 byte (for ASCII), the total number of characters in the
# column in question must be less than the size_limit.
size_limit = 2 * 1e9

def validate_str_cols(df: DataFrame) -> DataFrame:
    for col, dtype in zip(df.get_column_names(), df.dtypes):
        if dtype == str and df[col].str.len().sum() >= size_limit:
            df[col] = df[col].to_arrow().cast(pa.large_string())
    return df

# text is type string
print(df.dtypes)
df = validate_str_cols(df)
# text is now type large_string
print(df.dtypes)
# works!
y = df.text.values.combine_chunks(){code}
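If you're working with a plain pyarrow ChunkedArray (no vaex), the same idea applies: cast to large_string (64-bit offsets) before combining, and warn when you do. A rough sketch of that, where the combine_safely helper and the nbytes-based check are just illustrative, not a pyarrow API:
{code:java}
import warnings
import pyarrow as pa

def combine_safely(chunked: pa.ChunkedArray) -> pa.Array:
    # nbytes is a rough proxy for the total character data in the column
    if chunked.type == pa.string() and chunked.nbytes >= 2**31 - 1:
        warnings.warn("converting string to large_string as the column is over the arrow string size limit")
        # large_string uses 64-bit offsets, so combine_chunks won't overflow
        chunked = chunked.cast(pa.large_string())
    return chunked.combine_chunks()

# Toy data just to show the call; a real overflow needs >2 GiB of characters
t = pa.chunked_array([["a" * 10] * 3, ["b" * 10] * 2])
print(combine_safely(t).type)  # string here, large_string once over the limit{code}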
> [C++][Python] Large strings cause ArrowInvalid: offset overflow while
> concatenating arrays
> ------------------------------------------------------------------------------------------
>
> Key: ARROW-17828
> URL: https://issues.apache.org/jira/browse/ARROW-17828
> Project: Apache Arrow
> Issue Type: Bug
> Components: C++, Python
> Affects Versions: 9.0.0
> Reporter: Ben Epstein
> Priority: Major
> Labels: good-first-issue
>
> When working with medium-sized datasets that have very long strings, arrow
> fails when trying to operate on the strings. The root cause is the
> `combine_chunks` function.
> Here is a minimal reproducible example:
> {code:java}
> import numpy as np
> import pyarrow as pa
> # Create a large string
> x = str(np.random.randint(low=0,high=1000, size=(30000,)).tolist())
> t = pa.chunked_array([x]*20_000)
> # Combine the chunks into large string array - fails
> combined = t.combine_chunks(){code}
> I get the following error
> {code:java}
> ---------------------------------------------------------------------------
> ArrowInvalid Traceback (most recent call last)
> /var/folders/x6/00594j4s2yv3swcn98bn8gxr0000gn/T/ipykernel_95780/4128956270.py in <module>
> ----> 1 z=t.combine_chunks()
> ~/.venv/lib/python3.7/site-packages/pyarrow/table.pxi in pyarrow.lib.ChunkedArray.combine_chunks()
> ~/.venv/lib/python3.7/site-packages/pyarrow/array.pxi in pyarrow.lib.concat_arrays()
> ~/Documents/Github/dataquality/.venv/lib/python3.7/site-packages/pyarrow/error.pxi in pyarrow.lib.pyarrow_internal_check_status()
> ~/.venv/lib/python3.7/site-packages/pyarrow/error.pxi in pyarrow.lib.check_status()
> ArrowInvalid: offset overflow while concatenating arrays {code}
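> Rough arithmetic on why this overflows: the string type stores its character
> data behind 32-bit offsets, so one combined array is capped at 2^31 - 1 bytes
> of character data, and this example goes well past that (a sketch, assuming
> roughly 5 characters per rendered integer):
> {code:java}
> import numpy as np
> # One chunk's string is ~30,000 integers rendered as text, and there are
> # 20,000 identical chunks, so the combined character data is ~3 GB.
> x = str(np.random.randint(low=0, high=1000, size=(30_000,)).tolist())
> total_bytes = len(x) * 20_000
> print(total_bytes, total_bytes > 2**31 - 1)  # well past the 32-bit limit {code}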
> With smaller strings or smaller arrays this works fine.
> {code:java}
> x = str(np.random.randint(low=0,high=1000, size=(10,)).tolist())
> t = pa.chunked_array([x]*1000)
> combined = t.combine_chunks(){code}
> The first example that fails takes a few minutes to run. If you'd like a
> faster example for experimentation, you can use `vaex` to generate the
> chunked array much faster. This throws the identical error and runs in
> about 1 second.
> {code:java}
> import vaex
> import numpy as np
> n = 50_000
> x = str(np.random.randint(low=0,high=1000, size=(30_000,)).tolist())
> df = vaex.from_arrays(
>     id=list(range(n)),
>     y=np.random.randint(low=0, high=1000, size=n)
> )
> df["text"] = vaex.vconstant(x, len(df))
> # text_chunk_array is now a pyarrow.lib.ChunkedArray
> text_chunk_array = df.text.values
> x = text_chunk_array.combine_chunks() {code}
--
This message was sent by Atlassian Jira
(v8.20.10#820010)