[
https://issues.apache.org/jira/browse/ARROW-13150?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17368877#comment-17368877
]
Joris Van den Bossche commented on ARROW-13150:
-----------------------------------------------
The two are implemented differently. The Table version is backed by the C++
{{arrow::Table::CombineChunks}}
(https://github.com/apache/arrow/blob/998a2a1668ea57a49d85fbb38f7f0e7eb94c29db/cpp/src/arrow/table.cc#L532-L568),
which has specific handling to allow multiple chunks in the output to avoid
overflow errors.
The ChunkedArray version is backed by {{pa.concat_arrays}}, which uses the C++
{{arrow::Concatenate}} function to concatenate arrays
(https://github.com/apache/arrow/blob/998a2a1668ea57a49d85fbb38f7f0e7eb94c29db/cpp/src/arrow/array/concatenate.h#L28-L36),
which doesn't have this handling ({{CombineChunks}} is actually calling
{{Concatenate}} on each column and handling the possible overflow).
That explains the difference, but the question is of course whether we want to
unify that behaviour, or, for example, add a keyword to indicate that multiple
chunks are also fine in the ChunkedArray case.
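
To make the overflow handling described above concrete, here is a minimal
sketch in plain Python of the grouping idea: consecutive chunks are gathered
into groups whose summed data length stays within the 32-bit offset range, and
each group would then be concatenated on its own. The function name
{{group_chunks}} and its parameters are hypothetical and chosen for
illustration; this is not the actual {{CombineChunks}} code.

```python
# Largest offset that fits in the signed 32-bit offsets of string/binary arrays.
INT32_MAX = 2**31 - 1


def group_chunks(data_lengths, limit=INT32_MAX):
    """Greedily group consecutive chunk indices.

    `data_lengths` holds the number of data bytes in each chunk (hypothetical
    input). Each returned group has a summed length of at most `limit`.
    """
    groups, current, current_len = [], [], 0
    for i, length in enumerate(data_lengths):
        if length > limit:
            raise ValueError(f"chunk {i} alone exceeds the limit")
        # Start a new group when adding this chunk would overflow.
        if current and current_len + length > limit:
            groups.append(current)
            current, current_len = [], 0
        current.append(i)
        current_len += length
    if current:
        groups.append(current)
    return groups


if __name__ == "__main__":
    # Small limit so the grouping is easy to follow by hand.
    print(group_chunks([3, 3, 3, 3, 3], limit=7))
```

With pyarrow, each group of chunks could then be passed to {{pa.concat_arrays}}
and the results wrapped with {{pa.chunked_array}}, which yields several chunks
instead of an offset overflow error. How the per-chunk data lengths are
obtained is left open here.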
> [Python] combine_chunks fails on column of table, but does not error on table
> itself
> ------------------------------------------------------------------------------------
>
> Key: ARROW-13150
> URL: https://issues.apache.org/jira/browse/ARROW-13150
> Project: Apache Arrow
> Issue Type: Bug
> Reporter: Gert Hulselmans
> Priority: Minor
>
> combine_chunks fails on column of table, but does not error on table itself
> (but creates 3 chunks instead).
> Is there a reason why they are not handled the same?
> {code:python}
> In [90]: pa.__version__
> Out[90]: '4.0.0'
> # Get shape
> In [85]: pa_table.shape
> Out[85]: (102753589, 1)
> In [86]: pa_col1_array = pa_table.column(0)
> # Get number of chunks
> In [87]: pa_col1_array.num_chunks
> Out[87]: 4404
> # Combining chunks on the pyarrow table with one column works.
> In [88]: pa_table.combine_chunks()
> Out[88]:
> pyarrow.Table
> # id=TEW__014e25__c14e1d__Multiome_RNA_brain_10x_no_perm: string
> # Combining chunks on the column itself does not work.
> In [89]: pa_col1_array.combine_chunks()
> ---------------------------------------------------------------------------
> ArrowInvalid Traceback (most recent call last)
> <ipython-input-89-fdd0d0056a8e> in <module>
> ----> 1 pa_col1_array.combine_chunks()
> /software/miniconda3/envs/cisTopic/lib/python3.7/site-packages/pyarrow/table.pxi
> in pyarrow.lib.ChunkedArray.combine_chunks()
> /software/miniconda3/envs/cisTopic/lib/python3.7/site-packages/pyarrow/array.pxi
> in pyarrow.lib.concat_arrays()
> /software/miniconda3/envs/cisTopic/lib/python3.7/site-packages/pyarrow/error.pxi
> in pyarrow.lib.pyarrow_internal_check_status()
> /software/miniconda3/envs/cisTopic/lib/python3.7/site-packages/pyarrow/error.pxi
> in pyarrow.lib.check_status()
> ArrowInvalid: offset overflow while concatenating arrays
> # Assign the table with combined chunks to a new table.
> In [91]: pa_table_combined = pa_table.combine_chunks()
> # Get first column
> In [92]: pa_col1_array_from_pa_table_combined = pa_table_combined.column(0)
> # Get number of chunks
> In [93]: pa_col1_array_from_pa_table_combined.num_chunks
> Out[93]: 3
> # Try to combine column 1 again.
> In [94]: pa_col1_array_from_pa_table_combined.combine_chunks()
> ---------------------------------------------------------------------------
> ArrowInvalid Traceback (most recent call last)
> <ipython-input-94-e2e323e6519f> in <module>
> ----> 1 pa_col1_array_from_pa_table_combined.combine_chunks()
> /software/miniconda3/envs/cisTopic/lib/python3.7/site-packages/pyarrow/table.pxi
> in pyarrow.lib.ChunkedArray.combine_chunks()
> /software/miniconda3/envs/cisTopic/lib/python3.7/site-packages/pyarrow/array.pxi
> in pyarrow.lib.concat_arrays()
> /software/miniconda3/envs/cisTopic/lib/python3.7/site-packages/pyarrow/error.pxi
> in pyarrow.lib.pyarrow_internal_check_status()
> /software/miniconda3/envs/cisTopic/lib/python3.7/site-packages/pyarrow/error.pxi
> in pyarrow.lib.check_status()
> ArrowInvalid: offset overflow while concatenating arrays
> # Get sizes of each chunk.
> In [106]: [chunk.nbytes for chunk in pa_col1_array_from_pa_table_combined.chunks]
> Out[106]: [2341650593, 2342925682, 241257842]
> {code}
--
This message was sent by Atlassian Jira
(v8.3.4#803005)