jonded94 commented on issue #7973:
URL: https://github.com/apache/arrow-rs/issues/7973#issuecomment-3492435673
Just leaving a small investigation of what `pyarrow` does here: it does
split the row group into smaller batches, and if one tries to concatenate
them again, we get the same error:
```
>>> import pyarrow.parquet
>>> file = pyarrow.parquet.ParquetFile("000_00000.parquet")
>>> file.metadata
<pyarrow._parquet.FileMetaData object at 0x7f28f8537b00>
created_by: parquet-cpp-arrow version 19.0.0
num_columns: 8
num_rows: 776000
num_row_groups: 1
format_version: 2.6
serialized_size: 2665
>>> rg = file.read_row_group(0)
>>> rg
pyarrow.Table
text: string
id: string
edu_int_score: int64
edu_score: double
fasttext_score: double
language: string
language_score: double
url: string
>>> rg.num_rows
776000
>>> batches = list(rg.to_batches())
>>> len(batches)
3
>>> [b.num_rows for b in batches]
[351429, 355189, 69382]
>>> c = rg.column("text")
>>> c.combine_chunks()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "pyarrow/table.pxi", line 780, in pyarrow.lib.ChunkedArray.combine_chunks
  File "pyarrow/array.pxi", line 5019, in pyarrow.lib.concat_arrays
  File "pyarrow/error.pxi", line 155, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 92, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: offset overflow while concatenating arrays, consider casting input from `string` to `large_string` first.
```
@alamb thanks for all your pointers! `StringView` really sounds like a cool
concept!
Casting types during read (for example `String` into `StringView`)
should probably be an opt-in behaviour, right? If we are fine anyway with
reading data as a type different from the one it was stored with, we
could also upcast to `LargeString` directly, couldn't we?
We probably want two solutions here: first, the `RecordBatchIterator`
should intelligently lower the batch size if the result wouldn't be
expressible as Arrow data, as you described; and second, this opt-in
type casting?
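
To make the first idea a bit more concrete, here is a rough sketch of how batch boundaries could be chosen. It is plain Python and not the arrow-rs implementation or API: `plan_batches`, `value_lengths`, `max_rows` and `max_bytes` are all hypothetical names, and it assumes the byte length of every string value is known before the batch is built.

```python
from typing import Iterable, Iterator, Tuple

# Largest total byte length addressable with 32-bit string offsets.
INT32_MAX = 2**31 - 1


def plan_batches(
    value_lengths: Iterable[int],
    max_rows: int,
    max_bytes: int = INT32_MAX,
) -> Iterator[Tuple[int, int]]:
    """Yield half-open (start, end) row ranges for successive batches.

    A batch is closed early when adding the next value would push its
    total string bytes past `max_bytes`, otherwise when it has reached
    `max_rows` rows.
    """
    if max_rows < 1:
        raise ValueError("max_rows must be at least 1")
    start = 0
    end = 0
    bytes_in_batch = 0
    for length in value_lengths:
        if length > max_bytes:
            # A single value this large cannot fit in any batch.
            raise ValueError("single value exceeds max_bytes")
        if end - start == max_rows or bytes_in_batch + length > max_bytes:
            yield (start, end)
            start = end
            bytes_in_batch = 0
        bytes_in_batch += length
        end += 1
    if end > start:
        # Emit the final, possibly shorter, batch.
        yield (start, end)


# Five 4-byte values with a tiny 10-byte limit, to show the early split.
ranges = list(plan_batches([4, 4, 4, 4, 4], max_rows=3, max_bytes=10))
```

With a generous byte limit the same input falls back to plain `max_rows`-sized batches, so the lowering only kicks in when the data would otherwise overflow.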