jonded94 commented on issue #7973:
URL: https://github.com/apache/arrow-rs/issues/7973#issuecomment-3492435673

   Just leaving a small investigation of what `pyarrow` does here (it indeed splits the row group up into smaller batches, and if one tries to concatenate those, we get the same error):
   ```
   >>> import pyarrow.parquet
   >>> file = pyarrow.parquet.ParquetFile("000_00000.parquet")
   >>> file.metadata
   <pyarrow._parquet.FileMetaData object at 0x7f28f8537b00>
     created_by: parquet-cpp-arrow version 19.0.0
     num_columns: 8
     num_rows: 776000
     num_row_groups: 1
     format_version: 2.6
     serialized_size: 2665
   >>> rg = file.read_row_group(0)
   >>> rg
   pyarrow.Table
   text: string
   id: string
   edu_int_score: int64
   edu_score: double
   fasttext_score: double
   language: string
   language_score: double
   url: string
   >>> rg.num_rows
   776000
   >>> batches = list(rg.to_batches())
   >>> len(batches)
   3
   >>> [b.num_rows for b in batches]
   [351429, 355189, 69382]
   >>> c = rg.column("text")
   >>> c.combine_chunks()
   Traceback (most recent call last):
     File "<stdin>", line 1, in <module>
     File "pyarrow/table.pxi", line 780, in 
pyarrow.lib.ChunkedArray.combine_chunks
     File "pyarrow/array.pxi", line 5019, in pyarrow.lib.concat_arrays
     File "pyarrow/error.pxi", line 155, in 
pyarrow.lib.pyarrow_internal_check_status
     File "pyarrow/error.pxi", line 92, in pyarrow.lib.check_status
   pyarrow.lib.ArrowInvalid: offset overflow while concatenating arrays, 
consider casting input from `string` to `large_string` first.
   ```
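   
   For completeness, the workaround the error message itself suggests does work here: upcast the chunked column to `large_string` (64-bit offsets) before combining. A minimal sketch, continuing the session above:
   ```
   >>> import pyarrow as pa
   >>> # widen the offsets from int32 to int64, then concatenation fits
   >>> c.cast(pa.large_string()).combine_chunks()
   ```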
   
   @alamb thanks for all your pointers! `StringView` really sounds like a cool concept!
   Casting types during read (for example `String` into `StringView`) should probably be opt-in behaviour, right? And if we are fine with reading data as a type different from the one it was stored with, we could also upcast to `LargeString` directly, couldn't we?
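   
   To illustrate the idea from the consumer side (a `pyarrow` sketch, not the arrow-rs API under discussion): widen every `string` field to `large_string` after reading, so later concatenation cannot overflow 32-bit offsets.
   ```
   >>> import pyarrow as pa
   >>> # build a target schema with string fields widened to large_string
   >>> upcast = pa.schema(
   ...     pa.field(f.name, pa.large_string() if f.type == pa.string() else f.type)
   ...     for f in rg.schema
   ... )
   >>> rg.cast(upcast).combine_chunks()  # single-chunk columns are now possible
   ```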
   
   We probably want two solutions here: first, the `RecordBatchIterator` should intelligently lower the batch size if the result wouldn't be expressible as Arrow data, as you described (a consumer can approximate this by hand today, see the sketch below); and second, this opt-in type casting?
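   
   A sketch of that by-hand approximation with `pyarrow`'s `ParquetFile.iter_batches`, capping the batch size so no single chunk's string payload comes near the 32-bit offset limit:
   ```
   >>> # read the oversized row group in explicitly capped batches
   >>> for batch in file.iter_batches(batch_size=64 * 1024, row_groups=[0]):
   ...     ...  # each smaller batch carries only a slice of the column's >2 GiB payload
   ```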

