bonsairobo opened a new issue, #7129:
URL: https://github.com/apache/arrow-rs/issues/7129
**Describe the bug**

I tried to read a single column into a single `RecordBatch`, but instead I got 96 batches back, even though I configured the reader with an "unlimited" `batch_size`.

**To Reproduce**
```rust
use arrow::array::{ArrayRef, Int64Array, RecordBatch};
use arrow::datatypes::{DataType, Field, SchemaBuilder};
use futures::TryStreamExt;
use object_store::local::LocalFileSystem;
use object_store::path::Path;
use object_store::ObjectStore;
use parquet::arrow::async_reader::ParquetObjectReader;
use parquet::arrow::async_writer::ParquetObjectWriter;
use parquet::arrow::{AsyncArrowWriter, ParquetRecordBatchStreamBuilder, ProjectionMask};
use std::sync::Arc;

#[tokio::test]
async fn write_parquet() {
    let store = Arc::new(LocalFileSystem::new());

    // Write a single non-null Int64 column of 100 million rows to one file.
    let mut schema = SchemaBuilder::new();
    schema.push(Field::new("col1", DataType::Int64, false));
    let schema = Arc::new(schema.finish());
    let file_writer = ParquetObjectWriter::new(store.clone(), Path::from("part1"));
    let mut writer = AsyncArrowWriter::try_new(file_writer, schema, None).unwrap();
    let n_rows = 100_000_000;
    let col1 = Arc::new(Int64Array::from_iter_values(0..n_rows)) as ArrayRef;
    let to_write = RecordBatch::try_from_iter([("col1", col1)]).unwrap();
    writer.write(&to_write).await.unwrap();
    writer.close().await.unwrap();

    // Read the column back with an "unlimited" batch size.
    let obj_meta = store.head(&Path::from("part1")).await.unwrap();
    let builder =
        ParquetRecordBatchStreamBuilder::new(ParquetObjectReader::new(store.clone(), obj_meta))
            .await
            .unwrap();
    let file_meta = builder.metadata().file_metadata();
    let mask = ProjectionMask::columns(file_meta.schema_descr(), ["col1"]);
    let batch_stream = builder
        .with_batch_size(usize::MAX) // force reading into a single batch
        .with_projection(mask)
        .build()
        .unwrap();
    let batches: Vec<_> = batch_stream.try_collect().await.unwrap();
    assert_eq!(batches.len(), 1);

    let partition = batches.into_iter().next().unwrap();
    assert_eq!(partition.columns().len(), 1);
    assert_eq!(partition.columns()[0].len(), n_rows as usize);
}
```
Running this test fails with:
```
---- write_parquet stdout ----
thread 'write_parquet' panicked at test.rs:46:5:
assertion `left == right` failed
left: 96
right: 1
```
**Expected behavior**

I expected to get a single batch containing my entire column.
**Additional context**

This is an important use case because it lets me avoid concatenating the batches after the fact, which requires a large amount of additional memory. Presumably the 96 batches correspond to the file's row groups: with the writer's default maximum row group size of 1024 * 1024 rows, 100,000,000 rows span ceil(100_000_000 / 1_048_576) = 96 row groups, and the stream apparently never emits a batch that crosses a row group boundary.
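For reference, the workaround I'm trying to avoid looks roughly like this. It's a sketch: `combine` is a hypothetical helper, but `arrow::compute::concat_batches` is the real kernel, and it copies every input batch into one new allocation, so the inputs and the output coexist in memory until the inputs are dropped:

```rust
use arrow::array::RecordBatch;
use arrow::compute::concat_batches;
use arrow::error::ArrowError;

/// Hypothetical helper: copy many small batches into one contiguous batch.
/// Peak memory is roughly double the data size, since the input batches and
/// the concatenated output are both alive during the copy.
fn combine(batches: &[RecordBatch]) -> Result<RecordBatch, ArrowError> {
    let schema = batches
        .first()
        .map(|b| b.schema())
        .ok_or_else(|| ArrowError::InvalidArgumentError("no batches to combine".into()))?;
    concat_batches(&schema, batches)
}
```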