joellubi commented on PR #43066: URL: https://github.com/apache/arrow/pull/43066#issuecomment-2211258792
> @joellubi For the Arrow C++ use case, encode is now implemented like this patch; decode, however, is implemented by batch:

Thanks @mapleFU. I think it would be nice to keep the behavior aligned, but there is a slight difference in how the Go and C++ implementations batch reads.

In C++, the [ReadValues](https://github.com/apache/arrow/blob/5b5c164a6a467af2803e927b2de1b9b6ee5de895/cpp/src/parquet/column_reader.cc#L664-L671) method reads "up to batch_size values from the current data page". In Go, the [readBatch](https://github.com/apache/arrow/blob/5b5c164a6a467af2803e927b2de1b9b6ee5de895/go/parquet/file/column_reader.go#L487-L527) method "will read until it either reads in batchSize values or it hits the end of the column chunk, including reading multiple pages".

Since all values must be decoded within the window of a single page, it's safe in Go to decode the whole page when `SetData` is called, whereas a batch may in general span multiple pages. In C++, the number of values read in a single batch is limited to the values left in the current page, so it's safe to decode in separate batches without crossing a page boundary.
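
To make the difference concrete, here is a toy Go model of the two batching behaviors. It is only a sketch and does not use the Arrow API: `pagedColumn`, `readWithinPage`, and `readAcrossPages` are hypothetical names, and each inner slice stands in for the values of one data page.

```go
package main

import "fmt"

// pagedColumn is a hypothetical stand-in for a column chunk.
// Each inner slice models the values stored in one data page.
type pagedColumn struct {
	pages [][]int32
	page  int // index of the current page
	pos   int // offset of the next unread value in the current page
}

// readWithinPage models the C++-style behavior described above: it returns
// at most batchSize values and never crosses the end of the current page.
// It returns nil once the column chunk is exhausted.
func (c *pagedColumn) readWithinPage(batchSize int) []int32 {
	// Skip pages that have been fully consumed (or are empty).
	for c.page < len(c.pages) && c.pos == len(c.pages[c.page]) {
		c.page++
		c.pos = 0
	}
	if c.page == len(c.pages) {
		return nil
	}
	cur := c.pages[c.page]
	n := len(cur) - c.pos
	if batchSize < n {
		n = batchSize
	}
	out := cur[c.pos : c.pos+n]
	c.pos += n
	return out
}

// readAcrossPages models the Go-style behavior described above: it keeps
// reading, moving on to later pages as needed, until it has batchSize values
// or the column chunk ends.
func (c *pagedColumn) readAcrossPages(batchSize int) []int32 {
	var out []int32
	for len(out) < batchSize {
		chunk := c.readWithinPage(batchSize - len(out))
		if chunk == nil {
			break
		}
		out = append(out, chunk...)
	}
	return out
}

func main() {
	pages := [][]int32{{1, 2, 3}, {4, 5}, {6, 7, 8, 9}}

	perPage := &pagedColumn{pages: pages}
	fmt.Println(perPage.readWithinPage(4)) // stops at the page boundary

	multiPage := &pagedColumn{pages: pages}
	fmt.Println(multiPage.readAcrossPages(4)) // spans the first two pages
}
```

With a batch size of 4 and a first page of 3 values, the page-bounded reader returns 3 values while the multi-page reader returns 4, which is why a per-batch decode in the Go reader would have to handle a batch that straddles a page boundary.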
