sahil1105 commented on code in PR #43661:
URL: https://github.com/apache/arrow/pull/43661#discussion_r1736878509
##########
cpp/src/arrow/dataset/file_parquet.cc:
##########
@@ -637,8 +707,17 @@ Result<RecordBatchGenerator> ParquetFileFormat::ScanBatchesAsync(
                         reader->GetRecordBatchGenerator(
                             reader, row_groups, column_projection,
                             ::arrow::internal::GetCpuThreadPool(),
                             rows_to_readahead));
+  // We need to skip casting the dictionary columns since the dataset_schema doesn't
+  // have the dictionary-encoding information. Parquet reader will return them with the
+  // dictionary type, which is what we eventually want.
Review Comment:
I wasn't sure, so I left that case untouched. There are additional casts performed
further up the chain (e.g. potentially in `MakeExecBatch`) that seem to handle those
cases, but I couldn't figure out where the dictionary case gets handled, so I left it
as is. I'm happy to implement it here if you can point me in the right direction.
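
For context, here is a minimal sketch (not the PR's actual implementation) of the
skip-cast rule described in the diff comment above: if the reader yields a
dictionary-encoded column whose value type matches the plain type in the
dataset_schema, the cast is skipped and the dictionary representation is kept.
The helper names `ShouldSkipCast` and `MaybeCastColumn` are illustrative only and
not part of the Arrow codebase.

```cpp
#include <memory>

#include "arrow/array.h"
#include "arrow/compute/cast.h"
#include "arrow/datum.h"
#include "arrow/result.h"
#include "arrow/type.h"

namespace {

// Illustrative helper: decide whether a column coming out of the Parquet reader
// should be cast to the dataset_schema type. If the reader produced
// dictionary<values=T> and the dataset schema only says T (because it carries no
// dictionary-encoding information), skip the cast and keep the dictionary form.
bool ShouldSkipCast(const arrow::DataType& reader_type,
                    const arrow::DataType& dataset_type) {
  if (reader_type.id() != arrow::Type::DICTIONARY) return false;
  const auto& dict_type = static_cast<const arrow::DictionaryType&>(reader_type);
  return dict_type.value_type()->Equals(dataset_type);
}

// Illustrative helper: either return the column unchanged or cast it to the
// dataset_schema type.
arrow::Result<std::shared_ptr<arrow::Array>> MaybeCastColumn(
    const std::shared_ptr<arrow::Array>& column,
    const std::shared_ptr<arrow::DataType>& dataset_type) {
  if (ShouldSkipCast(*column->type(), *dataset_type)) {
    return column;  // Keep the dictionary-encoded data as produced by the reader.
  }
  ARROW_ASSIGN_OR_RAISE(arrow::Datum casted,
                        arrow::compute::Cast(arrow::Datum(column), dataset_type));
  return casted.make_array();
}

}  // namespace
```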
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]