sahil1105 commented on code in PR #43661:
URL: https://github.com/apache/arrow/pull/43661#discussion_r1736878509
##########
cpp/src/arrow/dataset/file_parquet.cc:
##########
@@ -637,8 +707,17 @@ Result<RecordBatchGenerator> ParquetFileFormat::ScanBatchesAsync(
                         reader->GetRecordBatchGenerator(
                             reader, row_groups, column_projection,
                             ::arrow::internal::GetCpuThreadPool(),
                             rows_to_readahead));
+  // We need to skip casting the dictionary columns since the dataset_schema doesn't
+  // have the dictionary-encoding information. Parquet reader will return them with the
+  // dictionary type, which is what we eventually want.
Review Comment:
I wasn't sure, so I left that case untouched. There are additional casts performed
further up the chain (e.g. potentially in `MakeExecBatch`) that seem to handle those
cases, but I couldn't figure out where the dictionary case gets handled, so I left it
as is. I'm happy to implement it here if you can point me in the right direction.
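
For context, here is a minimal sketch (not the PR's actual implementation) of the
skip-cast rule described in the diff comment above: if the reader yields a
dictionary-encoded column whose value type matches the plain type in the
dataset_schema, the cast is skipped and the dictionary representation is kept.
The helper names `ShouldSkipCast` and `MaybeCastColumn` are illustrative only and
not part of the Arrow codebase.

```cpp
#include <memory>

#include "arrow/array.h"
#include "arrow/compute/cast.h"
#include "arrow/datum.h"
#include "arrow/result.h"
#include "arrow/type.h"

namespace {

// Illustrative helper: decide whether a column coming out of the Parquet reader
// should be cast to the dataset_schema type. If the reader produced
// dictionary<values=T> and the dataset schema only says T (because it carries no
// dictionary-encoding information), skip the cast and keep the dictionary form.
bool ShouldSkipCast(const arrow::DataType& reader_type,
                    const arrow::DataType& dataset_type) {
  if (reader_type.id() != arrow::Type::DICTIONARY) return false;
  const auto& dict_type = static_cast<const arrow::DictionaryType&>(reader_type);
  return dict_type.value_type()->Equals(dataset_type);
}

// Illustrative helper: either return the column unchanged or cast it to the
// dataset_schema type.
arrow::Result<std::shared_ptr<arrow::Array>> MaybeCastColumn(
    const std::shared_ptr<arrow::Array>& column,
    const std::shared_ptr<arrow::DataType>& dataset_type) {
  if (ShouldSkipCast(*column->type(), *dataset_type)) {
    return column;  // Keep the dictionary-encoded data as produced by the reader.
  }
  ARROW_ASSIGN_OR_RAISE(arrow::Datum casted,
                        arrow::compute::Cast(arrow::Datum(column), dataset_type));
  return casted.make_array();
}

}  // namespace
```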
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]