scott-routledge2 commented on code in PR #43661:
URL: https://github.com/apache/arrow/pull/43661#discussion_r2465960489


##########
cpp/src/arrow/dataset/file_parquet.cc:
##########
@@ -649,8 +718,17 @@ Result<RecordBatchGenerator> ParquetFileFormat::ScanBatchesAsync(
     ARROW_ASSIGN_OR_RAISE(auto generator, reader->GetRecordBatchGenerator(
                                               reader, row_groups, column_projection,
                                               cpu_executor, rows_to_readahead));
+    // We need to skip casting the dictionary columns since the dataset_schema doesn't
+    // have the dictionary-encoding information. Parquet reader will return them with the
+    // dictionary type, which is what we eventually want.
+    const std::unordered_set<std::string>& dict_cols =
+        parquet_fragment->parquet_format_.reader_options.dict_columns;
+    // Casting before slicing is more efficient. Casts on slices might require wasteful
+    // allocations and computation.

Review Comment:
   When casting the string offsets buffer of a sliced array, we allocate `offset + length + 1` elements, memset the first `offset` elements to zero, and perform the cast on only `length` elements. So with batch_size = 1000, a row group of 10,000 elements requires allocating offsets buffers of lengths 1001, 2001, ..., 10001 to cast all of its batches: roughly 55,000 entries in total, versus the 10,001 a single cast of the unsliced row group needs.
   
   https://github.com/apache/arrow/blob/6a2e19a852b367c72d7b12da4d104456491ed8b7/cpp/src/arrow/compute/kernels/scalar_cast_string.cc#L251



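   To make the cost concrete, here is a minimal standalone sketch (not part of the PR; `MakeRowGroup`, `Demo`, and the array contents are made up, with the row count and batch size taken from the example above). It uses a `large_utf8` to `utf8` cast via `arrow::compute::Cast`, which should exercise the same offsets-conversion path as the linked kernel: casting the whole row group once and slicing afterwards allocates the `int32` offsets buffer a single time, while slicing first forces each cast to allocate and zero-fill the leading `offset` entries.

```cpp
#include <iostream>
#include <memory>

#include <arrow/api.h>
#include <arrow/compute/api.h>

// Build a large_utf8 array standing in for one decoded row group.
arrow::Result<std::shared_ptr<arrow::Array>> MakeRowGroup(int64_t n) {
  arrow::LargeStringBuilder builder;
  for (int64_t i = 0; i < n; ++i) {
    ARROW_RETURN_NOT_OK(builder.Append("row-" + std::to_string(i)));
  }
  return builder.Finish();
}

arrow::Status Demo() {
  constexpr int64_t kRows = 10000;
  constexpr int64_t kBatchSize = 1000;
  ARROW_ASSIGN_OR_RAISE(auto row_group, MakeRowGroup(kRows));

  // Cast first: the int32 offsets buffer (10,001 entries) is allocated once,
  // and every Slice() afterwards is zero-copy.
  ARROW_ASSIGN_OR_RAISE(arrow::Datum casted,
                        arrow::compute::Cast(row_group, arrow::utf8()));
  auto casted_array = casted.make_array();
  for (int64_t off = 0; off < kRows; off += kBatchSize) {
    auto batch = casted_array->Slice(off, kBatchSize);
    // ... hand `batch` to the consumer ...
  }

  // Slice first: each cast sees a nonzero array offset, so it allocates
  // offset + length + 1 offset entries and memsets the leading `offset`
  // ones (1001, 2001, ..., 10001 entries across the ten batches).
  for (int64_t off = 0; off < kRows; off += kBatchSize) {
    auto slice = row_group->Slice(off, kBatchSize);
    ARROW_ASSIGN_OR_RAISE(arrow::Datum casted_slice,
                          arrow::compute::Cast(slice, arrow::utf8()));
    (void)casted_slice;  // result unused; this loop only illustrates the cost
  }
  return arrow::Status::OK();
}

int main() {
  arrow::Status st = Demo();
  if (!st.ok()) std::cerr << st.ToString() << std::endl;
  return st.ok() ? 0 : 1;
}
```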