mapleFU commented on code in PR #43661:
URL: https://github.com/apache/arrow/pull/43661#discussion_r1715585790


##########
cpp/src/arrow/dataset/file_parquet.cc:
##########
@@ -555,6 +562,57 @@ Future<std::shared_ptr<parquet::arrow::FileReader>> ParquetFileFormat::GetReader
       });
 }
 
+struct CastingGenerator {

Review Comment:
   > Based on what I see, that is only responsible for casting the data to the 
logical type specified in the parquet metadata and not the Arrow type we want 
to convert to (the one in the dataset_schema)
   
   The Parquet logical type doesn't carry an Arrow schema, does it? The binary reader reads via `::arrow::BinaryBuilder` and then casts the result to the user-specified binary type.
   
   > For strings, that seems to always map to a String type (based on 
FromByteArray which is called by GetArrowType which is called by GetTypeForNode 
which is called by NodeToSchemaField which is called in SchemaManifest::Make 
during the creation of the LeafReader).
   
   Yeah, you're right, the reader "casts" with the file schema rather than with an expected schema. I think a native cast would be better here, but that doesn't solve your problem; perhaps I can try adding a naive type hint to `SchemaManifest` to solve this, but it would take some time.
   
   ```cpp
   ::arrow::Result<std::shared_ptr<ArrowType>> GetTypeForNode(
       int column_index, const schema::PrimitiveNode& primitive_node,
       SchemaTreeContext* ctx)
   ```
   
   Maybe we should rethink the `GetTypeForNode` handling for string/large_string/string_view, or use some hand-written type hint here. A casting generator also works for me when the reader cannot provide the right cast.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
