mapleFU commented on code in PR #43661:
URL: https://github.com/apache/arrow/pull/43661#discussion_r1715585790
##########
cpp/src/arrow/dataset/file_parquet.cc:
##########
@@ -555,6 +562,57 @@ Future<std::shared_ptr<parquet::arrow::FileReader>> ParquetFileFormat::GetReader
});
}
+struct CastingGenerator {
Review Comment:
> Based on what I see, that is only responsible for casting the data to the
> logical type specified in the parquet metadata and not the Arrow type we want
> to convert to (the one in the dataset_schema)

A Parquet logical type doesn't carry an Arrow schema, does it? The binary
reader reads into an `::arrow::BinaryBuilder` and then casts the result to the
user-specified binary type.
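For illustration, here is a minimal array-level sketch of that extra cast
(hypothetical helper name, not code from this PR), assuming the file-derived
type is `utf8` and the dataset schema asks for `large_utf8`:
```
// Hypothetical illustration, not code from this PR: the leaf reader yields the
// type derived from the Parquet metadata (utf8 here), so an extra cast is
// needed to reach the type the caller asked for (large_utf8).
#include "arrow/api.h"
#include "arrow/compute/cast.h"

::arrow::Result<std::shared_ptr<::arrow::Array>> CastToRequestedType(
    const std::shared_ptr<::arrow::Array>& read_from_file) {
  // read_from_file->type() is utf8(), i.e. what FromByteArray infers from the
  // file schema; the dataset schema wants large_utf8(), so cast explicitly.
  return ::arrow::compute::Cast(*read_from_file, ::arrow::large_utf8());
}
```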
> For strings, that seems to always map to a String type (based on
> FromByteArray which is called by GetArrowType which is called by
> GetTypeForNode which is called by NodeToSchemaField which is called in
> SchemaManifest::Make during the creation of the LeafReader).

Yeah, you're right, the reader "casts" with the file schema rather than an
expected schema. I think a native cast would be better here, but that doesn't
solve your problem. Perhaps I can try adding a naive `SchemaManifest` with a
type hint to solve this, but it would take some time.
```
::arrow::Result<std::shared_ptr<ArrowType>> GetTypeForNode(
    int column_index, const schema::PrimitiveNode& primitive_node,
    SchemaTreeContext* ctx)
```
Maybe we should rethink the `GetTypeForNode` handling for
string/large_string/string_view, or use some hand-written type hint here. A
casting generator is also fine with me when the reader cannot provide the right
cast.
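As a rough sketch of what I mean by the casting-generator approach
(illustration only, with made-up names, not the `CastingGenerator` added in
this PR): wrap the batches coming out of the reader and cast every column whose
type differs from the expected dataset schema.
```
// Rough illustrative sketch, not the CastingGenerator from this PR;
// CastBatchToExpectedSchema is a made-up helper name.
#include <memory>
#include <utility>
#include <vector>

#include "arrow/api.h"
#include "arrow/compute/cast.h"

::arrow::Result<std::shared_ptr<::arrow::RecordBatch>> CastBatchToExpectedSchema(
    const std::shared_ptr<::arrow::RecordBatch>& batch,
    const std::shared_ptr<::arrow::Schema>& expected_schema) {
  std::vector<std::shared_ptr<::arrow::Array>> columns;
  columns.reserve(batch->num_columns());
  for (int i = 0; i < batch->num_columns(); ++i) {
    const auto& target_type = expected_schema->field(i)->type();
    if (batch->column(i)->type()->Equals(*target_type)) {
      // Already the requested type, nothing to do.
      columns.push_back(batch->column(i));
      continue;
    }
    // The reader produced the type inferred from the Parquet metadata
    // (e.g. utf8); cast it to the type the dataset schema asks for
    // (e.g. large_utf8 or string_view).
    ARROW_ASSIGN_OR_RAISE(
        auto casted, ::arrow::compute::Cast(*batch->column(i), target_type));
    columns.push_back(casted);
  }
  return ::arrow::RecordBatch::Make(expected_schema, batch->num_rows(),
                                    std::move(columns));
}
```
This wouldn't need any change to `GetTypeForNode`, but it pays an extra cast
per batch, which is why a type hint in `SchemaManifest` might still be the
nicer long-term fix.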
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]