hadrian-reppas commented on issue #46629:
URL: https://github.com/apache/arrow/issues/46629#issuecomment-2997813098
Hi, I'm taking a look at this issue and had a few questions:
1. What are some situations where schemas are normalized when reading the
files? The only `FragmentEvolutionStrategy` I have found is
`BasicFragmentEvolution`, which doesn't handle type promotions. Or are you
talking about the call to `UnifySchemas` in `DatasetFactory::Inspect`?
2. If so, it looks like the C++ API already supports this:
```cpp
// value: dictionary<values=string, indices=int8, ordered=0>
std::string path1 = "dataset/int8.parquet";
// value: dictionary<values=string, indices=int16, ordered=0>
std::string path2 = "dataset/int16.parquet";

auto factory = FileSystemDatasetFactory::Make(
    std::make_shared<arrow::fs::LocalFileSystem>(), {path1, path2},
    std::make_shared<ParquetFileFormat>(),
    FileSystemFactoryOptions{}).ValueOrDie();

// Inspect every fragment and merge field types permissively.
InspectOptions options;
options.fragments = InspectOptions::kInspectAllFragments;
options.field_merge_options = Field::MergeOptions::Permissive();

// value: dictionary<values=string, indices=int16, ordered=0>
auto schema = factory->Inspect(options).ValueOrDie();

auto dataset = factory->Finish(schema).ValueOrDie();
auto scanner = dataset->NewScan().ValueOrDie()->Finish().ValueOrDie();

// value: dictionary<values=string, indices=int16, ordered=0>
auto table = scanner->ToTable().ValueOrDie();
```
It seems like doing it this way in Python is currently impossible because
the `FileSystemDatasetFactory.inspect` method [does not take an `options`
argument](https://arrow.apache.org/docs/python/generated/pyarrow.dataset.FileSystemDatasetFactory.html#pyarrow.dataset.FileSystemDatasetFactory.inspect).
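
To make the outcome in the C++ example above concrete: the merged index type
(`int16`) is the wider of the two inputs (`int8` and `int16`). The toy model
below is not Arrow code and says nothing about how `UnifySchemas` is
implemented; it only sketches that widening rule for signed index types.
`IndexType`, `WidenIndex`, and `ToString` are hypothetical names made up for
this illustration.

```cpp
#include <algorithm>
#include <string>

// Hypothetical stand-in for a signed dictionary index type (not an Arrow class).
struct IndexType {
  int bit_width;  // e.g. 8, 16, 32, or 64
};

// Widen two signed index types to the larger of the two bit widths.
inline IndexType WidenIndex(IndexType a, IndexType b) {
  return IndexType{std::max(a.bit_width, b.bit_width)};
}

// Render the type in the same style as the comments above, e.g. "int16".
inline std::string ToString(IndexType t) {
  return "int" + std::to_string(t.bit_width);
}
```

Under this sketch, `WidenIndex(IndexType{8}, IndexType{16})` yields a 16-bit
index type, which matches the schema reported by `Inspect` in the example.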