westonpace commented on code in PR #14444:
URL: https://github.com/apache/arrow/pull/14444#discussion_r1016037559
##########
cpp/src/arrow/dataset/discovery.cc:
##########
@@ -236,25 +236,35 @@ Result<std::vector<std::shared_ptr<Schema>>>
FileSystemDatasetFactory::InspectSc
InspectOptions options) {
std::vector<std::shared_ptr<Schema>> schemas;
+ ARROW_ASSIGN_OR_RAISE(auto partition_schema,
+ options_.partitioning.GetOrInferSchema(
+ StripPrefixAndFilename(files_,
options_.partition_base_dir)));
+
const bool has_fragments_limit = options.fragments >= 0;
int fragments = options.fragments;
for (const auto& info : files_) {
if (has_fragments_limit && fragments-- == 0) break;
auto result = format_->Inspect({info, fs_});
+
if (ARROW_PREDICT_FALSE(!result.ok())) {
return result.status().WithMessage(
"Error creating dataset. Could not read schema from '", info.path(),
"': ", result.status().message(), ". Is this a '",
format_->type_name(),
"' file?");
}
+
+ if (partition_schema->num_fields()) {
+ auto field_check =
+
result->get()->CanReferenceFieldsByNames(partition_schema->field_names());
+ if (ARROW_PREDICT_FALSE(field_check.ok())) {
+ return Status::Invalid(
+ "Error creating dataset. Partitioning field(s) present in
fragment.");
+ }
+ }
Review Comment:
This will catch a problem if the error is noticed during inspection but
sometimes we don't use all fragments to inspect a dataset and other times we
might not do dataset inspection at all (e.g. if the user provides the dataset
schema and the partitioning schema). I wonder if we might want to check
somewhere at scan time instead of during discovery.
Also, in some cases, maybe this is not always a bad thing. I seem to recall
users would sometimes store the schema information in a column in the file in
addition to the filename. Maybe a better behavior would be to silently ignore
the column in the file if there is partitioning information that specifies a
given column. Or at least to make it configurable (ignore vs error vs two
columns with the same name). @thisisnic any preference?
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]