[GitHub] [arrow] westonpace commented on a diff in pull request #14444: ARROW-17784: [C++] Opening a dataset where partitioning variable is already in the dataset should error differently

GitBox Mon, 07 Nov 2022 16:49:37 -0800


westonpace commented on code in PR #14444:
URL: https://github.com/apache/arrow/pull/14444#discussion_r1016037559



##########
cpp/src/arrow/dataset/discovery.cc:
##########
@@ -236,25 +236,35 @@ Result<std::vector<std::shared_ptr<Schema>>> FileSystemDatasetFactory::InspectSc
     InspectOptions options) {
   std::vector<std::shared_ptr<Schema>> schemas;
 
+  ARROW_ASSIGN_OR_RAISE(auto partition_schema,
+                        options_.partitioning.GetOrInferSchema(
+                            StripPrefixAndFilename(files_, options_.partition_base_dir)));
+
   const bool has_fragments_limit = options.fragments >= 0;
   int fragments = options.fragments;
   for (const auto& info : files_) {
     if (has_fragments_limit && fragments-- == 0) break;
     auto result = format_->Inspect({info, fs_});
+
     if (ARROW_PREDICT_FALSE(!result.ok())) {
       return result.status().WithMessage(
           "Error creating dataset. Could not read schema from '", info.path(),
           "': ", result.status().message(), ". Is this a '", format_->type_name(),
           "' file?");
     }
+
+    if (partition_schema->num_fields()) {
+      auto field_check =
+          result->get()->CanReferenceFieldsByNames(partition_schema->field_names());
+      if (ARROW_PREDICT_FALSE(field_check.ok())) {
+        return Status::Invalid(
+            "Error creating dataset. Partitioning field(s) present in fragment.");
+      }
+    }

Review Comment:
   This catches the problem only if the offending fragment is actually
   inspected. We don't always inspect every fragment, and sometimes we skip
   dataset inspection entirely (e.g. if the user provides both the dataset
   schema and the partitioning schema). I wonder if we should check at scan
   time instead of during discovery.
   
   Also, this is not always a bad thing. I seem to recall that users would
   sometimes store the partition information in a column in the file in
   addition to the filename. Maybe a better behavior would be to silently
   ignore the column in the file when the partitioning information specifies
   a column of the same name, or at least to make the behavior configurable
   (ignore vs. error vs. two columns with the same name). @thisisnic, any
   preference?
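
   To make the three options concrete, here is a minimal sketch using plain
   standard-library types instead of Arrow's `Schema`. The enum
   `PartitionFieldConflict` and the function `ResolveFragmentColumns` are
   hypothetical names chosen for illustration; nothing like them exists in
   Arrow today. The idea is that something like this could run per fragment
   at scan time, so it would not depend on which fragments were inspected.

```cpp
#include <algorithm>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical policy for a column that appears both in the file and in the
// partitioning. This is an illustration, not an existing Arrow option.
enum class PartitionFieldConflict { kIgnoreFileColumn, kError, kKeepBoth };

// Returns the fragment column names that should be read from the file,
// given the partition field names and the chosen policy.
std::vector<std::string> ResolveFragmentColumns(
    const std::vector<std::string>& fragment_fields,
    const std::vector<std::string>& partition_fields,
    PartitionFieldConflict policy) {
  std::vector<std::string> kept;
  for (const auto& name : fragment_fields) {
    const bool conflicts =
        std::find(partition_fields.begin(), partition_fields.end(), name) !=
        partition_fields.end();
    if (!conflicts || policy == PartitionFieldConflict::kKeepBoth) {
      kept.push_back(name);
    } else if (policy == PartitionFieldConflict::kError) {
      throw std::invalid_argument("Partitioning field '" + name +
                                  "' is also present in fragment");
    }
    // kIgnoreFileColumn: drop the file's copy; the partition value is used.
  }
  return kept;
}
```

   Unlike the `CanReferenceFieldsByNames` check in the diff, this sketch
   reacts to each conflicting name individually, so a fragment that contains
   only some of the partition fields is still handled.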



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
