[GitHub] [arrow] thisisnic commented on a diff in pull request #14444: ARROW-17784: [C++] Opening a dataset where partitioning variable is already in the dataset should error differently

GitBox Tue, 08 Nov 2022 04:30:47 -0800


thisisnic commented on code in PR #14444:
URL: https://github.com/apache/arrow/pull/14444#discussion_r1016569924



##########
cpp/src/arrow/dataset/discovery.cc:
##########
@@ -236,25 +236,35 @@ Result<std::vector<std::shared_ptr<Schema>>> 
FileSystemDatasetFactory::InspectSc
     InspectOptions options) {
   std::vector<std::shared_ptr<Schema>> schemas;
 
+  ARROW_ASSIGN_OR_RAISE(auto partition_schema,
+                        options_.partitioning.GetOrInferSchema(
+                            StripPrefixAndFilename(files_, 
options_.partition_base_dir)));
+
   const bool has_fragments_limit = options.fragments >= 0;
   int fragments = options.fragments;
   for (const auto& info : files_) {
     if (has_fragments_limit && fragments-- == 0) break;
     auto result = format_->Inspect({info, fs_});
+
     if (ARROW_PREDICT_FALSE(!result.ok())) {
       return result.status().WithMessage(
           "Error creating dataset. Could not read schema from '", info.path(),
           "': ", result.status().message(), ". Is this a '", 
format_->type_name(),
           "' file?");
     }
+
+    if (partition_schema->num_fields()) {
+      auto field_check =
+          
result->get()->CanReferenceFieldsByNames(partition_schema->field_names());
+      if (ARROW_PREDICT_FALSE(field_check.ok())) {
+        return Status::Invalid(
+            "Error creating dataset. Partitioning field(s) present in 
fragment.");
+      }
+    }

Review Comment:
   My preference would be either silently ignoring it or making it configurable 
(with one of the configurable options being to silently ignore it).
   
   Users might inherit poorly-designed datasets that they have no control over 
how they've been written, and I'd hate for it to be impossible for them to be 
able to work with them in Arrow and just get an error.
   
   I'm not sure how/if having a dataset with 2 columns with the same name would 
work in R as I think it'd trigger an error somewhere (if it didn't, I think 
we'd maybe want to make it do so?), but if it'd be useful to have that as a 
feature, we can just catch it in R and add our own custom error, so I don't see 
a problem in that.
   
   Silently ignoring would be fine too, as we can always add our own error in R 
if we want to warn users that there's duplication, and it at least allows 
execution.
   
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

[GitHub] [arrow] thisisnic commented on a diff in pull request #14444: ARROW-17784: [C++] Opening a dataset where partitioning variable is already in the dataset should error differently

Reply via email to