[GitHub] [arrow] westonpace commented on a change in pull request #10326: ARROW-12791: [R] Better error handling for DatasetFactory$Finish() when no format specified

GitBox Tue, 18 May 2021 10:50:29 -0700


westonpace commented on a change in pull request #10326:
URL: https://github.com/apache/arrow/pull/10326#discussion_r634619864




##########
File path: r/R/dataset.R
##########
@@ -93,8 +93,19 @@ open_dataset <- function(sources,
     return(dataset___UnionDataset__create(sources, schema))
   }
   factory <- DatasetFactory$create(sources, partitioning = partitioning, ...)
-  # Default is _not_ to inspect/unify schemas
-  factory$Finish(schema, isTRUE(unify_schemas))
+  
+  tryCatch(
+    # Default is _not_ to inspect/unify schemas
+    factory$Finish(schema, isTRUE(unify_schemas)),
+    error = function (e) {
+      msg <- conditionMessage(e)
+      if(grep("Parquet magic bytes not found in footer", msg)){
+        stop("Looks like these are not parquet files, did you mean to specify a 'format'?", call. = FALSE)

Review comment:
       There seems to be at least some room for improvement.  The transition 
from `read_parquet` to `open_dataset` happens at least partially in C++ as 
well (you are calling `DatasetFactory::Finish` here), although the terms 
differ slightly (e.g. fragments vs. sources).  So I think we could update the 
C++ error to something like "Dataset creation failed.  The fragment 
<fragment-to-string> did not match the expected <format> format: 
<child_error>".
   
   Concretely: "Dataset creation failed.  The fragment '/2019/July/myfile.csv' 
did not match the expected 'parquet' format: Parquet magic bytes not found in 
footer".
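
To make the proposed message shape concrete, here is a minimal sketch using only the C++ standard library. `WrapFragmentError` is a hypothetical helper for illustration, not an existing Arrow function, and the real change would presumably go through Arrow's own status/error plumbing rather than plain strings.

```cpp
#include <sstream>
#include <string>

// Hypothetical helper (NOT an Arrow API): wraps a low-level format error with
// the fragment and the expected format, so the message stays useful no matter
// which language binding surfaces it.
std::string WrapFragmentError(const std::string& fragment,
                              const std::string& format,
                              const std::string& child_error) {
  std::ostringstream out;
  out << "Dataset creation failed. The fragment '" << fragment
      << "' did not match the expected '" << format
      << "' format: " << child_error;
  return out.str();
}
```

Calling it with `"/2019/July/myfile.csv"`, `"parquet"`, and the Parquet footer error as the child error would produce the concrete message quoted above, and the bindings would not need to pattern-match on the child error text.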
   
   I worry that this kind of error translation in the R layer is a slippery 
slope.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]
