homatthew commented on code in PR #3818:
URL: https://github.com/apache/gobblin/pull/3818#discussion_r1379704435


##########
gobblin-modules/gobblin-orc/src/main/java/org/apache/gobblin/writer/GobblinBaseOrcWriter.java:
##########
@@ -259,6 +261,15 @@ public void commit()
       throws IOException {
     closeInternal();
     super.commit();
+    if(this.validateORCDuringCommit) {
+      try {
+        OrcFile.createReader(this.outputFile, new OrcFile.ReaderOptions(conf));

Review Comment:
   For future readers, I think the below observation is nuanced and worth 
spelling out. It may even be worth a comment.
   
   This will work for files of size [1,3] bytes. It will not catch empty files, 
which I think is a very very subtle thing. and I think that's okay as long as 
users are using native ORC readers.
   
   
https://github.com/apache/orc/blob/24beffb6fed6d408e25654e53c255f564c8bd8a9/java/core/src/java/org/apache/orc/impl/ReaderImpl.java#L797C1-L802
   
   Since it seems like part of the standard, computing engines like trino 
support it 
https://trino.io/blog/2019/05/29/improved-hive-bucketing.html#whats-the-problem.
 For some time, presto did not support these empty files but now also does to 
follow the convention of hive



##########
gobblin-modules/gobblin-orc/src/main/java/org/apache/gobblin/writer/GobblinBaseOrcWriter.java:
##########
@@ -259,6 +261,15 @@ public void commit()
       throws IOException {
     closeInternal();
     super.commit();
+    if(this.validateORCDuringCommit) {
+      try {
+        OrcFile.createReader(this.outputFile, new OrcFile.ReaderOptions(conf));
+      } catch (IOException ioException) {
+        log.error("Found error when validating ORC file {} during commit 
phase", this.outputFile, ioException);
+        log.error("Delete the malformed ORC file is successful: {}", 
this.fs.delete(this.outputFile, false));

Review Comment:
   Given the severity of failing to delete this ORC file, do you think we 
should retry this operation?
   
   Check for references to retryer in the code base for an easy out of the box 
impl



##########
gobblin-modules/gobblin-orc/src/main/java/org/apache/gobblin/writer/GobblinBaseOrcWriter.java:
##########
@@ -259,6 +261,15 @@ public void commit()
       throws IOException {
     closeInternal();
     super.commit();

Review Comment:
   This line calls `FsDataWriter` and triggers the moving from staging to 
output directory. Is there a reason for us to do all that work and then do the 
validation? 
   
   I also wonder why this is part of the commit step and not part of the close 
step. close does not call this method, but it does do the flush.
   
   If we close and the flushed file turns out not to be valid, we will miss the 
validation here. 



##########
gobblin-modules/gobblin-orc/src/main/java/org/apache/gobblin/writer/GobblinBaseOrcWriter.java:
##########
@@ -259,6 +261,15 @@ public void commit()
       throws IOException {
     closeInternal();
     super.commit();
+    if(this.validateORCDuringCommit) {
+      try {
+        OrcFile.createReader(this.outputFile, new OrcFile.ReaderOptions(conf));
+      } catch (IOException ioException) {
+        log.error("Found error when validating ORC file {} during commit 
phase", this.outputFile, ioException);
+        log.error("Delete the malformed ORC file is successful: {}", 
this.fs.delete(this.outputFile, false));

Review Comment:
   I also notice that the parent class `FsDataWriter` uses 
HadoopUtils.deletePath.  It's subtle, but perhaps we should use that method 
instead when deleting because they consider some edge case where it throws an 
io exception if the file exists but the file fails to delete. I'd imagine if 
the FileSystem delete covered that edge case, the util method would not have 
been created, but it's hard to validate when there will be an io exception 
because it is not documented anywhere.



##########
gobblin-modules/gobblin-orc/src/main/java/org/apache/gobblin/writer/GobblinBaseOrcWriter.java:
##########
@@ -259,6 +261,15 @@ public void commit()
       throws IOException {
     closeInternal();
     super.commit();
+    if(this.validateORCDuringCommit) {
+      try {
+        OrcFile.createReader(this.outputFile, new OrcFile.ReaderOptions(conf));
+      } catch (IOException ioException) {

Review Comment:
   From reading the readerimpl, it can throw 2 checked exceptions,
   
   - 
https://protobuf.dev/reference/java/api-docs/com/google/protobuf/InvalidProtocolBufferException.html
   
   - 
https://github.com/apache/orc/blob/main/java/core/src/java/org/apache/orc/FileFormatException.java#L25
   
   
   Both of which extend IOException.
   
   My question to you is what if there's any other runtime exception? Should we 
still delete? I lean toward yes. But maybe I am missing some edge case here.



##########
gobblin-modules/gobblin-orc/src/main/java/org/apache/gobblin/writer/GobblinBaseOrcWriter.java:
##########
@@ -259,6 +261,15 @@ public void commit()
       throws IOException {
     closeInternal();
     super.commit();
+    if(this.validateORCDuringCommit) {
+      try {
+        OrcFile.createReader(this.outputFile, new OrcFile.ReaderOptions(conf));
+      } catch (IOException ioException) {
+        log.error("Found error when validating ORC file {} during commit 
phase", this.outputFile, ioException);
+        log.error("Delete the malformed ORC file is successful: {}", 
this.fs.delete(this.outputFile, false));

Review Comment:
   And is there a world where this operation fails because of `File is closed` 
error? Do we need to account for that edge case?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to