homatthew commented on code in PR #3818:
URL: https://github.com/apache/gobblin/pull/3818#discussion_r1379704435
##########
gobblin-modules/gobblin-orc/src/main/java/org/apache/gobblin/writer/GobblinBaseOrcWriter.java:
##########
@@ -259,6 +261,15 @@ public void commit()
throws IOException {
closeInternal();
super.commit();
+ if(this.validateORCDuringCommit) {
+ try {
+ OrcFile.createReader(this.outputFile, new OrcFile.ReaderOptions(conf));
Review Comment:
For future readers, I think the below observation is nuanced and worth
spelling out. It may even be worth a comment.
This will work for files of size [1,3] bytes. It will not catch empty files,
which I think is a very very subtle thing. and I think that's okay as long as
users are using native ORC readers.
https://github.com/apache/orc/blob/24beffb6fed6d408e25654e53c255f564c8bd8a9/java/core/src/java/org/apache/orc/impl/ReaderImpl.java#L797C1-L802
Since it seems like part of the standard, computing engines like trino
support it
https://trino.io/blog/2019/05/29/improved-hive-bucketing.html#whats-the-problem.
For some time, presto did not support these empty files but now also does to
follow the convention of hive
##########
gobblin-modules/gobblin-orc/src/main/java/org/apache/gobblin/writer/GobblinBaseOrcWriter.java:
##########
@@ -259,6 +261,15 @@ public void commit()
throws IOException {
closeInternal();
super.commit();
+ if(this.validateORCDuringCommit) {
+ try {
+ OrcFile.createReader(this.outputFile, new OrcFile.ReaderOptions(conf));
+ } catch (IOException ioException) {
+ log.error("Found error when validating ORC file {} during commit
phase", this.outputFile, ioException);
+ log.error("Delete the malformed ORC file is successful: {}",
this.fs.delete(this.outputFile, false));
Review Comment:
Given the severity of failing to delete this ORC file, do you think we
should retry this operation?
Check for references to retryer in the code base for an easy out of the box
impl
##########
gobblin-modules/gobblin-orc/src/main/java/org/apache/gobblin/writer/GobblinBaseOrcWriter.java:
##########
@@ -259,6 +261,15 @@ public void commit()
throws IOException {
closeInternal();
super.commit();
Review Comment:
This line calls `FsDataWriter` and triggers the moving from staging to
output directory. Is there a reason for us to do all that work and then do the
validation?
I also wonder why this is part of the commit step and not part of the close
step. close does not call this method, but it does do the flush.
If we close and the flushed file turns out not to be valid, we will miss the
validation here.
##########
gobblin-modules/gobblin-orc/src/main/java/org/apache/gobblin/writer/GobblinBaseOrcWriter.java:
##########
@@ -259,6 +261,15 @@ public void commit()
throws IOException {
closeInternal();
super.commit();
+ if(this.validateORCDuringCommit) {
+ try {
+ OrcFile.createReader(this.outputFile, new OrcFile.ReaderOptions(conf));
+ } catch (IOException ioException) {
+ log.error("Found error when validating ORC file {} during commit
phase", this.outputFile, ioException);
+ log.error("Delete the malformed ORC file is successful: {}",
this.fs.delete(this.outputFile, false));
Review Comment:
I also notice that the parent class `FsDataWriter` uses
HadoopUtils.deletePath. It's subtle, but perhaps we should use that method
instead when deleting because they consider some edge case where it throws an
io exception if the file exists but the file fails to delete. I'd imagine if
the FileSystem delete covered that edge case, the util method would not have
been created, but it's hard to validate when there will be an io exception
because it is not documented anywhere.
##########
gobblin-modules/gobblin-orc/src/main/java/org/apache/gobblin/writer/GobblinBaseOrcWriter.java:
##########
@@ -259,6 +261,15 @@ public void commit()
throws IOException {
closeInternal();
super.commit();
+ if(this.validateORCDuringCommit) {
+ try {
+ OrcFile.createReader(this.outputFile, new OrcFile.ReaderOptions(conf));
+ } catch (IOException ioException) {
Review Comment:
From reading the readerimpl, it can throw 2 checked exceptions,
-
https://protobuf.dev/reference/java/api-docs/com/google/protobuf/InvalidProtocolBufferException.html
-
https://github.com/apache/orc/blob/main/java/core/src/java/org/apache/orc/FileFormatException.java#L25
Both of which extend IOException.
My question to you is what if there's any other runtime exception? Should we
still delete? I lean toward yes. But maybe I am missing some edge case here.
##########
gobblin-modules/gobblin-orc/src/main/java/org/apache/gobblin/writer/GobblinBaseOrcWriter.java:
##########
@@ -259,6 +261,15 @@ public void commit()
throws IOException {
closeInternal();
super.commit();
+ if(this.validateORCDuringCommit) {
+ try {
+ OrcFile.createReader(this.outputFile, new OrcFile.ReaderOptions(conf));
+ } catch (IOException ioException) {
+ log.error("Found error when validating ORC file {} during commit
phase", this.outputFile, ioException);
+ log.error("Delete the malformed ORC file is successful: {}",
this.fs.delete(this.outputFile, false));
Review Comment:
And is there a world where this operation fails because of `File is closed`
error? Do we need to account for that edge case?
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]