[GitHub] [iceberg] openinx commented on a change in pull request #3784: ORC:ORC supports rolling writers.

GitBox Thu, 03 Mar 2022 19:41:39 -0800


openinx commented on a change in pull request #3784:
URL: https://github.com/apache/iceberg/pull/3784#discussion_r819242211




##########
File path: orc/src/main/java/org/apache/iceberg/orc/OrcFileAppender.java
##########
@@ -99,9 +104,27 @@ public Metrics metrics() {
 
   @Override
   public long length() {
-    Preconditions.checkState(isClosed,
-        "Cannot return length while appending to an open file.");
-    return file.toInputFile().getLength();
+    if (isClosed) {
+      return file.toInputFile().getLength();
+    }
+    if (this.treeWriter == null) {
+      throw new RuntimeException("Can't get the length!");
+    }
+    long estimateMemory = this.treeWriter.estimateMemory();
+
+    long dataLength = 0;
+    try {
+      List<StripeInformation> stripes = writer.getStripes();
+      if (!stripes.isEmpty()) {
+        StripeInformation stripeInformation = stripes.get(stripes.size() - 1);
+        dataLength = stripeInformation != null ? stripeInformation.getOffset() + stripeInformation.getLength() : 0;
+      }
+    } catch (IOException e) {
+      throw new UncheckedIOException(String.format("Can't get stripes from file %s", file.location()), e);
+    }
+
+    // This value is estimated, not actual.
+    return dataLength + estimateMemory + batch.size;

Review comment:
       I read https://github.com/apache/iceberg/pull/3784#issuecomment-1022891787.
For the bytes already persisted to the file, I think using the last stripe's
offset plus its length should be correct. For the batch vectors that are
already encoded in memory, I think `TreeWriter#estimateMemory` should be okay.
   
   But for the batch vector whose rows have not yet been flushed to encoded
memory, using `batch.size` is not correct, because the rows can be of any data
type, such as Integer, Long, Timestamp, or String. As their widths are not the
same, I think we may need to use an average row width multiplied by
`batch.size` (which is actually the row count).
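
   A minimal sketch of the three-part estimate described above, assuming a fixed per-type width guess. The class `LengthEstimateSketch`, the `ColumnKind` enum, and its byte widths are hypothetical and are not part of Iceberg or ORC; in the real appender the inputs would come from the last `StripeInformation`, `TreeWriter#estimateMemory`, and `batch.size`.

```java
import java.util.List;

final class LengthEstimateSketch {
  // Hypothetical per-type width guesses in bytes; illustrative values only.
  enum ColumnKind {
    INTEGER(4), LONG(8), TIMESTAMP(12), STRING(32);

    final int widthBytes;

    ColumnKind(int widthBytes) {
      this.widthBytes = widthBytes;
    }
  }

  // Sum of the assumed widths of all columns gives an average row width.
  static long averageRowWidth(List<ColumnKind> columns) {
    long width = 0;
    for (ColumnKind column : columns) {
      width += column.widthBytes;
    }
    return width;
  }

  // persisted bytes + encoded in-memory bytes + (average row width * buffered row count)
  static long estimateLength(long lastStripeOffset, long lastStripeLength,
                             long encodedMemoryBytes, List<ColumnKind> columns,
                             int bufferedRows) {
    long persistedBytes = lastStripeOffset + lastStripeLength;
    long bufferedBytes = averageRowWidth(columns) * bufferedRows;
    return persistedBytes + encodedMemoryBytes + bufferedBytes;
  }

  public static void main(String[] args) {
    List<ColumnKind> schema = List.of(ColumnKind.INTEGER, ColumnKind.LONG, ColumnKind.STRING);
    // Last stripe at offset 1024 with length 4096, 2048 encoded bytes, 100 buffered rows.
    System.out.println(estimateLength(1024, 4096, 2048, schema, 100));
  }
}
```

   The point of the sketch is only that the unflushed part scales with both the row count and the schema width, so a result like this stays an estimate rather than the actual file length.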






