kbendick commented on a change in pull request #3784:
URL: https://github.com/apache/iceberg/pull/3784#discussion_r833663292
##########
File path: orc/src/main/java/org/apache/iceberg/orc/OrcFileAppender.java
##########
@@ -99,9 +112,28 @@ public Metrics metrics() {
@Override
public long length() {
- Preconditions.checkState(isClosed,
- "Cannot return length while appending to an open file.");
- return file.toInputFile().getLength();
+ if (isClosed) {
+ return file.toInputFile().getLength();
+ }
+
+ Preconditions.checkNotNull(treeWriter, "Can't get three writer!");
Review comment:
Nit: Typo in `tree writer` (unneeded h).
Additionally, we typically format our exception messages as `{Problem}
{Cause}`. And we try to make them readable for end users. It's unlikely that
casual end users know what `tree writer` is. So something more along the lines
of `Cannot estimate length of file being written as the ORC writer's internal
writer is not present` might be more useful.
##########
File path: orc/src/main/java/org/apache/iceberg/orc/OrcFileAppender.java
##########
@@ -99,9 +112,28 @@ public Metrics metrics() {
@Override
public long length() {
- Preconditions.checkState(isClosed,
- "Cannot return length while appending to an open file.");
- return file.toInputFile().getLength();
+ if (isClosed) {
+ return file.toInputFile().getLength();
+ }
+
+ Preconditions.checkNotNull(treeWriter, "Can't get three writer!");
+
+ long estimateMemory = treeWriter.estimateMemory();
+
+ long dataLength = 0;
+ try {
+ List<StripeInformation> stripes = writer.getStripes();
+ if (!stripes.isEmpty()) {
+ StripeInformation stripeInformation = stripes.get(stripes.size() - 1);
+ dataLength = stripeInformation != null ? stripeInformation.getOffset()
+ stripeInformation.getLength() : 0;
+ }
+ } catch (IOException e) {
+ throw new UncheckedIOException(String.format("Can't get strip's length
from the file writer with path: %s.",
Review comment:
Nit: Typo - `Stripe's` not `strip's`.
##########
File path:
orc/src/test/java/org/apache/iceberg/orc/TestEstimateOrcAvgWidthVisitor.java
##########
@@ -0,0 +1,218 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied. See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.orc;
+
+import org.apache.iceberg.Schema;
+import org.apache.iceberg.types.Types;
+import org.apache.orc.TypeDescription;
+import org.junit.Assert;
+import org.junit.Test;
+
+import static org.apache.iceberg.types.Types.NestedField.optional;
+import static org.apache.iceberg.types.Types.NestedField.required;
+
+public class TestEstimateOrcAvgWidthVisitor {
+
+ // all supported fields
+ protected static final Types.NestedField ID_FIELD = required(1, "id",
Types.IntegerType.get());
+ protected static final Types.NestedField DATA_FIELD = optional(2, "data",
Types.StringType.get());
+ protected static final Types.NestedField FLOAT_FIELD = required(3, "float",
Types.FloatType.get());
+ protected static final Types.NestedField DOUBLE_FIELD = optional(4,
"double", Types.DoubleType.get());
+ protected static final Types.NestedField DECIMAL_FIELD = optional(5,
"decimal", Types.DecimalType.of(5, 3));
+ protected static final Types.NestedField FIXED_FIELD = optional(7, "fixed",
Types.FixedType.ofLength(4));
+ protected static final Types.NestedField BINARY_FIELD = optional(8,
"binary", Types.BinaryType.get());
+ protected static final Types.NestedField FLOAT_LIST_FIELD = optional(9,
"floatList",
+ Types.ListType.ofRequired(10, Types.FloatType.get()));
+ protected static final Types.NestedField LONG_FIELD = optional(11, "long",
Types.LongType.get());
+ protected static final Types.NestedField BOOLEAN_FIELD = optional(12,
"boolean", Types.BooleanType.get());
+ protected static final Types.NestedField TIMESTAMP_ZONE_FIELD = optional(13,
"timestampZone",
+ Types.TimestampType.withZone());
+ protected static final Types.NestedField TIMESTAMP_FIELD = optional(14,
"timestamp",
+ Types.TimestampType.withoutZone());
+ protected static final Types.NestedField DATE_FIELD = optional(15, "date",
Types.DateType.get());
+ protected static final Types.NestedField UUID_FIELD = required(16, "uuid",
Types.UUIDType.get());
+
+ protected static final Types.NestedField MAP_FIELD_1 = optional(17, "map1",
+ Types.MapType.ofOptional(18, 19, Types.FloatType.get(),
Types.StringType.get())
+ );
+ protected static final Types.NestedField MAP_FIELD_2 = optional(20, "map2",
+ Types.MapType.ofOptional(21, 22, Types.IntegerType.get(),
Types.DoubleType.get())
+ );
+ protected static final Types.NestedField STRUCT_FIELD = optional(23,
"struct", Types.StructType.of(
+ required(24, "booleanField", Types.BooleanType.get()),
+ optional(25, "date", Types.DateType.get()),
+ optional(27, "timestamp", Types.TimestampType.withZone())
+ ));
+
+ @Test
+ public void testEstimateDataAveWidth() {
+ // create a schema with integer field
+ Schema schemaWithInteger = new Schema(
+ ID_FIELD
+ );
+ TypeDescription orcSchemaWithInteger =
ORCSchemaUtil.convert(schemaWithInteger);
+ long estimateLength = getEstimateLength(orcSchemaWithInteger);
+ Assert.assertEquals("Estimated average length of integer must be 8.", 8,
estimateLength);
+
+ // create a schema with string field
+ Schema schemaWithString = new Schema(
+ DATA_FIELD
+ );
+ TypeDescription orcSchemaWithString =
ORCSchemaUtil.convert(schemaWithString);
+ estimateLength = getEstimateLength(orcSchemaWithString);
+ Assert.assertEquals("Estimated average length of string must be 128.",
128, estimateLength);
+
+ // create a schema with float field
+ Schema schemaWithFloat = new Schema(
+ FLOAT_FIELD
+ );
+ TypeDescription orcSchemaWithFloat =
ORCSchemaUtil.convert(schemaWithFloat);
+ estimateLength = getEstimateLength(orcSchemaWithFloat);
+ Assert.assertEquals("Estimated average length of float must be 8.", 8,
estimateLength);
+
+ // create a schema with double field
+ Schema schemaWithDouble = new Schema(
+ DOUBLE_FIELD
+ );
+ TypeDescription orcSchemaWithDouble =
ORCSchemaUtil.convert(schemaWithDouble);
+ estimateLength = getEstimateLength(orcSchemaWithDouble);
+ Assert.assertEquals("Estimated average length of double must be 8.", 8,
estimateLength);
+
+ // create a schema with decimal field
+ Schema schemaWithDecimal = new Schema(
+ DECIMAL_FIELD
+ );
+ TypeDescription orcSchemaWithDecimal =
ORCSchemaUtil.convert(schemaWithDecimal);
+ estimateLength = getEstimateLength(orcSchemaWithDecimal);
+ Assert.assertEquals("Estimated average length of decimal must be 7.", 7,
estimateLength);
+
+ // create a schema with fixed field
+ Schema schemaWithFixed = new Schema(
+ FIXED_FIELD
+ );
+ TypeDescription orcSchemaWithFixed =
ORCSchemaUtil.convert(schemaWithFixed);
+ estimateLength = getEstimateLength(orcSchemaWithFixed);
+ Assert.assertEquals("Estimated average length of fixed must be 128.", 128,
estimateLength);
+
+ // create a schema with binary field
+ Schema schemaWithBinary = new Schema(
+ BINARY_FIELD
+ );
+ TypeDescription orcSchemaWithBinary =
ORCSchemaUtil.convert(schemaWithBinary);
+ estimateLength = getEstimateLength(orcSchemaWithBinary);
+ Assert.assertEquals("Estimated average length of binary must be 128.",
128, estimateLength);
+
+ // create a schema with list field
+ Schema schemaWithList = new Schema(
+ FLOAT_LIST_FIELD
+ );
+ TypeDescription orcSchemaWithList = ORCSchemaUtil.convert(schemaWithList);
+ estimateLength = getEstimateLength(orcSchemaWithList);
+ Assert.assertEquals("Estimated average length of list must be 8.", 8,
estimateLength);
+
+ // create a schema with long field
+ Schema schemaWithLong = new Schema(
+ LONG_FIELD
+ );
+ TypeDescription orcSchemaWithLong = ORCSchemaUtil.convert(schemaWithLong);
+ estimateLength = getEstimateLength(orcSchemaWithLong);
+ Assert.assertEquals("Estimated average length of long must be 8.", 8,
estimateLength);
+
+ // create a schema with boolean field
+ Schema schemaWithBoolean = new Schema(
+ BOOLEAN_FIELD
+ );
Review comment:
Nit: These can be on one line, like `Schema booSchema = new
Schema(BOOLEAN_FIELD)`. The comment above each one is also unnecessary.
The same applies to all of the `Schema` in this test (and possibly
elsewhere). There's no need to use multiple lines here.
##########
File path: orc/src/main/java/org/apache/iceberg/orc/OrcFileAppender.java
##########
@@ -66,6 +71,11 @@
this.metricsConfig = metricsConfig;
TypeDescription orcSchema = ORCSchemaUtil.convert(schema);
+
+ this.avgRowByteSize =
+ OrcSchemaVisitor.visitSchema(orcSchema, new
EstimateOrcAvgWidthVisitor()).stream().reduce(Integer::sum)
+ .orElse(0);
Review comment:
The use of `orElse(0)` concerns me somewhat.
Looking at its usage, it seems as though using `avgRowByteSize` of 0 would
mean that the entirety of `batch.size` would be unaccounted for in the estimate
in the `length` function.
```java
return (long) (dataLength + (estimateMemory + (long) batch.size *
avgRowByteSize) * 0.2);
```
Under what situations would we _expect_ this to reasonably return 0? Is that
possible / expected in some edge case, or more indicative of a bug?
Would it make sense to default to some non-zero value (even 1) so that the
ongoing batch.size isn't entirely dropped?
At the very least, it seems like we should potentially log a debug message
stating that `0` is being used. If user's are investigating ORC files being
written at sizes they find strange, having a log would be beneficial.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]