lxynov commented on a change in pull request #199:
URL: https://github.com/apache/iceberg/pull/199#discussion_r440378377



##########
File path: orc/src/main/java/org/apache/iceberg/orc/OrcMetrics.java
##########
@@ -40,30 +67,211 @@ public static Metrics fromInputFile(InputFile file) {
     return fromInputFile(file, config);
   }
 
-  public static Metrics fromInputFile(InputFile file, Configuration config) {
+  static Metrics fromInputFile(InputFile file, Configuration config) {
     try (Reader orcReader = ORC.newFileReader(file, config)) {
-
-      // TODO: implement rest of the methods for ORC metrics
-      // https://github.com/apache/incubator-iceberg/pull/199
-      return new Metrics(orcReader.getNumberOfRows(),
-          null,
-          null,
-          Collections.emptyMap(),
-          null,
-          null);
+      return buildOrcMetrics(orcReader.getNumberOfRows(), orcReader.getSchema(), orcReader.getStatistics());
     } catch (IOException ioe) {
-      throw new RuntimeIOException(ioe, "Failed to read footer of file: %s", file);
+      throw new RuntimeIOException(ioe, "Failed to open file: %s", file.location());
     }
   }
 
   static Metrics fromWriter(Writer writer) {
-    // TODO: implement rest of the methods for ORC metrics in
-    // https://github.com/apache/incubator-iceberg/pull/199
-    return new Metrics(writer.getNumberOfRows(),
-        null,
-        null,
-        Collections.emptyMap(),
-        null,
-        null);
+    try {
+      return buildOrcMetrics(writer.getNumberOfRows(), writer.getSchema(), writer.getStatistics());
+    } catch (IOException ioe) {
+      throw new RuntimeIOException(ioe, "Failed to get statistics from writer");
+    }
+  }
+
+  private static Metrics buildOrcMetrics(final long numOfRows, final TypeDescription orcSchema,
+                                         final ColumnStatistics[] colStats) {
+    final Schema schema = ORCSchemaUtil.convert(orcSchema);
+    final Set<TypeDescription> columnsInContainers = findColumnsInContainers(schema, orcSchema);
+    Map<Integer, Long> columnSizes = Maps.newHashMapWithExpectedSize(colStats.length);
+    Map<Integer, Long> valueCounts = Maps.newHashMapWithExpectedSize(colStats.length);
+    Map<Integer, Long> nullCounts = Maps.newHashMapWithExpectedSize(colStats.length);
+    Map<Integer, ByteBuffer> lowerBounds = Maps.newHashMap();
+    Map<Integer, ByteBuffer> upperBounds = Maps.newHashMap();
+
+    for (int i = 0; i < colStats.length; i++) {
+      final ColumnStatistics colStat = colStats[i];
+      final TypeDescription orcCol = orcSchema.findSubtype(i);
+      final Optional<Types.NestedField> icebergColOpt = ORCSchemaUtil.icebergID(orcCol)
+          .map(schema::findField);
+
+      if (icebergColOpt.isPresent()) {
+        final Types.NestedField icebergCol = icebergColOpt.get();
+        final int fieldId = icebergCol.fieldId();
+
+        columnSizes.put(fieldId, colStat.getBytesOnDisk());
+
+        if (!columnsInContainers.contains(orcCol)) {
+          // Since ORC does not track null values nor repeated ones, the value count for
+          // columns in containers (maps, lists) may be larger than what it actually is;
+          // however, these are not used in expressions right now. For such cases, we use
+          // the number of values directly stored in ORC.
+          if (colStat.hasNull()) {
+            nullCounts.put(fieldId, numOfRows - colStat.getNumberOfValues());
+          } else {
+            nullCounts.put(fieldId, 0L);
+          }
+          valueCounts.put(fieldId, colStat.getNumberOfValues() + nullCounts.get(fieldId));
+
+          Optional<ByteBuffer> orcMin = (colStat.getNumberOfValues() > 0) ?
+              fromOrcMin(icebergCol, colStat) : Optional.empty();
+          orcMin.ifPresent(byteBuffer -> lowerBounds.put(icebergCol.fieldId(), byteBuffer));
+          Optional<ByteBuffer> orcMax = (colStat.getNumberOfValues() > 0) ?
+              fromOrcMax(icebergCol, colStat) : Optional.empty();
+          orcMax.ifPresent(byteBuffer -> upperBounds.put(icebergCol.fieldId(), byteBuffer));

Review comment:
       @edgarRd @rdblue @rdsr @shardulm94 In ORC, the column stats will have min/max values even when there are null values within the same file. (See [here](https://github.com/apache/orc/blob/73ba385f9534e1d919402ae0ad4ce229b33dc777/java/core/src/java/org/apache/orc/impl/writer/IntegerTreeWriter.java#L85-L94).) Is this okay for Iceberg?
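For context, here is a minimal standalone sketch of the count arithmetic the patch relies on (the class and method names below are illustrative, not part of the ORC or Iceberg APIs): ORC's `ColumnStatistics.getNumberOfValues()` counts only non-null values, so the patch derives the null count as `numOfRows - getNumberOfValues()` and the Iceberg value count as their sum.

```java
// Hypothetical sketch of the count derivation in buildOrcMetrics;
// these helpers are illustrative only, not the ORC API.
public class OrcCountsSketch {

  // ORC's getNumberOfValues() counts only non-null values, so nulls
  // are derived from the total row count when hasNull() is true.
  static long nullCount(long numOfRows, long numberOfValues, boolean hasNull) {
    return hasNull ? numOfRows - numberOfValues : 0L;
  }

  // Iceberg's value count includes nulls, so the derived null count
  // is added back onto the non-null value count.
  static long valueCount(long numOfRows, long numberOfValues, boolean hasNull) {
    return numberOfValues + nullCount(numOfRows, numberOfValues, hasNull);
  }

  public static void main(String[] args) {
    // 100 rows with 90 non-null values: 10 nulls, value count 100.
    System.out.println(nullCount(100L, 90L, true));
    System.out.println(valueCount(100L, 90L, true));
  }
}
```

Note that for top-level (non-container) columns this makes the value count equal the row count, which is consistent with the min/max question above: the min/max come only from the non-null values.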




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]



---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]