[GitHub] [orc] omalley commented on a change in pull request #582: ORC-697: Improve scan tool to report the location of corruption.

GitBox Mon, 14 Dec 2020 16:34:42 -0800


omalley commented on a change in pull request #582:
URL: https://github.com/apache/orc/pull/582#discussion_r542954295




##########
File path: java/tools/src/java/org/apache/orc/tools/ScanData.java
##########
@@ -40,41 +40,168 @@
 
   static CommandLine parseCommandLine(String[] args) throws ParseException {
     Options options = new Options()
-        .addOption("help", "h", false, "Provide help");
-    return new GnuParser().parse(options, args);
+        .addOption("s", "schema", false, "Print schema")
+        .addOption("h", "help", false, "Provide help");
+    return new DefaultParser().parse(options, args);
   }
 
+  static int calculateBestVectorSize(int indexStride) {
+    if (indexStride == 0) {
+      return 1024;
+    }
+    // how many 1024 batches do we have in an index stride?
+    int batchCount = (indexStride + 1023) / 1024;
+    return indexStride / batchCount;
+  }
+
+  static class LocationInfo {
+    final long firstRow;
+    final long followingRow;
+    final int stripeId;
+    final long row;
+
+    LocationInfo(long firstRow, long followingRow, int stripeId,
+        long row) {
+      this.firstRow = firstRow;
+      this.followingRow = followingRow;
+      this.stripeId = stripeId;
+      this.row = row;
+    }
+
+    public String toString() {
+      return String.format("row %d in stripe %d (rows %d-%d)",
+          row, stripeId, firstRow, followingRow);
+    }
+  }
+
+  /**
+   * Given a row, find the stripe that contains that row.
+   * @param reader the file reader
+   * @param row the global row number in the file
+   * @return the information about that row in the file
+   */
+  static LocationInfo findStripeInfo(Reader reader, long row) {
+    long firstRow = 0;
+    int stripeId = 0;
+    for (StripeInformation stripe: reader.getStripes()) {
+      stripeId += 1;
+      long lastRow = firstRow + stripe.getNumberOfRows();
+      if (firstRow <= row && row < lastRow) {
+        return new LocationInfo(firstRow, lastRow, stripeId, row);
+      }
+      firstRow = lastRow;
+    }
+    return new LocationInfo(reader.getNumberOfRows(),
+        reader.getNumberOfRows(), stripeId, row);
+  }
 
-  static void main(Configuration conf, String[] args
-                   ) throws IOException, JSONException, ParseException {
+  /**
+   * Given a failure point, find the first place that the ORC reader can
+   * recover.
+   * @param reader the ORC reader
+   * @param current the position of the failure
+   * @param batchSize the size of the batch that we tried to read
+   * @return the location that we should recover to
+   */
+  static LocationInfo findRecoveryPoint(Reader reader, LocationInfo current,
+                                        int batchSize) {
+    int stride = reader.getRowIndexStride();
+    long result;
+    // In the worst case, just move to the next stripe
+    if (stride == 0 ||
+        current.row + batchSize >= current.followingRow) {
+      result = current.followingRow;
+    } else {
+      long rowInStripe = current.row + batchSize - current.firstRow;
+      result = current.firstRow + (rowInStripe + stride - 1) / stride * stride;
+    }
+    return findStripeInfo(reader, result);
+  }
+
+  static boolean findBadColumns(Reader reader, LocationInfo current, int batchSize,
+      TypeDescription column, boolean[] include) {
+    include[column.getId()] = true;
+    TypeDescription schema = reader.getSchema();
+    boolean result = false;
+    if (column.getChildren() == null) {
+      int row = 0;
+      try (RecordReader rows = reader.rows(reader.options().include(include))) {
+        rows.seekToRow(current.row);
+        VectorizedRowBatch batch = schema.createRowBatch(
+            TypeDescription.RowBatchVersion.USE_DECIMAL64, 1);
+        for(row=0; row < batchSize; ++row) {
+          rows.nextBatch(batch);
+        }
+      } catch (Throwable t) {
+        System.out.printf("Column %d failed at row %d%n", column.getId(),

Review comment:
       It looks like:
   
   ```
   Processing data file examples/corrupt/missing_length_stream_in_string_dict.orc [length: 1788]
   {"category": "struct", "id": 0, "max": 11, "fields": [
   {  "id": {"category": "int", "id": 1, "max": 1}},
   {  "bool_col": {"category": "boolean", "id": 2, "max": 2}},
   {  "tinyint_col": {"category": "tinyint", "id": 3, "max": 3}},
   {  "smallint_col": {"category": "smallint", "id": 4, "max": 4}},
   {  "int_col": {"category": "int", "id": 5, "max": 5}},
   {  "bigint_col": {"category": "bigint", "id": 6, "max": 6}},
   {  "float_col": {"category": "float", "id": 7, "max": 7}},
   {  "double_col": {"category": "double", "id": 8, "max": 8}},
   {  "date_string_col": {"category": "string", "id": 9, "max": 9}},
   {  "string_col": {"category": "string", "id": 10, "max": 10}},
   {  "timestamp_col": {"category": "timestamp", "id": 11, "max": 11}}]}
   Unable to read batch at row 0 in stripe 1 (rows 0-300), recovery at row 300 in stripe 1 (rows 300-300)
   java.lang.NullPointerException
        at org.apache.orc.impl.TreeReaderFactory$StringDictionaryTreeReader.readDictionaryByteArray(TreeReaderFactory.java:2237)
        at org.apache.orc.impl.TreeReaderFactory$StringDictionaryTreeReader.nextVector(TreeReaderFactory.java:2198)
        at org.apache.orc.impl.TreeReaderFactory$StringTreeReader.nextVector(TreeReaderFactory.java:1897)
        at org.apache.orc.impl.reader.tree.StructBatchReader.readBatchColumn(StructBatchReader.java:42)
        at org.apache.orc.impl.reader.tree.StructBatchReader.nextBatch(StructBatchReader.java:72)
        at org.apache.orc.impl.RecordReaderImpl.nextBatch(RecordReaderImpl.java:1236)
        at org.apache.orc.tools.ScanData.main(ScanData.java:175)
        at org.apache.orc.tools.Driver.main(Driver.java:126)
   Column 9 failed at row 0
   Column 10 failed at row 0
   Column 11 failed at row 0
   Unable to open file: examples/corrupt/missing_length_stream_in_string_dict.orc
   java.lang.IllegalArgumentException: Seek after the end of reader range
        at org.apache.orc.impl.RecordReaderImpl.findStripe(RecordReaderImpl.java:1310)
        at org.apache.orc.impl.RecordReaderImpl.seekToRow(RecordReaderImpl.java:1362)
        at org.apache.orc.tools.ScanData.main(ScanData.java:188)
        at org.apache.orc.tools.Driver.main(Driver.java:126)
   ```
   I'll update the patch to remove the final exception, which happens because it is trying to seek past the end of the file.
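
   To illustrate why the seek lands past the end of the file, here is a minimal sketch that mirrors the arithmetic of `findRecoveryPoint` from the diff using plain numbers instead of a `Reader`. The class `RecoveryPointSketch` and the helper `canSeek` are hypothetical names used only for illustration, and the batch size of 1000 and row index stride of 10000 are assumed values, not taken from the output above. It is one possible shape for the guard, not necessarily what the updated patch will do.

   ```java
   public class RecoveryPointSketch {

     // Same arithmetic as findRecoveryPoint in the diff, with the stripe
     // bounds passed in directly rather than read from a LocationInfo.
     static long recoveryRow(long firstRow, long followingRow, long row,
                             int batchSize, int stride) {
       // In the worst case, just move to the next stripe.
       if (stride == 0 || row + batchSize >= followingRow) {
         return followingRow;
       }
       long rowInStripe = row + batchSize - firstRow;
       // Round up to the next row index stride boundary.
       return firstRow + (rowInStripe + stride - 1) / stride * stride;
     }

     // Hypothetical guard: only seek when the target is a real row.
     static boolean canSeek(long targetRow, long totalRows) {
       return targetRow < totalRows;
     }

     public static void main(String[] args) {
       long totalRows = 300;  // single stripe covering rows 0-300, as in the output
       // Assumed values: batch size 1000, row index stride 10000.
       long target = recoveryRow(0, 300, 0, 1000, 10000);
       System.out.println("recovery row: " + target);
       System.out.println("safe to seek: " + canSeek(target, totalRows));
     }
   }
   ```

   With a single 300-row stripe the failed batch reaches the end of the stripe, so the recovery row is 300, which equals the row count of the file. Checking the target against the total row count before calling `seekToRow` would let the scan stop cleanly instead of throwing.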




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]
