paul-rogers commented on a change in pull request #2026: DRILL-7330: Implement metadata usage for all format plugins
URL: https://github.com/apache/drill/pull/2026#discussion_r392610254
 
 

 ##########
 File path: exec/java-exec/src/main/java/org/apache/drill/exec/store/ColumnExplorer.java
 ##########
 @@ -292,39 +289,86 @@ public static int getPartitionDepth(FileSelection selection) {
    * @param includeFileImplicitColumns if file implicit columns should be included into the result
    * @param fs                         file system
    * @param index                      index of row group to populate
+   * @param start                      start of row group to populate
+   * @param length                     length of row group to populate
    * @return implicit columns map
    */
   public Map<String, String> populateImplicitAndInternalColumns(Path filePath,
       List<String> partitionValues, boolean includeFileImplicitColumns, FileSystem fs, int index, long start, long length) {
 
     Map<String, String> implicitValues =
-        new LinkedHashMap<>(populateImplicitColumns(filePath, partitionValues, includeFileImplicitColumns));
+        new LinkedHashMap<>(populateImplicitAndInternalColumns(filePath, partitionValues, includeFileImplicitColumns, fs));
 
-    selectedInternalColumns.forEach((key, value) -> {
-      switch (value) {
+    selectedInternalColumns.forEach(
+        (key, value) -> implicitValues.put(key, getImplicitColumnValue(value, filePath, fs, index, start, length)));
+
+    return implicitValues;
+  }
+
+  /**
+   * Returns implicit column value for specified implicit file column.
+   *
+   * @param column   implicit file column
+   * @param filePath file path, used to populate file implicit columns
+   * @param fs       file system
+   * @param index    row group index
+   * @param start    row group start
+   * @param length   row group length
+   * @return implicit column value for specified implicit file column
+   */
+  private static String getImplicitColumnValue(ImplicitFileColumn column, Path filePath,
+      FileSystem fs, Integer index, Long start, Long length) {
+    if (column instanceof ImplicitFileColumns) {
+      ImplicitFileColumns fileColumn = (ImplicitFileColumns) column;
+      return fileColumn.getValue(filePath);
+    } else if (column instanceof ImplicitInternalFileColumns) {
+      ImplicitInternalFileColumns fileColumn = (ImplicitInternalFileColumns) column;
+      switch (fileColumn) {
         case ROW_GROUP_INDEX:
-          implicitValues.put(key, String.valueOf(index));
-          break;
+          return index != null ? String.valueOf(index) : null;
         case ROW_GROUP_START:
-          implicitValues.put(key, String.valueOf(start));
-          break;
+          return start != null ? String.valueOf(start) : null;
         case ROW_GROUP_LENGTH:
-          implicitValues.put(key, String.valueOf(length));
-          break;
+          return length != null ? String.valueOf(length) : null;
         case PROJECT_METADATA:
-          implicitValues.put(key, Boolean.TRUE.toString());
-          break;
+          return Boolean.TRUE.toString();
         case LAST_MODIFIED_TIME:
           try {
-            implicitValues.put(key, String.valueOf(fs.getFileStatus(filePath).getModificationTime()));
+            return fs != null ? String.valueOf(fs.getFileStatus(filePath).getModificationTime()) : null;
 Review comment:
  Here is exactly the issue that concerned me above. If the file disappeared, this code will now be what detects it. Notice that the exception is very generic: it gives the user no information about the problem.
   
   The `ScanOperatorExec` is designed to handle this case: if a file does not 
exist, the batch reader's `open()` should return `false`. The 
`ScanOperatorExec` then skips over that file and moves to the next. There is 
some tricky code to deal with schemas in that case.
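  To illustrate the contract I mean (a deliberately simplified sketch, not the actual Drill API; `BatchReader` and the loop below are hypothetical stand-ins for the real `ScanOperatorExec` machinery):
  
  ```java
  // Hypothetical simplification: open() returns false for a missing file,
  // and the scan loop skips that reader instead of failing the query.
  interface BatchReader {
    boolean open();
  }
  
  public class SkippingScanExample {
    public static void main(String[] args) {
      BatchReader missing = () -> false; // file not found: signal "skip me"
      BatchReader present = () -> true;  // file exists: proceed normally
      for (BatchReader reader : new BatchReader[] {missing, present}) {
        if (!reader.open()) {
          System.out.println("skipping reader");
          continue; // move on to the next file
        }
        System.out.println("reading batches");
      }
    }
  }
  ```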
   
   Perhaps a better solution here is to return 0 if the file is not found, 
realizing that something else will deal with the missing file before any harm 
can be done with the false modification time.
   
  Why do we handle missing files this way? On the desktop, a missing file might warrant failing the query. (I had only one file and it was deleted. That is bad.) But on a system such as that described by @dobesv, data will arrive and expire all the time. Failing an hour-long query because one file out of millions is missing falls more on the "bug" than the "feature" side of the spectrum.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.