Re: [PR] DRILL-8507, DRILL-8508 Better handling of partially missing parquet columns (drill)

via GitHub Thu, 29 Aug 2024 14:02:11 -0700


paul-rogers commented on code in PR #2937:
URL: https://github.com/apache/drill/pull/2937#discussion_r1737180959



##########
exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/columnreaders/ParquetSchema.java:
##########
@@ -227,12 +251,14 @@ public void createNonExistentColumns(OutputMutator 
output, List<NullableIntVecto
    * @throws SchemaChangeException should not occur
    */
 
-  private NullableIntVector createMissingColumn(SchemaPath col, OutputMutator 
output) throws SchemaChangeException {
-    // col.toExpr() is used here as field name since we don't want to see 
these fields in the existing maps
-    MaterializedField field = MaterializedField.create(col.toExpr(),
-                                                    
Types.optional(TypeProtos.MinorType.INT));
-    return (NullableIntVector) output.addField(field,
-              TypeHelper.getValueVectorClass(TypeProtos.MinorType.INT, 
DataMode.OPTIONAL));
+  private ValueVector createMissingColumn(SchemaPath col, OutputMutator 
output) throws SchemaChangeException {
+    String colName = col.getAsUnescapedPath();
+    MaterializedField tableField = tableSchema.column(colName);
+    TypeProtos.MinorType type = tableField == null ? TypeProtos.MinorType.INT 
: tableField.getType().getMinorType();
+    MaterializedField field = MaterializedField.create(colName,

Review Comment:
   This is a good change: we are propagating the old column type. This is 
consistent with EVF.
   
   However, this stuff is horribly complex. If we reuse the column, we **must** 
reuse the actual value vector. Otherwise, you'll get crashes in the downstream 
operators that are bound to that vector. The binding is redone only on a schema 
change. But, your fix avoids the schema change, and hence prevents the 
rebinding.
   
   Also, note that this fix works ONLY in one direction (column appears, then 
disappears), and ONLY within a single thread: it can't solve the same problem 
if the two files are read in different threads and sent to the SORT to 
reconcile.
   
   Further, we are changing the mode to `OPTIONAL` as required so we can fill 
the vector with `NULL` values. However, change of mode (i.e. nullability) 
**is** a schema change and will cause the SORT to fail. We have to have known, 
on the previous file, that the column will be missing in this file, so that we 
can create the original column as `OPTIONAL`.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Re: [PR] DRILL-8507, DRILL-8508 Better handling of partially missing parquet columns (drill)

Reply via email to