Joy Bestourous created PARQUET-2168:
---------------------------------------

             Summary: Potential bug in ParquetWriteProtocol
                 Key: PARQUET-2168
                 URL: https://issues.apache.org/jira/browse/PARQUET-2168
             Project: Parquet
          Issue Type: Bug
            Reporter: Joy Bestourous


We found what we think is a bug in ParquetWriteProtocol: it fails on 
instantiation of StructWriteProtocol if the StructType contains an empty child 
struct.

Specifically, if the thriftStruct passed to ParquetWriteProtocol contains an 
empty struct, logic in ThriftSchemaConvertVisitor drops that element, yielding 
a MessageType that has one fewer field than the original schema. Subsequent 
logic in ParquetWriteProtocol.StructWriteProtocol tries to populate a 
`children` element by iterating through the thrift struct's children and 
fetching the corresponding element from the ColumnIO object by index.
{code:java}
Given: ThriftStruct with 20 fields
MessageType schema = ThriftSchemaConverter.convertWithoutProjection(thriftStruct)
-> ThriftSchemaConvertVisitor.convert(StructType struct...)
-> -> visitor = new ThriftSchemaConvertVisitor(filter, true, keepOneOfEachUnion)
-> -> ConvertedField = struct.accept(visitor, state)
-> -> -> ThriftSchemaConvertVisitor.visit(struct, state)
-> -> -> -> ConvertedField converted = child.getType().accept(this, childState)
-> -> -> -> ThriftSchemaConvertVisitor.visit(struct, state) // here we’re at the child struct
{code}
Inside this visit, hasSentinelUnionColumns and hasNonSentinelUnionColumns both 
default to false, and one of them is set to true only while iterating over the 
struct's child elements. An empty struct has no children, so both flags stay 
false and, when we reach the following step, we fall into the Drop() case.

 
{code:java}
if (hasNonSentinelUnionColumns) {
  // user requested some of the fields of this struct, so we keep the struct
  return new Keep(state.path, new GroupType(state.repetition, state.name, convertedChildren));
} else {
  // user requested none of the fields of this struct, so we drop it
  return new Drop(state.path);
}
{code}
 

Because this field is dropped, the size of our MessageType.fieldsList is 19 
rather than 20.
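
The drop decision described above can be modeled in isolation. The following is a minimal sketch, not the parquet-mr code: `EmptyStructDropSketch`, `Field`, `isKept` and `convertedFieldNames` are hypothetical names, and the single `hasKeptChild` flag merely stands in for the role hasNonSentinelUnionColumns plays in the snippet above. It assumes no projection filter, so every primitive field is kept.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Simplified stand-in for the keep/drop decision; these are NOT parquet-mr classes.
final class EmptyStructDropSketch {

    // Hypothetical miniature of a Thrift field: either a primitive or a struct with children.
    static final class Field {
        final String name;
        final boolean isStruct;
        final List<Field> children;

        Field(String name, boolean isStruct, List<Field> children) {
            this.name = name;
            this.isStruct = isStruct;
            this.children = children;
        }

        static Field primitive(String name) {
            return new Field(name, false, Collections.<Field>emptyList());
        }

        static Field struct(String name, Field... children) {
            return new Field(name, true, Arrays.asList(children));
        }
    }

    // Returns true if the field survives conversion in this model.
    static boolean isKept(Field field) {
        if (!field.isStruct) {
            return true; // assumption: no projection, so primitives are always kept
        }
        boolean hasKeptChild = false; // starts false, like the flags in the report
        for (Field child : field.children) {
            if (isKept(child)) {
                hasKeptChild = true; // only reachable when the struct has children
            }
        }
        return hasKeptChild; // an empty struct never flips the flag, so it is dropped
    }

    // Names of the root's children that remain after conversion.
    static List<String> convertedFieldNames(Field root) {
        List<String> kept = new ArrayList<>();
        for (Field child : root.children) {
            if (isKept(child)) {
                kept.add(child.name);
            }
        }
        return kept;
    }
}
```

In this model, a root struct with a primitive `id` and an empty struct `empty` converts to a schema containing only `id`, which is the one-field shortfall described in this report.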

 
{code:java}
ColumnIO = new ColumnIOFactory().getColumnIO(MessageType) // again yields a ColumnIO with only 19 fields
TProtocol = new ParquetWriteProtocol(recordConsumer, columnIo, thriftStruct)
-> MessageWriteProtocol = new MessageWriteProtocol(ColumnIO schema, StructType thriftType)
-> -> new StructWriteProtocol(ColumnIO schema, StructType thriftType...)

for (i = 0 to thriftStruct.children.size) // which is 20
  schema.getChild(i) // out-of-bounds error on index 19
{code}
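
The size mismatch can be illustrated without Parquet at all. The sketch below is an assumption-laden model, not the library's code: `ChildLookupSketch`, `pairByIndex` and `pairByName` are hypothetical names, and plain lists of field names stand in for the thrift children and the ColumnIO children. `pairByIndex` mirrors the positional loop described above; `pairByName` shows one possible guard (matching by name and recording dropped fields as null), which is not necessarily the fix the project would choose.

```java
import java.util.ArrayList;
import java.util.List;

// Model of the positional lookup failure; plain lists stand in for the real schema objects.
final class ChildLookupSketch {

    // Mirrors the failing pattern: the loop bound comes from the thrift children,
    // but the same index is used against the (shorter) converted schema.
    static List<String> pairByIndex(List<String> thriftChildren, List<String> schemaFields) {
        List<String> paired = new ArrayList<>();
        for (int i = 0; i < thriftChildren.size(); i++) {
            // Throws IndexOutOfBoundsException once i reaches schemaFields.size().
            paired.add(schemaFields.get(i));
        }
        return paired;
    }

    // Hypothetical guard: match by name and record null for fields the converter dropped.
    static List<String> pairByName(List<String> thriftChildren, List<String> schemaFields) {
        List<String> paired = new ArrayList<>();
        for (String name : thriftChildren) {
            paired.add(schemaFields.contains(name) ? name : null);
        }
        return paired;
    }
}
```

With 20 thrift field names and a schema list missing one of them, `pairByIndex` throws on index 19, while `pairByName` returns 20 entries with a null in the dropped field's position.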

We currently have a workaround for this but would like to get a fix if possible.



