Joy Bestourous created PARQUET-2168:
---------------------------------------
Summary: Potential bug in ParquetWriteProtocol
Key: PARQUET-2168
URL: https://issues.apache.org/jira/browse/PARQUET-2168
Project: Parquet
Issue Type: Bug
Reporter: Joy Bestourous
We found what we think is a bug in ParquetWriteProtocol: instantiating
StructWriteProtocol fails if the StructType contains an empty child struct.
Specifically, if the thriftStruct contains an empty struct, logic in
ThriftSchemaConvertVisitor drops that element, yielding a MessageType that has
one fewer field than the original schema. Subsequent logic in
ParquetWriteProtocol.StructWriteProtocol tries to populate its `children`
field by iterating through the thrift struct's children and looking up the
element at the same index in the ColumnIO object:
{code:java}
Given: ThriftStruct with 20 fields
MessageType schema = ThriftSchemaConverter.convertWithoutProjection(thriftStruct)
-> ThriftSchemaConvertVisitor.convert(StructType struct...)
-> -> visitor = new ThriftSchemaConvertVisitor(filter, true, keepOneOfEachUnion)
-> -> ConvertedField = struct.accept(visitor, state)
-> -> -> ThriftSchemaConvertVisitor.visit(struct, state)
-> -> -> -> ConvertedField converted = child.getType().accept(this, childState)
-> -> -> -> ThriftSchemaConvertVisitor.visit(struct, state) // here we’re at the child struct
{code}
Inside this visit, hasSentinelUnionColumns and hasNonSentinelUnionColumns both
default to false, and one of them is set to true only when a child element is
found. Because the empty struct has no children, both flags stay false, so when
we reach this step we fall into the Drop case:
{code:java}
if (hasNonSentinelUnionColumns) {
  // user requested some of the fields of this struct, so we keep the struct
  return new Keep(state.path, new GroupType(state.repetition, state.name, convertedChildren));
} else {
  // user requested none of the fields of this struct, so we drop it
  return new Drop(state.path);
}
{code}
Because this field is dropped, our MessageType ends up with 19 fields instead
of 20.
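To make the effect concrete, here is a minimal, self-contained sketch of the
drop behaviour. It is a toy model, not the parquet-thrift code: ToyType,
convert and structWithOneEmptyChild are hypothetical names, and returning null
stands in for the Drop result.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of the conversion step; these are NOT the parquet-thrift classes. */
public class EmptyStructDropSketch {

    /** Hypothetical thrift-like type: children is null for a primitive, a list for a struct. */
    static final class ToyType {
        final String name;
        final List<ToyType> children;

        ToyType(String name, List<ToyType> children) {
            this.name = name;
            this.children = children;
        }

        static ToyType primitive(String name) {
            return new ToyType(name, null);
        }

        static ToyType struct(String name, List<ToyType> children) {
            return new ToyType(name, children);
        }

        boolean isStruct() {
            return children != null;
        }
    }

    /** Converts a type; a null result models Drop (assumption of this sketch). */
    static ToyType convert(ToyType type) {
        if (!type.isStruct()) {
            return type; // primitives are always kept in this model
        }
        boolean hasKeptChildren = false; // stands in for hasNonSentinelUnionColumns
        List<ToyType> converted = new ArrayList<>();
        for (ToyType child : type.children) {
            ToyType c = convert(child);
            if (c != null) {
                converted.add(c);
                hasKeptChildren = true;
            }
        }
        // A struct with no kept children is dropped, so an empty struct always is.
        return hasKeptChildren ? ToyType.struct(type.name, converted) : null;
    }

    /** Builds a root struct whose last field is an empty struct. */
    static ToyType structWithOneEmptyChild(int fieldCount) {
        List<ToyType> fields = new ArrayList<>();
        for (int i = 0; i < fieldCount - 1; i++) {
            fields.add(ToyType.primitive("f" + i));
        }
        fields.add(ToyType.struct("empty", new ArrayList<>()));
        return ToyType.struct("root", fields);
    }

    public static void main(String[] args) {
        ToyType root = structWithOneEmptyChild(20);
        ToyType schema = convert(root);
        // prints "20 thrift fields, 19 converted fields"
        System.out.println(root.children.size() + " thrift fields, "
                + schema.children.size() + " converted fields");
    }
}
```

In this model the thrift-side type still reports 20 children while the
converted schema has 19, which is the mismatch described in this report.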
{code:java}
ColumnIO = new ColumnIOFactory().getColumnIO(MessageType) // again yields a ColumnIO with only 19 fields
TProtocol = new ParquetWriteProtocol(recordConsumer, columnIo, thriftStruct)
-> MessageWriteProtocol = new MessageWriteProtocol(ColumnIO schema, StructType thriftType)
-> -> new StructWriteProtocol(ColumnIO schema, StructType thriftType...)
        for (i = 0 to thriftStruct.children.size) // which is 20
            schema.getChild(i) // out-of-bounds error on index 19
{code}
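The failing loop can be imitated with plain lists. This is only a sketch under
stated assumptions: ChildIndexMismatchSketch, firstFailingIndex and fieldNames
are hypothetical names, and List.get stands in for schema.getChild(i). It does
not call any Parquet code.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy reproduction of the index mismatch; these are NOT the Parquet classes. */
public class ChildIndexMismatchSketch {

    /**
     * Does one positional lookup in the schema per thrift child, as in the trace.
     * Returns the first index whose lookup fails, or -1 if every lookup succeeds.
     */
    static int firstFailingIndex(List<String> thriftChildren, List<String> schemaChildren) {
        for (int i = 0; i < thriftChildren.size(); i++) {
            try {
                schemaChildren.get(i); // stands in for schema.getChild(i)
            } catch (IndexOutOfBoundsException e) {
                return i;
            }
        }
        return -1;
    }

    /** Builds placeholder field names f0..f(count-1). */
    static List<String> fieldNames(int count) {
        List<String> names = new ArrayList<>();
        for (int i = 0; i < count; i++) {
            names.add("f" + i);
        }
        return names;
    }

    public static void main(String[] args) {
        List<String> thriftChildren = fieldNames(20); // thrift struct still has 20 children
        List<String> schemaChildren = fieldNames(19); // converted schema lost the empty struct
        // prints "first failing index: 19"
        System.out.println("first failing index: "
                + firstFailingIndex(thriftChildren, schemaChildren));
    }
}
```

With equal-sized lists the helper returns -1, so the failure in this model
comes only from the field-count mismatch.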
We currently have a workaround for this but would like to get a fix if possible.