[ https://issues.apache.org/jira/browse/PARQUET-2168?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17577431#comment-17577431 ]
Joy Bestourous commented on PARQUET-2168:
-----------------------------------------
Hello, is there any update on this ticket? Thanks!
> Potential bug in ParquetWriteProtocol
> -------------------------------------
>
> Key: PARQUET-2168
> URL: https://issues.apache.org/jira/browse/PARQUET-2168
> Project: Parquet
> Issue Type: Bug
> Reporter: Joy Bestourous
> Priority: Minor
>
> We found what we think is a bug in ParquetWriteProtocol: instantiating
> StructWriteProtocol fails if the StructType contains an empty child struct.
> Specifically, if the thriftStruct contains an empty struct, the logic in
> ThriftSchemaConvertVisitor drops that element, yielding a MessageType that
> has one fewer field than the original schema. Subsequent logic in
> ParquetWriteProtocol.StructWriteProtocol then tries to populate a `children`
> element by iterating through the thrift struct's children and getting the
> element at the same index from the ColumnIO object:
> {code:java}
> Given: ThriftStruct with 20 fields
> MessageType schema = ThriftSchemaConverter.convertWithoutProjection(thriftStruct)
> -> ThriftSchemaConvertVisitor.convert(StructType struct...)
> -> -> Visitor = new ThriftSchemaConvertVisitor(filter, true, keepOneOfEachUnion), state)
> -> -> ConvertedField = struct.accept(visitor)
> -> -> -> ThriftSchemaConvertVisitor.visit(struct, state)
> -> -> -> -> ConvertedField converted = child.getType().accept(this, childState)
> -> -> -> -> ThriftSchemaConvertVisitor.visit(struct, state) // here we're at the child struct
> {code}
> In here, both hasSentinelUnionColumns and hasNonSentinelUnionColumns default
> to false, and one of them is set to true only when a child element is found.
> Because the empty struct has no children, both stay false, so when we reach
> this step we fall into the Drop() case.
>
> {code:java}
> if (hasNonSentinelUnionColumns) {
>   // user requested some of the fields of this struct, so we keep the struct
>   return new Keep(state.path, new GroupType(state.repetition, state.name, convertedChildren));
> } else {
>   // user requested none of the fields of this struct, so we drop it
>   return new Drop(state.path);
> }
> {code}
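
The drop behaviour described above can be modelled with a small, self-contained Java sketch. It does not use parquet-mr: `Field`, `keeps`, `convertTopLevel` and `sampleStruct` are hypothetical names, and the single `anyChildKept` flag only approximates the flags in the quoted snippet, under the assumption that a struct is kept only when at least one of its children is kept.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Simplified model, NOT parquet-mr code: every name here is hypothetical.
final class EmptyStructDropSketch {

    /** Stand-in for a Thrift field; primitives have isStruct == false. */
    static final class Field {
        final String name;
        final boolean isStruct;
        final List<Field> children;

        Field(String name, boolean isStruct, List<Field> children) {
            this.name = name;
            this.isStruct = isStruct;
            this.children = children;
        }
    }

    /** Assumed rule: a primitive is always kept, a struct only if some child is kept. */
    static boolean keeps(Field field) {
        if (!field.isStruct) {
            return true;
        }
        boolean anyChildKept = false; // never flips for a struct with no children
        for (Field child : field.children) {
            if (keeps(child)) {
                anyChildKept = true;
            }
        }
        return anyChildKept;
    }

    /** Names of the top-level fields that survive the conversion. */
    static List<String> convertTopLevel(List<Field> fields) {
        List<String> kept = new ArrayList<>();
        for (Field field : fields) {
            if (keeps(field)) {
                kept.add(field.name);
            }
        }
        return kept;
    }

    /** Builds count fields named f0, f1, ...; the one at emptyIndex is an empty struct. */
    static List<Field> sampleStruct(int count, int emptyIndex) {
        List<Field> fields = new ArrayList<>();
        for (int i = 0; i < count; i++) {
            boolean isEmptyStruct = (i == emptyIndex);
            fields.add(new Field("f" + i, isEmptyStruct, Collections.<Field>emptyList()));
        }
        return fields;
    }

    public static void main(String[] args) {
        List<Field> thriftFields = sampleStruct(20, 7);
        List<String> converted = convertTopLevel(thriftFields);
        System.out.println("thrift fields: " + thriftFields.size()
            + ", converted fields: " + converted.size());
    }
}
```

In this model, a struct of 20 fields with one empty child struct converts to 19 top-level fields, which matches the count reported in the issue.
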
>
> Because this field is dropped, our MessageType.fieldsList has only 19 entries.
>
> {code:java}
> ColumnIO = new ColumnIOFactory().getColumnIO(MessageType) // again yields a ColumnIO with only 19 fields
> TProtocol = new ParquetWriteProtocol(recordConsumer, columnIo, thriftStruct)
> -> MessageWriteProtocol = new MessageWriteProtocol(ColumnIO schema, StructType thriftType)
> -> -> new StructWriteProtocol(ColumnIO schema, StructType thriftType...)
> for (i = 0 to thriftStruct.children.size) // which is 20
>   schema.getChild(i) // Out of bounds error on index 19
> {code}
> We currently have a workaround for this but would like to get a fix if
> possible.
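
The index mismatch at the end of the trace can be reproduced without parquet-mr. The sketch below is only an illustration under stated assumptions: plain lists of field names stand in for the thrift children and the ColumnIO children, and `ChildLookupSketch`, `names`, `pairByIndex` and `pairByName` are hypothetical helpers. The name-based variant shows one possible way to tolerate a dropped field; it is not presented as the actual or intended fix.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified model, NOT parquet-mr code: every name here is hypothetical.
final class ChildLookupSketch {

    /** Field names f0 .. f(count-1). */
    static List<String> names(int count) {
        List<String> result = new ArrayList<>();
        for (int i = 0; i < count; i++) {
            result.add("f" + i);
        }
        return result;
    }

    /** Index-based pairing: assumes both lists have the same length and order. */
    static List<String> pairByIndex(List<String> thriftChildren, List<String> schemaFields) {
        List<String> paired = new ArrayList<>();
        for (int i = 0; i < thriftChildren.size(); i++) {
            // Throws IndexOutOfBoundsException once i reaches schemaFields.size().
            paired.add(thriftChildren.get(i) + "->" + schemaFields.get(i));
        }
        return paired;
    }

    /** Name-based pairing: thrift children without a matching schema field are skipped. */
    static List<String> pairByName(List<String> thriftChildren, List<String> schemaFields) {
        List<String> paired = new ArrayList<>();
        for (String child : thriftChildren) {
            if (schemaFields.contains(child)) {
                paired.add(child);
            }
        }
        return paired;
    }

    public static void main(String[] args) {
        List<String> thriftChildren = names(20);
        List<String> schemaFields = new ArrayList<>(thriftChildren);
        schemaFields.remove("f7"); // the empty struct dropped during conversion

        try {
            pairByIndex(thriftChildren, schemaFields);
        } catch (IndexOutOfBoundsException e) {
            System.out.println("index-based pairing failed at the last thrift child");
        }
        System.out.println("name-based pairing kept "
            + pairByName(thriftChildren, schemaFields).size() + " fields");
    }
}
```

In this model the index-based loop also pairs every thrift child after the dropped one with the wrong schema field before it finally runs out of bounds, which is why the sketch matches on name rather than only guarding the index.
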