[ 
https://issues.apache.org/jira/browse/PARQUET-2168?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17577431#comment-17577431
 ] 

Joy Bestourous commented on PARQUET-2168:
-----------------------------------------

Hello, is there any update on this ticket? Thanks!

> Potential bug in ParquetWriteProtocol
> -------------------------------------
>
>                 Key: PARQUET-2168
>                 URL: https://issues.apache.org/jira/browse/PARQUET-2168
>             Project: Parquet
>          Issue Type: Bug
>            Reporter: Joy Bestourous
>            Priority: Minor
>
> We found what we think is a bug in ParquetWriteProtocol: instantiating 
> StructWriteProtocol fails if the StructType contains an empty child struct.
> Specifically, if the thriftStruct contains an empty struct, logic in 
> ThriftSchemaConvertVisitor drops that element, yielding a MessageType with 
> one fewer field than the original schema. Subsequent logic in 
> ParquetWriteProtocol.StructWriteProtocol tries to populate `children` by 
> iterating through the thrift struct's children and fetching the element at 
> the same index from the ColumnIO object:
> {code:java}
> Given: ThriftStruct with 20 fields
> MessageType schema = 
> ThriftSchemaConverter.convertWithoutProjection(thriftStruct)
> -> ThriftSchemaConvertVisitor.convert(StructType struct...)
> -> -> Visitor = new ThriftSchemaConvertVisitor(filter, true, 
> keepOneOfEachUnion)
> -> -> ConvertedField = struct.accept(visitor, state)
> -> -> -> ThriftSchemaConvertVisitor.visit(struct, state)
> -> -> -> -> ConvertedField converted = child.getType().accept(this, 
> childState)
> -> -> -> -> ThriftSchemaConvertVisitor.visit(struct, state) // here we’re at 
> the child struct{code}
>  Here, hasSentinelUnionColumns and hasNonSentinelUnionColumns both default to 
> false, and one of them is set to true only when a child element is found. 
> Because the empty struct has no children, both stay false, so at this step we 
> fall into the Drop() case.
>  
> {code:java}
>     if (hasNonSentinelUnionColumns) {
>       // user requested some of the fields of this struct, so we keep the struct
>       return new Keep(state.path,
>           new GroupType(state.repetition, state.name, convertedChildren));
>     } else {
>       // user requested none of the fields of this struct, so we drop it
>       return new Drop(state.path);
>     }{code}
>  
> Because this field is dropped, our MessageType ends up with only 19 fields.
>  
> {code:java}
> // again yields a ColumnIO with only 19 fields
> ColumnIO = new ColumnIOFactory().getColumnIO(MessageType)
> TProtocol = new ParquetWriteProtocol(recordConsumer, columnIo, thriftStruct)
> -> MessageWriteProtocol = new MessageWriteProtocol(ColumnIO schema, 
> StructType thriftType)
> -> -> new StructWriteProtocol(ColumnIO schema, StructType thriftType...)
> for (i = 0 to thriftStruct.children.size) // which is 20
>  schema.getChild(i) // Out of bounds error on index 19{code}
> We currently have a workaround for this but would like to get a fix if 
> possible.
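
For anyone trying to picture the mismatch described above, here is a minimal, self-contained sketch of it. It does not use the real parquet-thrift classes: `Field`, `convert`, `pairByIndex` and `pairByName` are hypothetical stand-ins, and the name-based lookup is only one possible direction, not the reporter's workaround or an actual fix.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-ins only: none of these types are the real parquet-thrift classes.
final class EmptyStructMismatch {

    // Simplified Thrift field: a name, a struct flag, and child fields.
    static final class Field {
        final String name;
        final boolean isStruct;
        final List<Field> children;

        Field(String name, boolean isStruct, List<Field> children) {
            this.name = name;
            this.isStruct = isStruct;
            this.children = children;
        }

        static Field leaf(String name) {
            return new Field(name, false, List.of());
        }

        static Field struct(String name, Field... children) {
            return new Field(name, true, List.of(children));
        }
    }

    // Mimics the conversion described in the report: a struct whose
    // conversion yields no children is dropped from the resulting schema.
    static List<String> convert(List<Field> thriftChildren) {
        List<String> schemaFields = new ArrayList<>();
        for (Field f : thriftChildren) {
            boolean keep = !f.isStruct || !convert(f.children).isEmpty();
            if (keep) {
                schemaFields.add(f.name);
            }
        }
        return schemaFields;
    }

    // Index-based pairing, as in the reported loop: it walks the thrift
    // children and reads the schema at the same index, so it throws once
    // the index reaches the size of the (shorter) schema.
    static void pairByIndex(List<Field> thriftChildren, List<String> schemaFields) {
        for (int i = 0; i < thriftChildren.size(); i++) {
            schemaFields.get(i);
        }
    }

    // Name-based pairing (an assumption, not the library's behaviour):
    // thrift fields with no matching schema column are simply skipped.
    static Map<String, Integer> pairByName(List<Field> thriftChildren, List<String> schemaFields) {
        Map<String, Integer> columnIndexByField = new LinkedHashMap<>();
        for (Field f : thriftChildren) {
            int idx = schemaFields.indexOf(f.name);
            if (idx >= 0) {
                columnIndexByField.put(f.name, idx);
            }
        }
        return columnIndexByField;
    }

    public static void main(String[] args) {
        // Three thrift fields, the middle one being an empty struct.
        List<Field> fields = List.of(Field.leaf("a"), Field.struct("e"), Field.leaf("b"));
        List<String> schema = convert(fields);
        System.out.println("schema fields: " + schema);
        try {
            pairByIndex(fields, schema);
        } catch (IndexOutOfBoundsException ex) {
            System.out.println("index-based pairing failed: " + ex.getMessage());
        }
        System.out.println("name-based pairing: " + pairByName(fields, schema));
    }
}
```

Note that in this sketch the index-based loop also reads the wrong column for every field after the dropped struct before it finally runs out of bounds, which may be worth checking in the real code as well.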


