0x26res opened a new issue, #3112:
URL: https://github.com/apache/parquet-java/issues/3112

   ### Describe the bug, including details regarding any error messages, version, and platform.
   
   TLDR: the parquet protobuf reader doesn't work for UInt32Value
   
   I have a protobuf message that uses wrapped unsigned and signed integers:
   
   ```
   syntax = "proto3";
   
   package org.apache.parquet.test;
   
   import "google/protobuf/wrappers.proto";
   
   message MyTestMessage {
       google.protobuf.UInt32Value uint32_field = 11;
       google.protobuf.Int32Value int32_field = 12;
   }
   ```
   
   I then generate a parquet file for that data:
   
   ```
   import pyarrow as pa
   import pyarrow.parquet as pq
   
   pq.write_table(
       pa.table(
           {
               "uint32_field": pa.array([None, None, 28], pa.uint32()),
               "int32_field": pa.array([None, 28, 28], pa.int32()),
           }
       ),
       "/tmp/my_test_messages.parquet",
   )
   ```
   
   I then try to read it using parquet-java (in Kotlin, but the language doesn't matter):
   
   ```kotlin
   package org.apache.parquet.test
   
   import org.apache.parquet.test.MyTestMessage
   import com.google.protobuf.Int32Value
   import io.kotest.matchers.shouldBe
   import org.apache.hadoop.fs.Path
   import org.apache.parquet.proto.ProtoConstants
   import org.apache.parquet.proto.ProtoParquetReader
   import org.apache.parquet.proto.ProtoReadSupport
   import org.junit.jupiter.api.Test
   
   class TestUInt32Value {
     @Test
     fun `test can load bad not nested plain`() {
       val reader =
         ProtoParquetReader.builder<MyTestMessage.Builder>(
             Path("file:///tmp/my_test_messages.parquet")
           )
           .set(ProtoReadSupport.PB_CLASS, MyTestMessage::class.java.canonicalName)
           .set(ProtoConstants.CONFIG_IGNORE_UNKNOWN_FIELDS, "true")
           .build()
       val firstMessage = reader.read().build()
       firstMessage shouldBe MyTestMessage.getDefaultInstance()
   
       val secondMessage = reader.read().build()
       secondMessage shouldBe
         MyTestMessage.newBuilder().setInt32Field(Int32Value.of(28)).build()
   
   
       val thirdMessage = reader.read()
     }
   }
   ```
   
   I get this error when reading the third message:
   
   ```
       org.apache.parquet.io.ParquetDecodingException: Can not read value at 3 in block 0 in file file:/tmp/my_test_messages.parquet
           at app//org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:280)
           at app//org.apache.parquet.hadoop.ParquetReader.read(ParquetReader.java:136)
           at app//org.apache.parquet.test.TestUInt32Value.test can load bad not nested plain(TestUInt32Value.kt:29)
           Caused by:
           java.lang.UnsupportedOperationException: org.apache.parquet.proto.ProtoMessageConverter$ProtoUInt32ValueConverter
               at org.apache.parquet.io.api.PrimitiveConverter.addInt(PrimitiveConverter.java:101)
               at org.apache.parquet.column.impl.ColumnReaderBase$2$3.writeValue(ColumnReaderBase.java:321)
               at org.apache.parquet.column.impl.ColumnReaderBase.writeCurrentValueToConverter(ColumnReaderBase.java:486)
               at org.apache.parquet.column.impl.ColumnReaderImpl.writeCurrentValueToConverter(ColumnReaderImpl.java:30)
               at org.apache.parquet.io.RecordReaderImplementation.read(RecordReaderImplementation.java:425)
               at org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:249)
               ... 2 more
   ```
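The trace shows the column reader calling `addInt` on `ProtoUInt32ValueConverter` and hitting the base `PrimitiveConverter.addInt`, which throws. A minimal plain-Python model of that dispatch is sketched below. It assumes (not verified against the parquet-java source) that the UInt32 converter only handles the 64-bit `addLong` path, while the pyarrow-written column is delivered as a 32-bit integer. All class and function names here are hypothetical stand-ins, not parquet-java or pyarrow APIs.

```python
class PrimitiveConverter:
    """Toy stand-in for org.apache.parquet.io.api.PrimitiveConverter."""

    def add_int(self, value):
        # Mirrors the Java base class: any add* method that a subclass
        # does not override fails with the subclass name as the message.
        raise NotImplementedError(type(self).__name__)

    def add_long(self, value):
        raise NotImplementedError(type(self).__name__)


class Int32ValueConverter(PrimitiveConverter):
    """Accepts 32-bit values, like the working Int32Value case."""

    def __init__(self):
        self.values = []

    def add_int(self, value):
        self.values.append(value)


class UInt32ValueConverter(PrimitiveConverter):
    """ASSUMPTION: only the 64-bit path is implemented."""

    def __init__(self):
        self.values = []

    def add_long(self, value):
        self.values.append(value)


def deliver(converter, physical_type, value):
    """Pick the add* method from the column's physical type, as a reader would."""
    if physical_type == "INT32":
        converter.add_int(value)
    elif physical_type == "INT64":
        converter.add_long(value)
    else:
        raise ValueError(physical_type)
```

Under these assumptions, `deliver(Int32ValueConverter(), "INT32", 28)` succeeds, while `deliver(UInt32ValueConverter(), "INT32", 28)` raises with the converter's class name as the message, matching the shape of the error in the trace.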
   
   A few things to note:
   - This works for the second message, which means it is implemented correctly for (signed) Int32Value.
   - It works if you generate the data using the JVM, but only because the resulting parquet table then has a different structure: each wrapped field is a nested struct such as `{"value": 28}`.
   ```kotlin
     @Test
     fun `test jvm round trip`() {
   
       val path = Path("file:///tmp/my_test_messages_jvm.parquet")
   
       ProtoParquetWriter.builder<MyTestMessage>(path)
         .withMessage(MyTestMessage::class.java)
         .build()
         .use {
           it.write(MyTestMessage.getDefaultInstance())
           it.write(MyTestMessage.newBuilder().setInt32Field(Int32Value.of(28)).build())
           it.write(
             MyTestMessage.newBuilder()
               .setInt32Field(Int32Value.of(1))
               .setUint32Field(UInt32Value.of(32))
               .build()
           )
           it.close()
         }
   
       val reader =
         ProtoParquetReader.builder<MyTestMessage.Builder>(path)
           .set(ProtoReadSupport.PB_CLASS, MyTestMessage::class.java.canonicalName)
           .set(ProtoConstants.CONFIG_IGNORE_UNKNOWN_FIELDS, "true")
           .build()
       val firstMessage = reader.read().build()
       firstMessage shouldBe MyTestMessage.getDefaultInstance()
   
       val secondMessage = reader.read().build()
       secondMessage shouldBe
         MyTestMessage.newBuilder().setInt32Field(Int32Value.of(28)).build()
   
       val thirdMessage = reader.read().build()
       thirdMessage shouldBe MyTestMessage.newBuilder()
         .setInt32Field(Int32Value.of(1))
         .setUint32Field(UInt32Value.of(32))
         .build()
     }
   ```
   
   The JVM writer effectively generates a table that looks like this:
   
   ```python
   import pyarrow as pa
   
   pa.table(
       {
           "uint32_field": pa.array(
               [None, None, {"value": 28}], pa.struct([("value", pa.uint32())])
           ),
           "int32_field": pa.array(
               [None, {"value": 28}, {"value": 28}], pa.struct([("value", pa.int32())])
           ),
       }
   )
   
   ```
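The relationship between the flat pyarrow layout and the nested layout above can be sketched in plain Python. `wrap_values` is a hypothetical helper, not part of pyarrow or parquet-java. It only reshapes the column values, and the result would still need to be passed to `pa.array` with a `pa.struct` type as in the snippet above. Whether `ProtoParquetReader` accepts a pyarrow-written file in this layout was not verified here.

```python
def wrap_values(values):
    """Turn a flat nullable column into the nested {"value": ...} layout."""
    return [None if v is None else {"value": v} for v in values]


# The flat columns from the failing example.
flat_columns = {
    "uint32_field": [None, None, 28],
    "int32_field": [None, 28, 28],
}

# Same data, reshaped so each non-null entry is a one-field struct.
nested_columns = {name: wrap_values(col) for name, col in flat_columns.items()}
```

Null entries stay null, rather than becoming a struct with a null `value`, which matches the nested table shown above.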
   
   
   ### Component(s)
   
   Protobuf



