0x26res opened a new issue, #3112:
URL: https://github.com/apache/parquet-java/issues/3112
### Describe the bug, including details regarding any error messages, version, and platform.
TL;DR: the parquet-protobuf reader doesn't work for `UInt32Value` when the Parquet column is a plain (non-nested) `uint32`.
I have a protobuf message using wrapped unsigned and signed integers:
```protobuf
syntax = "proto3";

package org.apache.parquet.test;

import "google/protobuf/wrappers.proto";

message MyTestMessage {
  google.protobuf.UInt32Value uint32_field = 11;
  google.protobuf.Int32Value int32_field = 12;
}
```
I then generate a parquet file for that data:
```python
import pyarrow as pa
import pyarrow.parquet as pq

pq.write_table(
    pa.table(
        {
            "uint32_field": pa.array([None, None, 28], pa.uint32()),
            "int32_field": pa.array([None, 28, 28], pa.int32()),
        }
    ),
    "/tmp/my_test_messages.parquet",
)
```
I then try to read it using parquet-java (in Kotlin, but the language doesn't matter):
```kotlin
package org.apache.parquet.test

import org.apache.parquet.test.MyTestMessage
import com.google.protobuf.Int32Value
import io.kotest.matchers.shouldBe
import org.apache.hadoop.fs.Path
import org.apache.parquet.proto.ProtoConstants
import org.apache.parquet.proto.ProtoParquetReader
import org.apache.parquet.proto.ProtoReadSupport
import org.junit.jupiter.api.Test

class TestUInt32Value {
    @Test
    fun `test can load bad not nested plain`() {
        val reader =
            ProtoParquetReader.builder<MyTestMessage.Builder>(
                Path("file:///tmp/my_test_messages.parquet")
            )
                .set(ProtoReadSupport.PB_CLASS, MyTestMessage::class.java.canonicalName)
                .set(ProtoConstants.CONFIG_IGNORE_UNKNOWN_FIELDS, "true")
                .build()

        // Row 1: both fields are null.
        val firstMessage = reader.read().build()
        firstMessage shouldBe MyTestMessage.getDefaultInstance()

        // Row 2: only int32_field is set.
        val secondMessage = reader.read().build()
        secondMessage shouldBe
            MyTestMessage.newBuilder().setInt32Field(Int32Value.of(28)).build()

        // Row 3: uint32_field is set; this read throws.
        val thirdMessage = reader.read()
    }
}
```
I get this error when reading the third message:
```
org.apache.parquet.io.ParquetDecodingException: Can not read value at 3 in block 0 in file file:/tmp/my_test_messages.parquet
    at app//org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:280)
    at app//org.apache.parquet.hadoop.ParquetReader.read(ParquetReader.java:136)
    at app//org.apache.parquet.test.TestUInt32Value.test can load bad not nested plain(TestUInt32Value.kt:29)
Caused by: java.lang.UnsupportedOperationException: org.apache.parquet.proto.ProtoMessageConverter$ProtoUInt32ValueConverter
    at org.apache.parquet.io.api.PrimitiveConverter.addInt(PrimitiveConverter.java:101)
    at org.apache.parquet.column.impl.ColumnReaderBase$2$3.writeValue(ColumnReaderBase.java:321)
    at org.apache.parquet.column.impl.ColumnReaderBase.writeCurrentValueToConverter(ColumnReaderBase.java:486)
    at org.apache.parquet.column.impl.ColumnReaderImpl.writeCurrentValueToConverter(ColumnReaderImpl.java:30)
    at org.apache.parquet.io.RecordReaderImplementation.read(RecordReaderImplementation.java:425)
    at org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:249)
    ... 2 more
```
A few things to note:
- Reading works for the second message, which means the flat layout is implemented correctly for (signed) `Int32Value`.
- It works if you generate the data using the JVM, but only because the resulting Parquet table has a different structure: each wrapper message is written as a nested struct (`{"value": 28}`).
```kotlin
// Method of the same test class; additionally requires:
//   import com.google.protobuf.UInt32Value
//   import org.apache.parquet.proto.ProtoParquetWriter
@Test
fun `test jvm round trip`() {
    val path = Path("file:///tmp/my_test_messages_jvm.parquet")
    ProtoParquetWriter.builder<MyTestMessage>(path)
        .withMessage(MyTestMessage::class.java)
        .build()
        .use {
            it.write(MyTestMessage.getDefaultInstance())
            it.write(MyTestMessage.newBuilder().setInt32Field(Int32Value.of(28)).build())
            it.write(
                MyTestMessage.newBuilder()
                    .setInt32Field(Int32Value.of(1))
                    .setUint32Field(UInt32Value.of(32))
                    .build()
            )
            // No explicit close() needed: `use` closes the writer.
        }

    val reader =
        ProtoParquetReader.builder<MyTestMessage.Builder>(path)
            .set(ProtoReadSupport.PB_CLASS, MyTestMessage::class.java.canonicalName)
            .set(ProtoConstants.CONFIG_IGNORE_UNKNOWN_FIELDS, "true")
            .build()

    val firstMessage = reader.read().build()
    firstMessage shouldBe MyTestMessage.getDefaultInstance()

    val secondMessage = reader.read().build()
    secondMessage shouldBe
        MyTestMessage.newBuilder().setInt32Field(Int32Value.of(28)).build()

    val thirdMessage = reader.read().build()
    thirdMessage shouldBe MyTestMessage.newBuilder()
        .setInt32Field(Int32Value.of(1))
        .setUint32Field(UInt32Value.of(32))
        .build()
}
```
The JVM writer is basically generating a table that looks like this:
```python
import pyarrow as pa

pa.table(
    {
        "uint32_field": pa.array(
            [None, None, {"value": 28}], pa.struct([("value", pa.uint32())])
        ),
        "int32_field": pa.array(
            [None, {"value": 28}, {"value": 28}], pa.struct([("value", pa.int32())])
        ),
    }
)
```
### Component(s)
Protobuf