jbewing commented on code in PR #14853:
URL: https://github.com/apache/iceberg/pull/14853#discussion_r2700589034
##########
arrow/src/main/java/org/apache/iceberg/arrow/vectorized/parquet/VectorizedPageIterator.java:
##########
@@ -100,6 +101,14 @@ protected void initDataReader(Encoding dataEncoding, ByteBufferInputStream in, i
case DELTA_BINARY_PACKED:
valuesReader = new VectorizedDeltaEncodedValuesReader();
break;
+ case RLE:
+ if (desc.getPrimitiveType().getPrimitiveTypeName()
+ == PrimitiveType.PrimitiveTypeName.BOOLEAN) {
+ valuesReader =
+ new VectorizedRunLengthEncodedParquetValuesReader(setArrowValidityVector);
+ break;
+ }
+ // fall through
Review Comment:
> given the parquet spec limits what RLEs can be used for to bools,
> Repetition and definition levels & Dictionary indices. Is it likely to occur in
> the wild?

Yeah, theoretically a malformed Parquet writer could produce this in the wild.
That said, such a file wouldn't be to spec: boolean is the only data page value
type that can be RLE encoded, we handle dictionary RLE up in
`VectorizedDictionaryEncodedParquetValuesReader` (directly above here), and the
repetition levels are handled via `VectorizedParquetDefinitionLevelReader`.

All to say, I think this is impossible for a spec-compliant file. If a malformed
writer does in fact write a file with a non-bool RLE data page, it wouldn't be
to spec, so we'd be correctly throwing here. I can add a negative test case for
this, although I'd have to make a corrupt Parquet writer implementation to do
so. Happy to do so if you think it adds value.
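
As a rough illustration of the dispatch described above, here is a minimal standalone sketch. The enums, the `selectReader` method, and the string labels are hypothetical stand-ins, not the actual Iceberg or Parquet classes, and the exception type and message are assumptions about what the fall-through case does.

```java
public class RleDispatchSketch {
  // Hypothetical stand-ins for Parquet's Encoding and PrimitiveTypeName enums.
  enum Encoding { PLAIN, DELTA_BINARY_PACKED, RLE }

  enum PrimitiveTypeName { BOOLEAN, INT32, INT64, BINARY }

  // Returns a label for the reader that would be chosen. A real implementation
  // would construct the corresponding values reader instead of a string.
  static String selectReader(Encoding encoding, PrimitiveTypeName type) {
    switch (encoding) {
      case PLAIN:
        return "plain";
      case DELTA_BINARY_PACKED:
        return "delta-binary-packed";
      case RLE:
        if (type == PrimitiveTypeName.BOOLEAN) {
          return "rle-boolean";
        }
        // fall through: RLE on a non-boolean data page is not to spec
      default:
        // Assumed behavior: reject anything that is not explicitly supported.
        throw new UnsupportedOperationException(
            "Unsupported encoding " + encoding + " for type " + type);
    }
  }

  public static void main(String[] args) {
    System.out.println(selectReader(Encoding.RLE, PrimitiveTypeName.BOOLEAN));
    try {
      selectReader(Encoding.RLE, PrimitiveTypeName.INT32);
    } catch (UnsupportedOperationException e) {
      System.out.println("rejected: " + e.getMessage());
    }
  }
}
```

A negative test along these lines would only need to assert that the non-boolean RLE case throws. Exercising the real reader would still require a deliberately corrupt Parquet file.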
Also FWIW, the [full parquet v2 vectorized impl PR](https://github.com/apache/iceberg/pull/14800)
(that this PR was split out from) has quite a few production PBs of reads under
its belt at this point and hasn't hit anything like this in the wild.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]