[ https://issues.apache.org/jira/browse/PARQUET-2241?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17686938#comment-17686938 ]
Gang Wu commented on PARQUET-2241:
----------------------------------

It seems that the *ByteStreamSplitValuesReader* in parquet-mr computes the total stream length directly from *page.num_values*, which includes null values. It then throws if it cannot read that many bytes from the page buffer. Since the page buffer holds only the non-null values, it is certain to throw whenever the page contains a null.

https://github.com/apache/parquet-mr/blob/5608695f5777de1eb0899d9075ec9411cfdf31d3/parquet-column/src/main/java/org/apache/parquet/column/values/bytestreamsplit/ByteStreamSplitValuesReader.java#L69

> ByteStreamSplitDecoder broken in presence of nulls
> --------------------------------------------------
>
>                 Key: PARQUET-2241
>                 URL: https://issues.apache.org/jira/browse/PARQUET-2241
>             Project: Parquet
>          Issue Type: Bug
>          Components: parquet-format
>    Affects Versions: format-2.8.0
>            Reporter: Xuwei Fu
>            Priority: Major
>             Fix For: format-2.10.0
>
>
> This problem is shown in this issue:
> https://github.com/apache/arrow/issues/15173
> Let me describe it briefly:
> * The encoder doesn't write "num_values" into the page payload for
> BYTE_STREAM_SPLIT, but it uses "num_values" as the stride in
> BYTE_STREAM_SPLIT.
> * When decoding a DATA_PAGE_V2, the reader can know the num_values and
> num_nulls in the page. In a DATA_PAGE_V1, however, without statistics, we
> have to read the def-levels and rep-levels to get the real num-of-values.
> Without the num-of-values, we aren't able to decode BYTE_STREAM_SPLIT
> correctly.
>
> The bug-reproducing code is in the issue.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
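
For illustration, the failure mode described in the comment above can be modeled in a few lines of Python. This is a simplified sketch, not the parquet-mr code: the helpers `bss_encode` and `bss_decode` are hypothetical names, the value type is assumed to be a 4-byte FLOAT, and the length check only imitates the "not enough bytes" error the reader raises. It shows that the stride between byte streams has to be the number of non-null values, not the page's num_values.

```python
import struct

WIDTH = 4  # bytes per value; assumes a 4-byte FLOAT column


def bss_encode(values):
    """Encode non-null floats: stream k holds byte k of every value."""
    raw = [struct.pack("<f", v) for v in values]
    return b"".join(bytes(r[k] for r in raw) for k in range(WIDTH))


def bss_decode(buf, count):
    """Decode `count` floats, using `count` as the stride between streams."""
    if count * WIDTH > len(buf):
        # Imitates the reader failing to read enough bytes from the page.
        raise ValueError(f"need {count * WIDTH} bytes, page has {len(buf)}")
    return [
        struct.unpack("<f", bytes(buf[k * count + i] for k in range(WIDTH)))[0]
        for i in range(count)
    ]


column = [1.5, None, 2.25, -3.0]           # one null in the page
non_null = [v for v in column if v is not None]
page = bss_encode(non_null)                # only non-null values are written

num_values = len(column)                   # header count, includes the null
num_non_null = len(non_null)               # what the def-levels would yield

decoded = bss_decode(page, num_non_null)   # correct stride: round-trips

try:
    bss_decode(page, num_values)           # stride includes the null: fails
    failed = False
except ValueError:
    failed = True
```

With the non-null count the three values round-trip. With num_values the decoder asks for 16 bytes from a 12-byte payload and raises, which matches the behavior reported for any page that contains a null.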