[ 
https://issues.apache.org/jira/browse/PARQUET-2241?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17687450#comment-17687450
 ] 

ASF GitHub Bot commented on PARQUET-2241:
-----------------------------------------

wgtmac opened a new pull request, #1025:
URL: https://github.com/apache/parquet-mr/pull/1025

   `ByteStreamSplitValuesReader` computes the total stream length from 
page.num_values, which includes null values. It then throws if it cannot read 
that many bytes from the page buffer. This is guaranteed to happen when the page 
contains null values.
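
   The size mismatch can be illustrated with a minimal sketch. This is not 
parquet-mr code: the class name, the method name, and the page counts below are 
hypothetical and chosen only to show the arithmetic.

```java
final class ByteStreamSplitLengthCheck {
    /** Bytes a reader would expect if it sized the streams from a value count. */
    static long expectedBytes(int valueCount, int elementSizeInBytes) {
        return (long) valueCount * elementSizeInBytes;
    }

    public static void main(String[] args) {
        int pageNumValues = 10;          // hypothetical page.num_values, nulls included
        int nullCount = 3;               // hypothetical number of nulls in the page
        int elementSize = Float.BYTES;   // 4 bytes per FLOAT value

        // Only non-null values are encoded, so this is what the page buffer holds.
        long available = expectedBytes(pageNumValues - nullCount, elementSize);
        // This is what a reader sized from page.num_values would try to read.
        long requested = expectedBytes(pageNumValues, elementSize);

        System.out.println("available=" + available + " requested=" + requested);
    }
}
```

   With any non-zero null count the requested size exceeds the available size, 
so the read fails.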
   
   ### Jira
   
   [PARQUET-2241](https://issues.apache.org/jira/browse/PARQUET-2241)
   
   ### Tests
   
   Add test `org.apache.parquet.avro.TestByteStreamSplitE2E` to write and read 
floating-point values with BYTE_STREAM_SPLIT encoding.
   
   ### Commits
   
   `ByteStreamSplitValuesReader` relies strictly on the remaining stream length 
to determine the actual number of encoded values before decoding.
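
   A minimal sketch of that idea, assuming 4-byte FLOAT values and a payload 
that contains only the BYTE_STREAM_SPLIT streams. `ByteStreamSplitSketch` and 
its methods are hypothetical illustrations, not the classes changed in this 
pull request.

```java
final class ByteStreamSplitSketch {
    /** Encodes floats by scattering little-endian byte k of every value into stream k. */
    static byte[] encode(float[] values) {
        int n = values.length;
        byte[] out = new byte[n * Float.BYTES];
        for (int i = 0; i < n; i++) {
            int bits = Float.floatToRawIntBits(values[i]);
            for (int k = 0; k < Float.BYTES; k++) {
                out[k * n + i] = (byte) (bits >>> (8 * k));
            }
        }
        return out;
    }

    /** Decodes floats, deriving the value count from the payload length alone. */
    static float[] decode(byte[] payload) {
        if (payload.length % Float.BYTES != 0) {
            throw new IllegalArgumentException(
                "payload length is not a multiple of " + Float.BYTES);
        }
        // The stride between streams is the number of encoded (non-null) values.
        int n = payload.length / Float.BYTES;
        float[] values = new float[n];
        for (int i = 0; i < n; i++) {
            int bits = 0;
            for (int k = 0; k < Float.BYTES; k++) {
                bits |= (payload[k * n + i] & 0xFF) << (8 * k);
            }
            values[i] = Float.intBitsToFloat(bits);
        }
        return values;
    }
}
```

   Because the count comes from the payload length, the sketch never consults a 
page-level value count that may include nulls.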




> ByteStreamSplitDecoder broken in presence of nulls
> --------------------------------------------------
>
>                 Key: PARQUET-2241
>                 URL: https://issues.apache.org/jira/browse/PARQUET-2241
>             Project: Parquet
>          Issue Type: Bug
>          Components: parquet-format, parquet-mr
>    Affects Versions: format-2.8.0
>            Reporter: Xuwei Fu
>            Assignee: Gang Wu
>            Priority: Major
>
>  
> This problem is shown in this issue: 
> https://github.com/apache/arrow/issues/15173
> Let me describe it briefly:
> * The encoder doesn't write "num_values" to the page payload for 
> BYTE_STREAM_SPLIT, but it uses "num_values" as the stride in BYTE_STREAM_SPLIT.
> * When decoding a DATA_PAGE_V2, the reader can know the num_values and 
> num_nulls in the page. In a DATA_PAGE_V1, however, without statistics, we have 
> to read the def-levels and rep-levels to get the real num-of-values. Without 
> the num-of-values, we aren't able to decode BYTE_STREAM_SPLIT correctly.
>  
> The bug-reproducing code is in the issue.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
