[jira] [Commented] (PARQUET-1622) Add BYTE_STREAM_SPLIT encoding

2023-01-17 Thread Gang Wu (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-1622?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17678056#comment-17678056
 ] 

Gang Wu commented on PARQUET-1622:
--

The issue raised by [~mwish] above may also exist in parquet-mr.

cc [~xinlishang] [~gershinsky]

> Add BYTE_STREAM_SPLIT encoding
> --
>
> Key: PARQUET-1622
> URL: https://issues.apache.org/jira/browse/PARQUET-1622
> Project: Parquet
>  Issue Type: New Feature
>  Components: parquet-cpp, parquet-format, parquet-mr, parquet-thrift
> Reporter: Martin Radev
> Assignee: Martin Radev
> Priority: Minor
>  Labels: features, pull-request-available
> Fix For: 1.12.0, format-2.8.0
>
>   Original Estimate: 48h
>  Remaining Estimate: 48h
>
> Apache Parquet does not have any encodings suitable for floating-point (FP)
> data, and the available text compressors (zstd, gzip, etc.) do not handle FP
> data very well.
> It is possible to apply a simple data transformation named "stream
> splitting". One variant is "byte stream splitting", which creates K streams of
> length N, where K is the number of bytes in the data type (4 for floats, 8 for
> doubles) and N is the number of elements in the sequence.
> The transformed data compresses significantly better on average than the
> original data, and in some cases compression and decompression are also
> faster.
> You can read a more detailed report here:
> https://drive.google.com/file/d/1wfLQyO2G5nofYFkS7pVbUW0-oJkQqBvv/view
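For illustration, the transformation described above can be sketched in a few lines of Python. This is only a minimal sketch, not the parquet-mr or parquet-cpp implementation: the function names are hypothetical, and it assumes little-endian IEEE 754 values and that the K streams are simply concatenated in order.

```python
import struct


def byte_stream_split_encode(values, fmt="<f"):
    """Scatter the bytes of each value into K streams and concatenate them.

    fmt is a struct format: "<f" for 4-byte floats, "<d" for 8-byte doubles.
    (Hypothetical helper, for illustration only.)
    """
    width = struct.calcsize(fmt)  # K: number of bytes per value
    raw = b"".join(struct.pack(fmt, v) for v in values)
    # Stream k holds byte k of every value: raw[k], raw[k + K], raw[k + 2K], ...
    return b"".join(raw[k::width] for k in range(width))


def byte_stream_split_decode(data, fmt="<f"):
    """Inverse of byte_stream_split_encode."""
    width = struct.calcsize(fmt)
    if len(data) % width != 0:
        raise ValueError("data length is not a multiple of the value width")
    n = len(data) // width  # N: number of encoded values (length of each stream)
    values = []
    for i in range(n):
        # Byte k of value i sits at offset k * N + i in the concatenated streams.
        chunk = bytes(data[k * n + i] for k in range(width))
        values.append(struct.unpack(fmt, chunk)[0])
    return values


encoded = byte_stream_split_encode([1.0, 2.0, 3.0])
print(encoded.hex())                      # 0000000000008000403f4040
print(byte_stream_split_decode(encoded))  # [1.0, 2.0, 3.0]
```

Note that the decoder has to know N, the number of values actually encoded, to locate the start of each stream; in this sketch it is derived from the buffer length.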



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (PARQUET-1622) Add BYTE_STREAM_SPLIT encoding

2023-01-17 Thread Xuwei Fu (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-1622?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17678055#comment-17678055
 ] 

Xuwei Fu commented on PARQUET-1622:
---

[~gszadovszky] [~martinradev] 

Hi all, I ran into a problem here: [https://github.com/apache/arrow/issues/15173]

Would you mind taking a look? It seems we don't have a "non-null value count" here.






[jira] [Commented] (PARQUET-1622) Add BYTE_STREAM_SPLIT encoding

2020-02-12 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-1622?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17035310#comment-17035310
 ] 

ASF GitHub Bot commented on PARQUET-1622:
-

gszadovszky commented on pull request #705: PARQUET-1622: Add implementation for BYTE_STREAM_SPLIT
URL: https://github.com/apache/parquet-mr/pull/705

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Add BYTE_STREAM_SPLIT encoding
> --
>
> Key: PARQUET-1622
> URL: https://issues.apache.org/jira/browse/PARQUET-1622
> Project: Parquet
>  Issue Type: New Feature
>  Components: parquet-cpp, parquet-format, parquet-mr, parquet-thrift
> Reporter: Martin Radev
> Assignee: Martin Radev
> Priority: Minor
>  Labels: features, pull-request-available
> Fix For: format-2.8.0
>
>   Original Estimate: 48h
>  Remaining Estimate: 48h


