[
https://issues.apache.org/jira/browse/ARROW-17008?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17563989#comment-17563989
]
Antoine Pitrou commented on ARROW-17008:
----------------------------------------
I don't think that's the explanation, no. Again, the explanation is simply that
Snappy is bad at compressing this particular data.
Also note that the integer-vs-double comparison is a bit biased, because you are
comparing int32 against float64. If you were comparing int64 against float64
instead, you would see that int64 compresses as well as float64.
(You get int32 in R because R doesn't have any native 64-bit integers, AFAIU.)
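For example, a like-for-like 64-bit comparison can be run from R (a minimal
sketch; it assumes the bit64 package, whose integer64 class arrow writes as
Parquet int64, and the file names are illustrative):
{code:r}
library(arrow)

# Compare 64-bit integers against 64-bit doubles, rather than
# int32 against float64 (assumption: bit64 is installed).
x <- 1:1e6
write_parquet(data.frame(x = bit64::as.integer64(x)),
              "snappy64.parquet", compression = "snappy")
write_parquet(data.frame(x = as.double(x)),
              "snappyd64.parquet", compression = "snappy")

# The two Snappy-compressed files should come out roughly the same size.
file.size("snappy64.parquet")
file.size("snappyd64.parquet")
{code}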
> [R] Parquet Snappy Compression Fails for Integer Type Data
> ----------------------------------------------------------
>
> Key: ARROW-17008
> URL: https://issues.apache.org/jira/browse/ARROW-17008
> Project: Apache Arrow
> Issue Type: Bug
> Components: R
> Affects Versions: 8.0.0
> Environment: R4.2.1 Ubuntu 22.04 x86_64
> R4.1.2 Ubuntu 22.04 Aarch64
> Reporter: Charlie Gao
> Priority: Major
>
> Snappy compression is not working when writing to parquet for integer type
> data.
> E.g. compare file sizes for:
> {code:r}
> write_parquet(data.frame(x = 1:1e6), "snappy.parquet", compression = "snappy")
> write_parquet(data.frame(x = 1:1e6), "uncomp.parquet", compression = "uncompressed")
> {code}
> whereas for double:
> {code:r}
> write_parquet(data.frame(x = as.double(1:1e6)), "snappyd.parquet", compression = "snappy")
> write_parquet(data.frame(x = as.double(1:1e6)), "uncompd.parquet", compression = "uncompressed")
> {code}
> I have inspected the integer files using parquet-tools and the compression
> level shows as 0%. By contrast, I can achieve compression on the same data
> using Spark (sparklyr) etc.
> Thanks.
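For reference, the reported behaviour can be checked directly from R by
comparing file sizes (a minimal sketch reusing the file names from the report;
the size check itself is an illustrative addition):
{code:r}
library(arrow)

# Reproduce the integer case from the report and compare sizes.
write_parquet(data.frame(x = 1:1e6), "snappy.parquet", compression = "snappy")
write_parquet(data.frame(x = 1:1e6), "uncomp.parquet", compression = "uncompressed")

# Per the report, this ratio is close to 1: Snappy saves almost nothing
# on this int32 column, consistent with the explanation above.
file.size("snappy.parquet") / file.size("uncomp.parquet")
{code}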
--
This message was sent by Atlassian Jira
(v8.20.10#820010)