[GitHub] [arrow] nealrichardson commented on pull request #13625: ARROW-16612: [R] Support inferring compression from filename for all readers/writers

GitBox Fri, 22 Jul 2022 05:56:06 -0700


nealrichardson commented on PR #13625:
URL: https://github.com/apache/arrow/pull/13625#issuecomment-1192545063


   > after this PR we get a file with a .gz extension that is not gzipped
   
   The file itself isn't gzipped, but gzip compression is used internally to 
compress the Parquet file contents. I agree that this is odd, but it is 
consistent with my understanding of how compression filename extensions are 
customarily used with Parquet.
   
   Weirdness aside, the bigger issue IMO is that this PR fixes at least 3 bugs 
in the current code. On master, `write_parquet("XXXX.gz")` writes a 
Parquet file and then wraps the whole file in gzip, but then 
`read_parquet("XXXX.gz")` can't read it. Moreover, `write_parquet("XXXX.zst")` 
would also write a gzipped file, and `write_parquet("XXXX.snappy")` wouldn't 
compress at all. If we think that `write_parquet` shouldn't infer compression 
from the filename at all, that's fine, we can make that change on top of this 
PR, but we should move forward with the rest of the changes.
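   
   To make the distinction concrete, here is a minimal sketch in plain Python 
(standard library only, not the arrow package and not the code in this PR). 
The names `EXT_TO_CODEC`, `infer_codec`, and `is_gzip_wrapped` are 
hypothetical, and the extension table is an assumption for illustration. It 
contrasts choosing an internal codec from the filename with gzipping around 
the whole file, which can be detected from the leading magic bytes.

```python
import gzip
import os

# Hypothetical extension-to-codec table; NOT the arrow package's actual mapping.
EXT_TO_CODEC = {".gz": "gzip", ".zst": "zstd", ".snappy": "snappy"}

GZIP_MAGIC = b"\x1f\x8b"   # first two bytes of any gzip stream
PARQUET_MAGIC = b"PAR1"    # first four bytes of a Parquet file


def infer_codec(path, default="uncompressed"):
    """Return the codec implied by the filename's last extension."""
    ext = os.path.splitext(path)[1].lower()
    return EXT_TO_CODEC.get(ext, default)


def is_gzip_wrapped(data):
    """True if the bytes form a gzip stream, i.e. gzip was applied around the file."""
    return data[:2] == GZIP_MAGIC


# Stand-in for Parquet bytes: only the magic header and footer are real here.
fake_parquet = PARQUET_MAGIC + b"column chunks..." + PARQUET_MAGIC

# Gzipping around the file hides the Parquet magic from a reader
# that expects to see it at offset 0.
wrapped = gzip.compress(fake_parquet)
```

   With internal compression the file still begins with `PAR1`, so a Parquet 
reader can open it directly whatever the extension says. A file wrapped in 
gzip begins with the gzip magic instead and has to be decompressed first.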


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

