Hi all,

I used `gzfile` and `gzcon` to read the same compressed data, but `gzcon` gave me a different result than `gzfile`. It looks as if `gzcon` does not handle the data correctly. I have posted an example below. In the example, the first 4 MiB of a compressed file is downloaded from Google Cloud as a raw vector, and the data is saved to a temp file. If I use `gzfile` to read the file, it returns the first 1000 lines successfully. However, if I wrap the raw vector in a raw connection and use `gzcon` to read from that connection, it returns only the first 884 lines along with a warning (see the output).
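In case it helps anyone who wants to try the same comparison without the cloud download, here is a minimal base-R sketch. It rests on an assumption I have not confirmed: that the difference is related to `.bgz` files being made of several gzip members stored back to back. The helper `write_member` and the toy lines are made up for illustration, and what `gzcon` returns here may depend on the R version, so I do not state an expected result for it.

```r
## Hypothetical helper: compress some lines into one gzip member
## and return the compressed bytes as a raw vector.
write_member <- function(lines) {
  path <- tempfile(fileext = ".gz")
  con <- gzfile(path, "wb")
  writeLines(lines, con)
  close(con)
  readBin(path, raw(), file.size(path))
}

## Two independent gzip members concatenated into a single stream
## (assumption: this mimics the block layout of a .bgz file).
stream <- c(write_member(c("a", "b")), write_member(c("c", "d")))

## Read the stream from a file using `gzfile`
file_path <- tempfile(fileext = ".gz")
writeBin(stream, file_path)
con1 <- gzfile(file_path)
lines_file <- readLines(con1)
close(con1)

## Read the same bytes using `gzcon` on a raw connection
con2 <- gzcon(rawConnection(stream))
lines_con <- readLines(con2)
close(con2)

## Compare how many lines each approach returned
cat("gzfile:", length(lines_file), "lines\n")
cat("gzcon: ", length(lines_con), "lines\n")
```

If the two counts differ, that would point at the multi-member layout rather than at the download step.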
Code:

    # install.packages("BiocManager")
    # BiocManager::install("GCSConnection", version = "devel")
    library(GCSConnection)

    ## Download data from cloud
    uri <- "gs://gnomad-public/release/3.0/vcf/genomes/gnomad.genomes.r3.0.sites.chr1.vcf.bgz"
    con <- gcs_connection(uri)
    data <- readBin(con, raw(), 4*1024*1024)
    close(con)

    ## Write data to a file
    file_path <- tempfile()
    writeBin(data, file_path)

    ## Read the data using `gzfile`
    con1 <- gzfile(file_path)
    str(readLines(con1, 1000))

    ## Read the data using `gzcon`
    ## We create a raw connection from the raw vector
    con2 <- gzcon(rawConnection(data))
    str(readLines(con2, 1000))

Output:

    > str(readLines(con1, 1000))
     chr [1:1000] "##fileformat=VCFv4.2" "##hailversion=0.2.24-9cd88d97bedd" ...
    > str(readLines(con2, 1000))
     chr [1:884] "##fileformat=VCFv4.2" "##hailversion=0.2.24-9cd88d97bedd" ...
    Warning message:
    In readLines(con2, 1000) : incomplete final line found on 'gzcon(data)'

I am not sure whether this is caused by a bug in `gzcon` or by my misuse of the function. I see the same result with R 4.0 and R 4.1-devel on Windows. Here is my session info; I hope it is helpful. Any suggestions and help would be appreciated.

    R Under development (unstable) (2020-06-27 r78747)
    Platform: x86_64-w64-mingw32/x64 (64-bit)
    Running under: Windows 10 x64 (build 18363)

    Matrix products: default

    locale:
    [1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252
    [3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C
    [5] LC_TIME=English_United States.1252
    system code page: 65001

Best,
Jiefei

______________________________________________
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel