On 05/03/2018 05:48 AM, Joris Meys wrote:
Dear all,
I've been diving a bit deeper into this at the request of Tomas Kalibera, and
found the following:
- the lock on the file only appears after trying to read it using oligo, so
that's not an R problem in itself. The download problem itself is independent
of external packages.
- using Windows' fc utility and cygwin's cmp utility I found out that every
so often the download.file() function inserts an extra byte. There's no
real obvious pattern in how these bytes are added, but the file downloaded
using download.file() is actually larger (in this case by about 8 kB). The
file xxx_inR.CEL.gz is downloaded using:
I believe the difference between mode = "w" and mode = "wb", and the reason
this is restricted to Windows downloads, comes down to text file line
endings: with mode = "w", download.file (and many other utilities outside
R) writes "foo\n" to disk as "foo\r\n". Obviously this messes up binary
files.
I guess in the CEL.gz file there are about 8k "\n" characters.
Henrik's suggestion (default = "wb") would introduce the complementary
problem -- text files would have incorrect line endings.
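To make the line-ending translation described above concrete, here is a minimal sketch that simulates it on a raw vector, so it runs on any platform. simulate_text_mode() is a hypothetical helper written for illustration; it is not part of R and only imitates what a text-mode write on Windows is assumed to do.

```r
# Hypothetical helper: imitate a Windows text-mode write, in which every
# LF byte (0x0a) is written to disk as CR LF (0x0d 0x0a).
simulate_text_mode <- function(bytes) {
  stopifnot(is.raw(bytes))
  is_lf <- bytes == as.raw(0x0a)
  out <- raw(length(bytes) + sum(is_lf))
  # Each original byte moves right by the number of CRs inserted up to it.
  pos <- seq_along(bytes) + cumsum(is_lf)
  out[pos] <- bytes
  # The slot just before each LF receives the inserted CR.
  out[pos[is_lf] - 1L] <- as.raw(0x0d)
  out
}

# A few bytes standing in for compressed data that happen to contain 0x0a.
payload <- as.raw(c(0x1f, 0x8b, 0x0a, 0x00, 0x0a))
written <- simulate_text_mode(payload)
length(written) - length(payload)  # one extra byte per 0x0a in the payload
```

The output grows by exactly one byte for each 0x0a in the input, which is the kind of size difference one would expect between a "w" and a "wb" download of the same binary file.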
Martin
setwd("E:/Temp/genexpr/Compare")
id <- "GSM907854"
flink <- paste0("https://www.ncbi.nlm.nih.gov/geo/download/?acc=", id,
"&format=file&file=", id, "%2ECEL%2Egz")
fname <- paste0(id,"_inR.CEL.gz")
download.file(flink,
destfile = fname)
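If the text-mode explanation is right, a possible workaround is to request binary mode explicitly. The sketch below assumes that; download_binary() is a hypothetical wrapper name, while mode = "wb" is a documented argument of download.file().

```r
# Hypothetical convenience wrapper (not part of R): always write the
# destination file in binary mode, so no line-ending translation is applied.
download_binary <- function(url, destfile, ...) {
  utils::download.file(url, destfile = destfile, mode = "wb", ...)
}

# Usage with the objects defined above (needs network access, so not run here):
# download_binary(flink, fname)
# file.size(fname)  # if the diagnosis holds, this equals the reported length
```

Setting the mode per call keeps text downloads on their current default, which avoids the complementary line-ending problem for text files.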
The file xxx_direct.CEL.gz is downloaded from
https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM907854 (download link
at the bottom of the page).
Output of dir in CMD:
05/03/2018 11:02 AM 4,529,547 GSM907854_direct.CEL.gz
05/03/2018 11:17 AM 4,537,668 GSM907854_inR.CEL.gz
or from R :
diff(file.size(dir())) # contains both CEL files.
[1] 8121
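One way to check whether the 8121 extra bytes are carriage returns inserted before line feeds is to strip them again and compare the result with the directly downloaded file. This is only a sketch: undo_text_mode() is a hypothetical helper, and the comparison assumes both files are in the working directory.

```r
# Hypothetical helper: drop a CR byte (0x0d) wherever it directly precedes
# an LF byte (0x0a), i.e. reverse a text-mode "\n" -> "\r\n" translation.
undo_text_mode <- function(bytes) {
  stopifnot(is.raw(bytes))
  lf <- which(bytes == as.raw(0x0a))
  before <- lf[lf > 1L] - 1L
  cr <- before[bytes[before] == as.raw(0x0d)]
  if (length(cr) > 0L) bytes[-cr] else bytes
}

files <- c("GSM907854_direct.CEL.gz", "GSM907854_inR.CEL.gz")
if (all(file.exists(files))) {
  direct <- readBin(files[1], what = "raw", n = file.size(files[1]))
  in_r <- readBin(files[2], what = "raw", n = file.size(files[2]))
  # TRUE would mean the two files differ only by CRs inserted before LFs.
  print(identical(undo_text_mode(in_r), direct))
}
```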
Strangely enough I get the following message from download.file() :
Content type 'application/octet-stream' length 4529547 bytes (4.3 MB)
downloaded 4.3 MB
So the reported length is exactly the same as if I would download the file
directly, but the file on disk itself is larger. So it seems
download.file() is adding bytes when saving the data on disk. This
behaviour is independent of antivirus and/or firewalls turned on or off.
Also keep in mind that these are NOT standard gzipped files. These files
are a specific format for Affymetrix Human Gene 1.0 ST Arrays.
If I need to run other tests, please let me know.
Kind regards
Joris
On Wed, May 2, 2018 at 9:21 PM, Joris Meys <jorism...@gmail.com> wrote:
Dear all,
I've noticed a problem when trying to download gz files from here:
https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM907811
At the bottom one can download GSM907811.CEL.gz . If I download this
manually and try
oligo::read.celfiles("GSM907811.CEL.gz")
everything works fine. (oligo is a Bioconductor package)
However, if I download using
download.file("https://www.ncbi.nlm.nih.gov/geo/download/?acc=GSM907811&format=file&file=GSM907811%2ECEL%2Egz",
destfile = "GSM907811.CEL.gz")
The file is downloaded, but oligo::read.celfiles() returns the following
error:
Error in checkChipTypes(filenames, verbose, "affymetrix", TRUE) :
End of gz file reached unexpectedly. Perhaps this file is truncated.
Moreover, if I try to delete it after using download.file(), I get a
warning that permission is denied. I can only remove it using Windows file
explorer after I closed the R session, indicating that the connection is
still open. Yet, showConnections() doesn't show any open connections either.
Session info below. Note that I started from a completely fresh R session.
oligo is needed due to the specific file format of these gz files. They're
not standard tarred files.
Cheers
Joris
Session Info
-------------------------------------------------------------------------------------
R version 3.5.0 (2018-04-23)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows >= 8 x64 (build 9200)
Matrix products: default
locale:
[1] LC_COLLATE=English_United Kingdom.1252  LC_CTYPE=English_United Kingdom.1252
[3] LC_MONETARY=English_United Kingdom.1252 LC_NUMERIC=C
[5] LC_TIME=English_United Kingdom.1252
attached base packages:
[1] stats4 parallel stats graphics grDevices utils datasets methods
[9] base
other attached packages:
[1] pd.hugene.1.0.st.v1_3.14.1 DBI_0.8 oligo_1.44.0
[4] Biobase_2.39.2 oligoClasses_1.42.0 RSQLite_2.1.0
[7] Biostrings_2.48.0 XVector_0.19.9 IRanges_2.13.28
[10] S4Vectors_0.17.42 BiocGenerics_0.25.3
loaded via a namespace (and not attached):
[1] Rcpp_0.12.16 compiler_3.5.0
[3] BiocInstaller_1.30.0 GenomeInfoDb_1.15.5
[5] bitops_1.0-6 iterators_1.0.9
[7] tools_3.5.0 zlibbioc_1.25.0
[9] digest_0.6.15 bit_1.1-12
[11] memoise_1.1.0 preprocessCore_1.41.0
[13] lattice_0.20-35 ff_2.2-13
[15] pkgconfig_2.0.1 Matrix_1.2-14
[17] foreach_1.4.4 DelayedArray_0.5.31
[19] yaml_2.1.18 GenomeInfoDbData_1.1.0
[21] affxparser_1.52.0 bit64_0.9-7
[23] grid_3.5.0 BiocParallel_1.13.3
[25] blob_1.1.1 codetools_0.2-15
[27] matrixStats_0.53.1 GenomicRanges_1.31.23
[29] splines_3.5.0 SummarizedExperiment_1.9.17
[31] RCurl_1.95-4.10 affyio_1.49.2
--
Joris Meys
Statistical consultant
Department of Data Analysis and Mathematical Modelling
Ghent University
Coupure Links 653, B-9000 Gent (Belgium)
______________________________________________
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel