On 05/03/2018 05:48 AM, Joris Meys wrote:
Dear all,

I've been digging a bit deeper into this at the request of Tomas Kalibera, and
found the following :

- the lock on the file only appears after trying to read it using oligo, so
that's not an R problem in itself. The download problem is independent of
external packages.

- using Windows' fc utility and cygwin's cmp utility I found out that every
so often the download.file() function inserts an extra byte. There's no
real obvious pattern in how these bytes are added, but the file downloaded
using download.file() is actually larger (in this case by about 8 kB). The
file xxx_inR.CEL.gz was downloaded using:

I believe the difference between mode = "w" and mode = "wb", and the reason
this is restricted to Windows downloads, comes down to text-file line
endings: with mode = "w", download.file() (like many other utilities outside
R) writes "foo\n" as "foo\r\n". Obviously this corrupts binary files.

I guess in the CEL.gz file there are about 8k "\n" characters.
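
To illustrate the arithmetic: if text mode writes every "\n" byte (0x0a) as "\r\n", the file grows by exactly one byte per "\n" in the payload. The following is only a minimal sketch that simulates this on a raw vector. translate_lf() is a hypothetical helper, not part of R, and it merely mimics what text mode is assumed to do on Windows.

```r
# Hypothetical helper: mimic a text-mode write by turning every
# LF byte (0x0a) into the pair CR LF (0x0d 0x0a).
translate_lf <- function(bytes) {
  stopifnot(is.raw(bytes))
  is_lf <- bytes == as.raw(0x0a)
  out <- raw(length(bytes) + sum(is_lf))
  # Each original byte moves right by the number of LFs seen so far
  # (including itself), which leaves a free slot just before every LF.
  pos <- seq_along(bytes) + cumsum(is_lf)
  out[pos] <- bytes
  out[pos[is_lf] - 1L] <- as.raw(0x0d)
  out
}

# Random bytes stand in for compressed data (an assumption for illustration).
set.seed(1)
payload <- as.raw(sample(0:255, 10000, replace = TRUE))
mangled <- translate_lf(payload)

# The growth in size equals the number of LF bytes in the payload.
growth <- length(mangled) - length(payload)
n_lf   <- sum(payload == as.raw(0x0a))
print(c(growth = growth, n_lf = n_lf))
```

In compressed data roughly one byte in 256 is 0x0a, so a file of about 4.5 MB gaining some 8k bytes would at least be of the expected order of magnitude.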

Henrik's suggestion (default = "wb") would introduce the complementary problem -- text files would have incorrect line endings.
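
Whatever the default ends up being, a script can sidestep the question by passing mode = "wb" explicitly for files known to be binary and leaving the default alone for text files. A minimal sketch, where download_binary() is a hypothetical wrapper (not an existing R function) and the URL is the one from Joris's report:

```r
# Hypothetical wrapper: always write the downloaded bytes unmodified.
download_binary <- function(url, destfile, ...) {
  status <- utils::download.file(url, destfile = destfile, mode = "wb", ...)
  # download.file() returns 0 (invisibly) on success.
  if (status != 0L) stop("download failed with status ", status)
  invisible(destfile)
}

# Only attempted in an interactive session, since it needs network access.
if (interactive()) {
  download_binary(
    "https://www.ncbi.nlm.nih.gov/geo/download/?acc=GSM907854&format=file&file=GSM907854%2ECEL%2Egz",
    "GSM907854_inR.CEL.gz"
  )
}
```

This leaves text downloads on the default mode, so their line endings are still converted as before.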

Martin




setwd("E:/Temp/genexpr/Compare")
id <- "GSM907854"
flink <- "https://www.ncbi.nlm.nih.gov/geo/download/?acc=GSM907854&format=file&file=GSM907854%2ECEL%2Egz"
fname <- paste0(id,"_inR.CEL.gz")
download.file(flink,
               destfile = fname)

The file xxx_direct.CEL.gz is downloaded from
https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM907854 (download link
at the bottom of the page).

Output of dir in CMD:

05/03/2018  11:02 AM         4,529,547 GSM907854_direct.CEL.gz
05/03/2018  11:17 AM         4,537,668 GSM907854_inR.CEL.gz

or from R :

diff(file.size(dir())) # contains both CEL files.
[1] 8121
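
If the extra bytes come from line-ending translation ("\n" written as "\r\n"), then the size difference should equal the number of "\n" bytes in the directly downloaded file, and every "\n" in the R download should be preceded by "\r". A rough sketch of that check follows. read_raw(), count_lf() and count_crlf() are hypothetical helpers, and the file names are the ones from the dir listing.

```r
# Hypothetical helpers for a bytewise comparison of the two downloads.
read_raw <- function(path) readBin(path, what = "raw", n = file.size(path))

# Number of LF bytes (0x0a).
count_lf <- function(bytes) sum(bytes == as.raw(0x0a))

# Number of CR LF pairs (0x0d immediately followed by 0x0a).
count_crlf <- function(bytes) {
  n <- length(bytes)
  if (n < 2L) return(0L)
  sum(bytes[-n] == as.raw(0x0d) & bytes[-1L] == as.raw(0x0a))
}

direct <- "GSM907854_direct.CEL.gz"
in_r   <- "GSM907854_inR.CEL.gz"

# Only runs when both files are present in the working directory.
if (all(file.exists(direct, in_r))) {
  a <- read_raw(direct)
  b <- read_raw(in_r)
  cat("size difference:            ", length(b) - length(a), "\n")
  cat("LF bytes in direct download:", count_lf(a), "\n")
  cat("all LF in R download follow a CR:", count_crlf(b) == count_lf(b), "\n")
}
```

If the first two numbers agree and the last line reports TRUE, that would be consistent with the file having been written in text mode, though it would not prove it.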

Strangely enough I get the following message from download.file() :

Content type 'application/octet-stream' length 4529547 bytes (4.3 MB)
downloaded 4.3 MB

So the reported length is exactly the same as when I download the file
directly, but the file on disk is larger. So it seems
download.file() is adding bytes when saving the data to disk.  This
behaviour is independent of antivirus and/or firewalls being turned on or off.

Also keep in mind that these are NOT standard gzipped files. These files
are a specific format for Affymetrix Human Gene 1.0 ST Arrays.

If I need to run other tests, please let me know.
Kind regards

Joris

On Wed, May 2, 2018 at 9:21 PM, Joris Meys <jorism...@gmail.com> wrote:

Dear all,

I've noticed a problem when trying to download gz files from here:
https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM907811

At the bottom one can download GSM907811.CEL.gz . If I download this
manually and try

oligo::read.celfiles("GSM907811.CEL.gz")

everything works fine. (oligo is a Bioconductor package)

However, if I download using

download.file("https://www.ncbi.nlm.nih.gov/geo/download/?acc=GSM907811&format=file&file=GSM907811%2ECEL%2Egz",
               destfile = "GSM907811.CEL.gz")

The file is downloaded, but oligo::read.celfiles() returns the following
error:

Error in checkChipTypes(filenames, verbose, "affymetrix", TRUE) :
   End of gz file reached unexpectedly. Perhaps this file is truncated.

Moreover, if I try to delete it after using download.file(), I get a
warning that permission is denied. I can only remove it using Windows file
explorer after I have closed the R session, which suggests that a connection
is still open. Yet showConnections() doesn't show any open connections.

Session info below. Note that I started from a completely fresh R session.
oligo is needed due to the specific file format of these gz files. They're
not standard tarred files.

Cheers
Joris

Session Info
------------------------------------------------------------------------

R version 3.5.0 (2018-04-23)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows >= 8 x64 (build 9200)

Matrix products: default

locale:
[1] LC_COLLATE=English_United Kingdom.1252  LC_CTYPE=English_United Kingdom.1252
[3] LC_MONETARY=English_United Kingdom.1252 LC_NUMERIC=C
[5] LC_TIME=English_United Kingdom.1252

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets  methods
[9] base

other attached packages:
 [1] pd.hugene.1.0.st.v1_3.14.1 DBI_0.8                    oligo_1.44.0
 [4] Biobase_2.39.2             oligoClasses_1.42.0        RSQLite_2.1.0
 [7] Biostrings_2.48.0          XVector_0.19.9             IRanges_2.13.28
[10] S4Vectors_0.17.42          BiocGenerics_0.25.3

loaded via a namespace (and not attached):
  [1] Rcpp_0.12.16                compiler_3.5.0
  [3] BiocInstaller_1.30.0        GenomeInfoDb_1.15.5
  [5] bitops_1.0-6                iterators_1.0.9
  [7] tools_3.5.0                 zlibbioc_1.25.0
  [9] digest_0.6.15               bit_1.1-12
[11] memoise_1.1.0               preprocessCore_1.41.0
[13] lattice_0.20-35             ff_2.2-13
[15] pkgconfig_2.0.1             Matrix_1.2-14
[17] foreach_1.4.4               DelayedArray_0.5.31
[19] yaml_2.1.18                 GenomeInfoDbData_1.1.0
[21] affxparser_1.52.0           bit64_0.9-7
[23] grid_3.5.0                  BiocParallel_1.13.3
[25] blob_1.1.1                  codetools_0.2-15
[27] matrixStats_0.53.1          GenomicRanges_1.31.23
[29] splines_3.5.0               SummarizedExperiment_1.9.17
[31] RCurl_1.95-4.10             affyio_1.49.2


--
Joris Meys
Statistical consultant

Department of Data Analysis and Mathematical Modelling
Ghent University
Coupure Links 653, B-9000 Gent (Belgium)


-----------
Biowiskundedagen 2017-2018
http://www.biowiskundedagen.ugent.be/

-------------------------------
Disclaimer : http://helpdesk.ugent.be/e-maildisclaimer.php







______________________________________________
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
