Re: [R] Memory limit for Windows 64bit build of R
Alan,

More RAM will definitely help. But if you have an object needing more than 2^31 - 1 (roughly 2 billion) elements, you'll hit a wall regardless. This could be particularly limiting for matrices. It is less limiting for data.frame objects (where each column could have up to 2 billion elements). But many R analytics use matrices under the hood, so you may not know up front where you could hit a limit.

Jay

Original message: I have a Windows Server 2008 R2 Enterprise machine with 64-bit R installed, running on 2 x quad-core Intel Xeon 5500 processors with 24GB DDR3 1066 MHz RAM. I am seeking to analyse very large data sets (perhaps as much as 10GB) without the additional coding overhead of a package such as bigmemory. My question is this: if we were to increase the RAM on the machine to (say) 128GB, would this become a possibility? I have read the documentation on memory limits and it seems so, but I would like some additional confirmation before investing in any extra RAM.

--
John W. Emerson (Jay)
Associate Professor of Statistics, Adjunct, and Acting Director of Graduate Studies
Department of Statistics
Yale University
http://www.stat.yale.edu/~jay

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
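The element-count ceiling Jay describes can be checked directly in R; a minimal sketch, assuming the 32-bit indexing of the R versions current in this thread:

```r
# R indexes a single object with 32-bit integers, so one matrix cannot
# exceed 2^31 - 1 elements, no matter how much RAM is available.
.Machine$integer.max                    # 2147483647, i.e. 2^31 - 1

# A 50,000 x 50,000 matrix needs 2.5e9 elements -- over the limit --
# while a data.frame could hold many columns of up to 2^31 - 1 rows each.
50000 * 50000 > .Machine$integer.max    # TRUE
```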
Re: [R] bigmemory
To answer your first question about read.big.matrix(): we don't know what your acc3.dat file is, but it doesn't appear to have been detected as a standard file (like a CSV file), or perhaps it doesn't even exist (or doesn't exist in your current directory)?

Next, you wrote: In addition, I am planning to do a multiple imputation with the MICE package using the data read by the bigmemory package. So usually, the multiple imputation code is like this: imp = mice(data.frame, m=50, seed=1234, print=F) ... the data.frame is required. How can I change the big.matrix class generated by the bigmemory package to a data.frame?

Please read the help files for bigmemory -- only matrix-like objects are supported. However, the more serious problem is that you can't expect to run just any R function on a big.matrix (or on an ff object, if you check out ff for some nice features). In particular, for large data sets you would likely use up all of your RAM (other reasons are more subtle and important, but out of place in this reply).

Jay
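A hedged sketch of the conversion the poster asked about: a big.matrix can be copied back into an ordinary data.frame only if it fits in RAM, which defeats the purpose at scale (the object names and sizes here are purely illustrative):

```r
library(bigmemory)

# A tiny in-memory big.matrix just for illustration:
x <- as.big.matrix(matrix(rnorm(20), nrow = 5, ncol = 4))

# x[, ] materializes a plain R matrix copy in RAM; as.data.frame() then
# gives the data.frame that functions like mice() require.
dat <- as.data.frame(x[, ])

# At this toy scale, dat could be passed on, e.g. mice(dat, m = 50, ...);
# with a genuinely large big.matrix this copy would exhaust memory.
```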
Re: [R] bigmemory
R internally uses 32-bit integers for indexing (though this may change). For this and other reasons, these external objects with specialized purposes (larger-than-RAM data, shared memory) simply can't behave exactly like R objects. Best case, some R functions will work. Others would simply break. Others would perhaps work if the problem is small enough, but would choke on the creation of temporary objects in memory. I understand your sentiment, but it isn't that easy. If you are interested, however, we do provide examples of authoring functions in C++ which can work interchangeably on both matrix and big.matrix objects.

Jay

Hi Jay, I have a question about your reply. You mentioned that the more serious problem is that you can't expect to run just any R function on a big.matrix (or on an ff object, if you check out ff for some nice features). I am confused about why the packages cannot communicate with each other. I understand that, for some programming or statistical reasons, a package may need its own class so that a specific algorithm can be implemented. However, for R as a statistical programming environment, one of its advantages is the abundance of packages within the R structure. If different packages generate different kinds of objects that cannot be recognized and used for further analysis by other packages, then each package would appear to be similar to an ordinary standalone piece of software, e.g., SAS or MATLAB... and this could reduce R's overall ability to handle complicated analysis situations. This is just a general thought. Thank you very much.

-- ya
[R] bigmemory on Solaris
At one point we might have gotten something working (an older version?) on Solaris x86, but we were never successful on Solaris sparc that I remember -- it isn't a platform we can test and support. We believe there are problems with BOOST library compatibilities. We'll try (again) to clear up the other warnings in the logs, though. !-)

We should also revisit the possibility of a CRAN BOOST library for use by a small group of packages (like bigmemory), which might make patches to BOOST easier to track and maintain. This might improve things in the long run.

Jay
Re: [R] Foreach (doMC)
Jannis,

I'm not completely sure I understand your first point, but maybe someone from REvolution will weigh in. Nobody is forcing anyone to purchase any products, and there are attractive alternatives such as CRAN R and RStudio (to name two). This issue has arisen many times on the various lists, and you are welcome to search the archives and read many very intelligent, thoughtful opinions.

As for foreach, etc.: if you have fairly focused questions (preferably with a reproducible example if there is a problem), and if you have read the available examples of its use, then you might try joining the r-sig-...@r-project.org group. Clearly there are far more users of core R, and hence mainstream questions on r-help are likely to be answered more quickly (on average) than specialized questions.

Regards, Jay

On Thu, Oct 20, 2011 at 4:27 PM, Jannis bt_jan...@yahoo.de wrote: Dear list members, dear Jay, Well, I personally do not care about Revolution Analytics selling their products, as this is also compatible with the idea of many open source licences -- especially as Revolution provides their packages to the community, and it is everybody's personal choice to buy their special R version. I was just wondering about this issue because usually most questions on r-help are answered quite soon and by many different people, and I had the impression that this is not the case for posts regarding the foreach/doMC/doSMP etc. packages. This may, however, also be due to the probably limited use of these packages for most users, who do not need these high-performance computing features. Or it was just my personal perception, or pure chance. Thanks, however, to the authors of such packages! They were of great help to me on several occasions, and I have deep respect for everybody devoting his time to open source software! Jannis

On 10/19/2011 01:26 PM, Jay Emerson wrote: P.S.
Is there any particular reason why there are so seldom answers to posts regarding foreach and all these doMC/doSMP packages? Do so few people use these packages, or does this have anything to do with their commercial origin?

Jannis,

An interesting question. I'm a huge fan of foreach and the parallel backends, and have used foreach in some of my packages. It leaves the choice of backend to the user, rather than forcing some environment. If you like multicore, great -- the package doesn't care. Someone else may use doSNOW. No problem.

To answer your question, foreach was originally written (primarily, at least) by Steve Weston, previously of REvolution Computing. It, along with some of the parallel backends (perhaps all at this point; I'm out of touch), is available open-source. Hence, I'd argue that the commercial origin is a moot point -- it doesn't matter, it will always be available, and it's really useful. Steve is no longer with REvolution, however, and I can't speak for the responsiveness/interest of current REvolution folks on this point. Scanning R-help daily for things relating to my own packages is something I try to do, but it doesn't always happen.

I would like to think foreach is widely used -- it does have a growing list of reverse depends/suggests. And it was updated as recently as last May, I just noticed: http://cran.r-project.org/web/packages/foreach/index.html

Jay
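The backend independence Jay describes can be seen in miniature: with no parallel backend registered, foreach simply runs sequentially via %do%. A minimal sketch, assuming only that the foreach package is installed:

```r
library(foreach)

# %do% evaluates the loop body sequentially; the same code with %dopar%
# would run in parallel under whatever do* backend the user registers.
res <- foreach(i = 1:4, .combine = c) %do% i^2
res   # 1 4 9 16
```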
Re: [R] efficient coding with foreach and bigmemory
First, we strongly recommend 64-bit R. Otherwise, you may not be able to scale up as far as you would like.

Second, as I think you realize, with big objects you may have to do things in chunks. I generally recommend working a column at a time rather than in blocks of rows if possible (better performance, particularly if filebacking is used because the matrices exceed RAM), and you may find that an alternative data organization can really pay off. Keep an open mind.

Third, you really need to avoid this runif(1, ...) usage. It can't possibly be efficient. If a single call to runif() doesn't work, break it into chunks, certainly, but going down to chunks of size 1 just can't make any sense.

Fourth, although you aren't there yet, once you get to the point where you are trying to do things in parallel with foreach and bigmemory, you *may* need to place the following inside your foreach loop to make use of the shared memory properly:

mdesc <- describe(m)
foreach(...) %dopar% {
  require(bigmemory)
  m <- attach.big.matrix(mdesc)
  ## now operate on m
}

I say *may* because the backend doMC (not available on Windows) does not require this, but the other backends do; otherwise, the workers will not be able to properly address the shared-memory or filebacked big.matrix. Some documentation on bigmemory.org may help, and feel free to email us directly.

Jay
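Jay's second and third points can be combined in one sketch: fill a big.matrix column in chunks, with one vectorized runif() call per chunk instead of one call per element (the dimensions and chunk size here are illustrative assumptions):

```r
library(bigmemory)

m <- big.matrix(nrow = 1e6, ncol = 2, type = "double")

chunk <- 1e5   # elements per chunk; tune to available RAM
for (start in seq(1, nrow(m), by = chunk)) {
  idx <- start:min(start + chunk - 1, nrow(m))
  # One runif() call per chunk -- never runif(1) per cell:
  m[idx, 1] <- runif(length(idx))
}
```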
Re: [R] Exception while using NeweyWest function with doMC
Simon,

Though we're pleased to see another use of bigmemory, it really isn't clear that it is gaining you anything in your example; something like as.big.matrix(matrix(...)) still consumes full RAM for both the inner matrix() and the new big.matrix -- is the filebacking really necessary? It also doesn't appear that you are making use of shared memory, so I'm unsure what the gains are. However, I don't have any particular insight into the subsequent problem with NeweyWest (which doesn't seem to be using the big.matrix objects).

Jay

-- Message: 32 Date: Sat, 27 Aug 2011 21:37:55 +0200 From: Simon Zehnder simon.zehn...@googlemail.com To: r-help@r-project.org Subject: [R] Exception while using NeweyWest function with doMC

Dear R users, I am using R right now for a simulation of a model that needs a lot of memory. Therefore I use the *bigmemory* package and -- to make it faster -- the *doMC* package. See my code posted on http://pastebin.com/dFRGdNrG

snip
Re: [R] Installation of bigmemory fails
Premal,

Package authors generally welcome direct emails. We've been away from this project since the release of 2.13.0, and I only just noticed the build errors. These generally occur because of some (usually small and solvable) problem with the compilers and the BOOST libraries. We'll look at it and see what we can do. Please email us if you don't hear back in the next week or so. Thanks, Jay

---

Hello All, I tried to install the bigmemory package from a CRAN mirror site and received the following output while installing. Any idea what's going on and how to fix it? The system details are provided below.

- begin error messages ---
* installing *source* package 'bigmemory' ...
checking for Sun Studio compiler... no
checking for Darwin... yes
** libs
g++45 -I/usr/local/lib/R/include -I../inst/include -fpic -O2 -fno-strict-aliasing -pipe -Wl,-rpath=/usr/local/lib/gcc45 -c BigMatrix.cpp -o BigMatrix.o
g++45 -I/usr/local/lib/R/include -I../inst/include -fpic -O2 -fno-strict-aliasing -pipe -Wl,-rpath=/usr/local/lib/gcc45 -c SharedCounter.cpp -o SharedCounter.o
g++45 -I/usr/local/lib/R/include -I../inst/include -fpic -O2 -fno-strict-aliasing -pipe -Wl,-rpath=/usr/local/lib/gcc45 -c bigmemory.cpp -o bigmemory.o
bigmemory.cpp: In function 'bool TooManyRIndices(index_type)':
bigmemory.cpp:40:27: error: 'powl' was not declared in this scope
*** Error code 1
Stop in /tmp/Rtmpxwe3p4/R.INSTALL4f539336/bigmemory/src.
ERROR: compilation failed for package 'bigmemory'
* removing '/usr/local/lib/R/library/bigmemory'
The downloaded packages are in '/tmp/RtmpMZCOVp/downloaded_packages'
Updating HTML index of packages in '.Library'
Making packages.html ... done
Warning message:
In install.packages("bigmemory") : installation of package 'bigmemory' had non-zero exit status
- end error messages -

It's a 64-bit FreeBSD 7.2 system running R version 2.13.0. Thanks, Premal
Re: [R] Kolmogorov-smirnov test
Taylor Arnold and I have developed a package, ks.test (available on R-Forge in beta version), that modifies stats::ks.test to handle discrete null distributions for one-sample tests. We also have a draft of a paper we could provide (email us). The package uses the methodology of Conover (1972) and Gleser (1985) to provide exact p-values. It also corrects an algorithmic problem with stats::ks.test in the calculation of the test statistic. This is not a bug, per se, because it was never intended to be used this way. We will submit this new function for inclusion in package stats once we're done testing.

So, for example:

# With the default ks.test (ouch):
stats::ks.test(c(0,1), ecdf(c(0,1)))

        One-sample Kolmogorov-Smirnov test
data:  c(0, 1)
D = 0.5, p-value = 0.5
alternative hypothesis: two-sided

# With our new function (what you would want in this toy example):
ks.test::ks.test(c(0,1), ecdf(c(0,1)))

        One-sample Kolmogorov-Smirnov test
data:  c(0, 1)
D = 0, p-value = 1
alternative hypothesis: two-sided

Original Message: Date: Mon, 28 Feb 2011 21:31:26 +1100 From: Glen Barnett glnbr...@gmail.com To: tsippel tsip...@gmail.com Cc: r-help@r-project.org Subject: Re: [R] Kolmogorov-smirnov test

It's designed for continuous distributions. See the first sentence here: http://en.wikipedia.org/wiki/Kolmogorov%E2%80%93Smirnov_test K-S is conservative on discrete distributions.

On Sat, Feb 19, 2011 at 1:52 PM, tsippel tsip...@gmail.com wrote: Is the Kolmogorov-Smirnov test valid on both continuous and discrete data? I don't think so, and the example below helped me understand why. A suggestion on testing the discrete data would be appreciated. Thanks,
Re: [R] lm without intercept
No -- though this is a cute problem: the definition of R^2 changes without the intercept, because the empty model used for calculating the total sum of squares is always predicting 0 (so the total sum of squares is the sum of squares of the observations themselves, without centering around the sample mean).

Your interpretation of the p-value for the intercept in the first model is also backwards: 0.9535 is extremely weak evidence against the hypothesis that the intercept is 0. That is, the intercept might be near zero, but could also be something very different. With a standard error of 229, your 95% confidence interval for the intercept (if you trusted it based on other things) would have a margin of error of well over 400. If you told me that an intercept of, say, 350 or 400 was consistent with your knowledge of the problem, I wouldn't blink.

This is a very small data set; if you sent an R command such as:

x <- c(x1, x2, ..., xn)
y <- c(y1, y2, ..., yn)

you might even get some more interesting feedback. One of the many good intro stats textbooks might also be helpful as you get up to speed.

Jay

- Original post: Date: Fri, 18 Feb 2011 11:49:41 +0100 From: Jan jrheinlaen...@gmx.de To: r-help@r-project.org Subject: [R] lm without intercept

Hi, I am not a statistics expert, so I have this question. A linear model gives me the following summary:

Call:
lm(formula = N ~ N_alt)

Residuals:
    Min      1Q  Median      3Q     Max
-110.30  -35.80  -22.77   38.07  122.76

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  13.5177   229.0764   0.059   0.9535
N_alt         0.2832     0.1501   1.886   0.0739 .
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 56.77 on 20 degrees of freedom
  (16 observations deleted due to missingness)
Multiple R-squared: 0.151, Adjusted R-squared: 0.1086
F-statistic: 3.558 on 1 and 20 DF, p-value: 0.07386

The regression is not very good (high p-value, low R-squared). The Pr value for the intercept seems to indicate that it is zero with a very high probability (95.35%). So I repeat the regression forcing the intercept to zero:

Call:
lm(formula = N ~ N_alt - 1)

Residuals:
    Min      1Q  Median      3Q     Max
-110.11  -36.35  -22.13   38.59  123.23

Coefficients:
      Estimate Std. Error t value Pr(>|t|)
N_alt 0.292046   0.007742   37.72   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 55.41 on 21 degrees of freedom
  (16 observations deleted due to missingness)
Multiple R-squared: 0.9855, Adjusted R-squared: 0.9848
F-statistic: 1423 on 1 and 21 DF, p-value: < 2.2e-16

1. Is my interpretation correct?
2. Is it possible that just by forcing the intercept to become zero, a bad regression becomes an extremely good one?
3. Why doesn't lm suggest a value of zero (or near zero) by itself if the regression is so much better with it?

Please excuse my ignorance.

Jan Rheinländer
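The R^2 inflation Jay describes is easy to reproduce with made-up data (this sketch uses simulated values, not the poster's data):

```r
# Simulated data with a genuinely nonzero intercept:
set.seed(1)
x <- runif(20, 100, 200)
y <- 50 + 0.3 * x + rnorm(20, sd = 10)

# With the intercept, total SS is centered around mean(y):
summary(lm(y ~ x))$r.squared       # modest R^2

# Without it, total SS is the uncentered sum(y^2), so R^2 jumps
# even though the fit itself is no better:
summary(lm(y ~ x - 1))$r.squared   # much closer to 1
```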
Re: [R] [Fwd: adding more columns in big.matrix object of bigmemory package]
For good reasons (having to do with avoiding copies of massive things) we leave such merging to the user: create a new filebacking of the proper size, and fill it (likely a column at a time, assuming you have enough RAM to support that). Jay

On Fri, Dec 17, 2010 at 2:16 AM, utkarshsinghal utkarsh.sing...@global-analytics.com wrote: Hi, With reference to the mail below, I have large data sets, coming from various different sources, which I can read into filebacked big.matrix objects using library bigmemory. I want to merge them all into one big.matrix object. (Later, I want to run a regression using library biglm.) I have been trying to do this unsuccessfully for quite some time now. Can you please suggest some way? Am I missing some already available function? Even the following functionality would work for me: just appending more columns to an existing big.matrix object (not merging). The individual data sets are small enough to be read into regular R; only the combined data set is huge. Any thoughts are welcome. Thanks, Utkarsh

Original Message: Subject: adding more columns in big.matrix object of bigmemory package Date: Thu, 16 Dec 2010 18:29:38 +0530 From: utkarshsinghal utkarsh.sing...@global-analytics.com To: r-help r-h...@stat.math.ethz.ch

Hi all, Is there any way I can add more columns to an existing filebacked big.matrix object? In general, I want a way to modify an existing big.matrix object, i.e., add rows/columns, rename colnames, etc. I tried the following:

library(bigmemory)
x = read.big.matrix("test.csv", header=T, type="double", shared=T, backingfile="test.backup", descriptorfile="test.desc")
x[, "v4"] = new
Error in mmap(j, colnames(x)) : Couldn't find a match to one of the arguments.

(The above functionality is presently available in ordinary data.frames in R.) Thanks in advance, Utkarsh
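Jay's suggestion -- allocate a new, larger filebacking and fill it a column at a time -- might look like this sketch (file names and dimensions are illustrative assumptions, not from the thread):

```r
library(bigmemory)

# Attach the existing filebacked matrix via its descriptor file:
old <- attach.big.matrix("test.desc")

# Allocate a new filebacking with room for one extra column:
bigger <- filebacked.big.matrix(nrow = nrow(old), ncol = ncol(old) + 1,
                                type = "double",
                                backingfile = "test2.backup",
                                descriptorfile = "test2.desc")

# Copy a column at a time (column operations are cheap in this layout):
for (j in seq_len(ncol(old))) bigger[, j] <- old[, j]

# Then fill the appended column, e.g. with zeros:
bigger[, ncol(old) + 1] <- 0
```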
Re: [R] big data and lmer
Though bigmemory, ff, and other big-data solutions (databases, etc.) can help easily manage massive data, their data objects are not natively compatible with all the advanced functionality of R. Exceptions include lm and glm (both ff and bigmemory support this via Lumley's biglm package), kmeans, and perhaps a few other things. In many cases, it's just a matter of someone deciding to port a tool/analysis of interest to one of these different object types -- we welcome collaborators and would be happy to offer advice if you want to adapt something for bigmemory structures! Jay
Re: [R] merging and working with big data sets
I can't speak for ff and filehash, but bigmemory's data structure doesn't allow clever merges (for actually good reasons). However, it is still probably less painful (and faster) than other options, though we don't implement it: we leave it to the user, because the details may vary depending on the example and the code is trivial.

- Allocate an empty new filebacked big.matrix of the proper size.
- Fill it in chunks (typically a column at a time if you can afford the RAM overhead, or a portion of a column at a time). Column operations are more efficient than row operations (again, because of the internals of the data structure).
- Because you'll be using filebackings, RAM limitations won't matter other than the overhead of copying each chunk.

I should note: if you used separated=TRUE, each column would have a separate binary file, and a smart cbind() would be possible simply by manipulating the descriptor file. Again, not something we advise or formally provide, but it wouldn't be hard.

Jay
Re: [R] bigmemory doubt
By far the easiest way to achieve this would be to use the bigmemory C++ structures in your program itself. However, if you do something on your own (but fundamentally have a column-major matrix in shared memory), it should be possible to play around with the pointer from R/bigmemory to accomplish this, yes. Feel free to email us directly for advice. Jay

Original message: Date: Wed, 8 Sep 2010 10:52:19 +0530 (IST) From: raje...@cse.iitm.ac.in To: r-help@r-project.org Subject: [R] bigmemory doubt

Hi, Is it possible for me to read data from shared memory created by a VC++ program into R using bigmemory?
Re: [R] Bigmemory: Error Running Example
It seems very likely you are working with a 32-bit version of R, but it's still a little surprising that you would have a problem with any single year. Please tell us the operating system and version of R. Did you preprocess the airline CSV file using the utilities provided on bigmemory.org? If you don't, then anything character will be converted to NA. Is your R environment empty, or did you have other objects in memory? It might help to just do some tests yourself:

x <- big.matrix(nrow=100, ncol=10, ... other options ...)

Make sure it works, then increase the size until you get a failure. This sort of exercise is extremely helpful in situations like this.

Jay

Subject: [R] Bigmemory: Error Running Example

Hi, I am trying to run the bigmemory example provided on http://www.bigmemory.org/. The example runs on the airline data and generates a summary of the CSV files:

library(bigmemory)
library(biganalytics)
x <- read.big.matrix("2005.csv", type="integer", header=TRUE, backingfile="airline.bin", descriptorfile="airline.desc", extraCols="Age")
summary(x)

This runs fine for the provided CSV for year 1987 (size = 121MB). However, for big files, like that for year 2005 (size = 639MB), it gives the following errors:

Error in filebacked.big.matrix(nrow = nrow, ncol = ncol, type = type, :
  Problem creating filebacked matrix.
Error: object 'x' not found
Error in summary(x) : error in evaluating the argument 'object' in selecting a method for function 'summary'

Here is the output from running memory.limit():

[1] 2047

[The output of memory.profile() was also included, but its table is garbled in the archive.]

Perhaps someone who has worked with bigmemory before could throw some light on this. Were you able to run the examples successfully? Thanks in advance. Harsh Yadav
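Jay's "increase the size until you get a failure" test can be automated; a sketch (the starting size and integer type are arbitrary choices, and the loop is capped so it terminates even on a machine with ample memory):

```r
library(bigmemory)

n <- 1e6
while (n < 2^31) {
  x <- tryCatch(big.matrix(nrow = n, ncol = 10, type = "integer"),
                error = function(e) NULL)
  if (is.null(x)) {
    cat("allocation failed at nrow =", n, "\n")
    break
  }
  rm(x); gc()   # release the allocation before doubling
  n <- n * 2
}
```

On a 32-bit build with a ~2GB address space, the failure point appears long before the element-count limit is reached.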
Re: [R] (help) This is an R workspace memory processing question
You should look at packages like ff, bigmemory, RMySQL, and so on. However, you should really consider moving to a different platform for large-data work (Linux, Mac, or 64-bit Windows 7). Jay

-

This is an R workspace memory processing question. Is there a method by which R can process 10GB of data in 500MB units? My work environment: R version 2.11.1; OS: WinXP Pro SP3.
Re: [R] Parallel computing on Windows (foreach) (Sergey Goriatchev)
foreach (or virtually anything you might use for concurrent programming) only really makes sense if the work the clients are doing is substantial enough to overwhelm the communication overhead. And there are many ways to accomplish the same task more or less efficiently (for example, doing blocks of tasks in chunks rather than passing each one as an individual job). But more to the point, doSNOW works just fine on an SMP machine -- no problem, it doesn't require a cluster.

Jay

example code omitted

Not only is the sequential foreach much slower than the simple for-loop (at least in this particular instance), but I am not quite sure how to make foreach run in parallel. Where would I get this parallel backend? I looked at doMC and doRedis, but these do not run on Windows, as far as I understand. And doSNOW is something to use when you have a cluster, while I have a simple dual-core PC. It is not really clear how to make parallel computing work. Please, help. Regards, Sergey
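Both of Jay's points -- doSNOW on a single dual-core PC, and chunking work to beat the communication overhead -- can be sketched as follows (worker count and workload are illustrative):

```r
library(foreach)
library(doSNOW)

# Two local socket workers on one machine -- no cluster required:
cl <- makeCluster(2, type = "SOCK")
registerDoSNOW(cl)

# Chunk the work: a few substantial tasks, not thousands of tiny ones,
# so each dispatch amortizes its communication overhead.
res <- foreach(chunk = 1:4, .combine = c) %dopar% {
  sum(sqrt(seq(chunk, 1e6, by = 4)))
}

stopCluster(cl)
```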
[R] [R-pkgs] Bayesian change point package bcp 2.2.0 available
Version 2.2.0 of package bcp is now available. It replaces the suggests of NetWorkSpaces (previously used for optional parallel MCMC) with the dependency on package foreach, giving greater flexibility and supporting a wider range of parallel backends (see doSNOW, doMC, etc...). For those unfamiliar with foreach (thanks to Steve Weston for this contribution), it's a beautiful and highly portable looping construct which can run sequentially or in parallel based on the user's actions (rather than the programmer's choices). We think other package authors might want to consider taking advantage of it for tasks that might be computationally intensive and could be easily done in parallel. Some vignettes are available at http://cran.r-project.org/web/packages/foreach/index.html. Jay Emerson Chandra Erdman (Apologies, the first version of this announcement was not plain-text.) -- John W. Emerson (Jay) Associate Professor of Statistics Department of Statistics Yale University http://www.stat.yale.edu/~jay ___ R-packages mailing list r-packa...@r-project.org https://stat.ethz.ch/mailman/listinfo/r-packages
[R] [R-pkgs] bigmemory 4.2.3
The long-promised revision to bigmemory has arrived, with package 4.2.3 now on CRAN. The mutexes (locks) have been extracted and will be available through package synchronicity (on R-Forge, soon to appear on CRAN). Initial versions of packages biganalytics and bigtabulate are on CRAN, and new versions which resolve the warnings and have streamlined CRAN-friendly configurations will appear shortly. Package bigalgebra will remain on R-Forge for the time being as the user-interface is developed and the configuration possibilities expand. For more information, please feel free to email us or visit http://www.bigmemory.org/. Jay Emerson Mike Kane -- John W. Emerson (Jay) Associate Professor of Statistics Department of Statistics Yale University http://www.stat.yale.edu/~jay
Re: [R] bigmemory package woes
Zerdna, Please note that the CRAN version 3.12 is about to be replaced by a new cluster of packages now on R-Forge; we consider the new bigmemory >= 4.0 to be stable and recommend you start using it immediately. Please see http://www.bigmemory.org. In your case, two comments: (1) Your for() loop will generate three identical copies of filebackings on disk, yes. Note that when the loop exits, the R object xx will reference only the 3rd of these, so xx[1,1] <- 1 will modify only the third filebacking, not the first two. You'll need to use the separate descriptor files (probably created automatically for you, but we recommend naming them specifically using descriptorfile=) with attach.big.matrix() for whichever of these you really want to be using. (2) In the problem with hanging, I believe you have exhausted the shared resources on your system. This problem will no longer arise in the >= 4.0 versions, as we're handling mutexes separately rather than automatically. These shared resource limits are mysterious, depending on the OS as well as the hardware and other jobs or tasks in existence at any given point in time. But again, it shouldn't be a problem with the new version. The CRAN update should take place early next week, along with some revised documentation. Regards, Jay --- Message: 125 Date: Fri, 23 Apr 2010 13:51:32 -0800 (PST) From: zerdna az...@yahoo.com To: r-help@r-project.org Subject: [R] bigmemory package woes I have pretty big data sizes, like matrices of .5 to 1.5GB, so once I need to juggle several of them I am in need of disk cache. I am trying to use the bigmemory package but getting problems that are hard to understand. I am getting seg faults and the machine just hanging. I work, by the way, on Red Hat Linux, 64 bit R version 10. Simplest problem is just saving matrices.
When I do something like r <- matrix(rnorm(100), nr=10); library(bigmemory); for(i in 1:3) xx <- as.big.matrix(r, backingfile=paste("r", i, sep="", collapse=""), backingpath="MyDirName") it works just fine -- it saves small matrices as three different matrices on disk. However, when I try it with real size, like with r <- matrix(rnorm(5000), nr=1000), I am either getting a seg fault on saving the third big matrix, or it hangs forever. Am I doing something obviously wrong, or is it an unstable package at the moment? Could anyone recommend something similar that is reliable in this case? -- John W. Emerson (Jay) Associate Professor of Statistics Department of Statistics Yale University http://www.stat.yale.edu/~jay
Re: [R] Huge data sets and RAM problems
Stella, A few brief words of advice: 1. Work through your code a line at a time, making sure that each is what you would expect. I think some of your later problems are a result of something early not being as expected. For example, if the read.delim() is in fact not giving you what you expect, stop there before moving onwards. I suspect some funny character(s) or character encodings might be a problem. 2. 32-bit Windows can be limiting. With 2 GB of RAM, you're probably not going to be able to work effectively in native R with objects over 200-300 MB, and the error indicates that something (you or a package you're using) has simply run out of memory. So... 3. Consider more RAM (and preferably 64-bit R). Other solutions might be possible, such as using a database to handle the data transition into R. 2.5 million rows by 18 columns is apt to be around 360 MB. Although you can afford 1 (or a few) copies of this, it doesn't leave you much room for the memory overhead of working with such an object. Part of the original message below. Jay - Message: 80 Date: Mon, 19 Apr 2010 22:07:03 +0200 From: Stella Pachidi stella.pach...@gmail.com To: r-h...@stat.math.ethz.ch Subject: [R] Huge data sets and RAM problems Dear all, I am using R 2.10.1 on a laptop with a Windows 7 32-bit system, 2GB RAM and a CPU Intel Core Duo 2GHz. Finally, another problem I have is when I perform association mining on the data set using the package arules: I turn the data frame into a transactions table and then run the apriori algorithm.
When I put too low a support in order to manage to find the rules I need, the vector of rules becomes too big and I get problems with the memory, such as: Error: cannot allocate vector of size 923.1 Mb In addition: Warning messages: 1: In items(x) : Reached total allocation of 153Mb: see help(memory.size) Could you please help me with how I could allocate more RAM? Or, do you think there is a way to process the data by loading them into a document instead of loading all into RAM? Do you know how I could manage to read all my data set? I would really appreciate your help. Kind regards, Stella Pachidi -- John W. Emerson (Jay) Associate Professor of Statistics Department of Statistics Yale University http://www.stat.yale.edu/~jay
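Jay's 360 MB estimate above is plain element-count arithmetic, which is worth internalizing for sizing questions like this one; a quick check (numeric columns at 8 bytes per element assumed):

```r
rows <- 2.5e6; cols <- 18; bytes_per <- 8   # doubles are 8 bytes each
total_bytes <- rows * cols * bytes_per
stopifnot(total_bytes == 3.6e8)             # 360 MB (decimal megabytes)
# With only 2 GB of RAM, a handful of copies of such an object -- which
# many R operations make internally -- exhausts memory quickly.
```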
Re: [R] large dataset
A little more information would help, such as the number of columns? I imagine it must be large, because 100,000 rows isn't overwhelming. Second, does the read.csv() fail, or does it work but only after a long time? And third, how much RAM do you have available? R Core provides some guidelines in the Installation and Administration documentation suggesting that a single object around 10% of your RAM is reasonable, but beyond that things can become challenging, particularly once you start working with your data. There is a wide range of packages to help with large data sets. For example, RMySQL supports MySQL databases. At the other end of the spectrum, there are possibilities discussed on a nice page by Dirk Eddelbuettel which you might look at: http://cran.r-project.org/web/views/HighPerformanceComputing.html Jay -- John W. Emerson (Jay) Associate Professor of Statistics Department of Statistics Yale University http://www.stat.yale.edu/~jay (original message below) -- Message: 128 Date: Sat, 27 Mar 2010 10:19:33 +0100 From: n.vial...@libero.it To: r-help r-help@r-project.org Subject: [R] large dataset Hi, I have a question, as I'm not able to import a CSV file which contains a big dataset (100,000 records). Does someone know how many records R can handle without giving problems? What I'm facing when I try to import the file is that R generates more than 100,000 records and is very slow... thanks a lot!!!
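One practical tip in the spirit of the reply above: declaring colClasses up front spares read.csv() from guessing column types, which matters on big files. A small self-contained sketch using a temporary file (the two-column layout is invented for illustration):

```r
# Write a tiny stand-in for the poster's big CSV file.
tf <- tempfile(fileext = ".csv")
write.csv(data.frame(id = 1:5, value = (1:5) / 2), tf, row.names = FALSE)

# Declaring column classes skips the type-guessing pass; on files with
# 100,000+ records this can noticeably speed up the import.
dat <- read.csv(tf, colClasses = c("integer", "numeric"))
stopifnot(nrow(dat) == 5, is.integer(dat$id), is.numeric(dat$value))
```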
Re: [R] Mosaic plots
As pointed out by others, vcd supports mosaic plots on top of the grid engine (which is extremely helpful for those of us who love playing around with grid). The standard mosaicplot() function is directly available (it isn't clear if you knew this). The proper display of names is a real challenge faced by all of us with these plots, so you should try each version. I'm not sure what you intend to do with a legend, but if you want the ability to customize and hack code, I suggest you look at grid and a modification to vcd's version to suit your purposes. Jay Subject: [R] Mosaic Plots Hello Everyone, I want to plot Mosaic Plots. I have tried them using the iplots package (using imosaic). The problem is the names don't get aligned properly; is there a way to align the names and provide a legend in Mosaic plots using R? Also I would like to know any other packages with which I can plot Mosaic Plots. Thank you in advance, Sunita -- John W. Emerson (Jay) Associate Professor of Statistics Department of Statistics Yale University http://www.stat.yale.edu/~jay
Re: [R] question about bigmemory: releasing RAM from a big.matrix that isn't used anymore
See inline for responses. But people are always welcome to contact us directly. Hi all, I'm on a Linux server with 48Gb RAM. I did the following: x <- big.matrix(nrow=2,ncol=50,type='short',init=0,dimnames=list(1:2,1:50)) # Gets around the 2^31 issue - yeah! We strongly discourage use of dimnames. In Unix, when I hit the top command, I see R is taking up about 18Gb RAM, even though the object x is 0 bytes in R. That's fine: that's how bigmemory is supposed to work, I guess. My question is how do I return that RAM to the system once I don't want to use x any more? E.g., after rm(x) and then top in Unix, I expect that my RAM footprint is back to ~0, but it remains at 18Gb. How do I return RAM to the system? It can take a while for the OS to free up memory, even after a gc(). But it's available for re-use; if you want to be really sure, have a look in /dev/shm to make sure the shared memory segments have been deleted. Thanks, Matt -- John W. Emerson (Jay) Associate Professor of Statistics Department of Statistics Yale University http://www.stat.yale.edu/~jay
Re: [R] Multicore package: sharing/modifying variable accross processes
Renaud, Package bigmemory can help you with shared-memory matrices, either in RAM or filebacked. Mutex support currently exists as part of the package, although for various reasons it will soon be abstracted from the package and provided via a new package, synchronicity. bigmemory works beautifully with multicore. Feel free to email us with questions; we appreciate feedback. Jay Original message: Hi, I want to parallelize some computations when it's possible on multicore machines. Each computation produces a big object that I don't want to store if not necessary: in the end only the object that best fits my data has to be returned. In non-parallel mode, a single global object is updated if the current computation gets a better result than the best previously found. My plan was to use package multicore. But there is obviously an issue of concurrent access to the global result variable. Is there a way to implement something like a lock/mutex to make the procedure thread-safe? Maybe something already exists to deal with such things? It looks like package multicore runs the different processes in different environments with copy-on-change of everything when forking. Has anybody experimented with working with a shared environment with package multicore? -- John W. Emerson (Jay) Associate Professor of Statistics Department of Statistics Yale University http://www.stat.yale.edu/~jay
Re: [R] Estimation in a changepoint regression with R
Package bcp does Bayesian changepoint analysis, though not in the general regression framework. The most recent reference is Bioinformatics 24(19) 2143-2148; doi: 10.1093/bioinformatics/btn404; slightly older is JSS 23(3). Both reference some alternatives you might want to consider (including strucchange, among others). Jay Message: 4 Date: Thu, 15 Oct 2009 03:56:22 -0700 (PDT) From: FMH kagba2...@yahoo.com Subject: [R] Estimation in a changepoint regression with R To: r-help@r-project.org Dear All, I'm trying to do the estimation in a changepoint regression problem via R, but have never found any suitable function which might help me to do this. Could someone give me a hand on this matter? Thank you. -- John W. Emerson (Jay) Associate Professor of Statistics Department of Statistics Yale University http://www.stat.yale.edu/~jay
Re: [R] reading web log file into R
Sebastian, There is rarely a completely free lunch, but fortunately for us R has some wonderful tools to make this possible. R supports regular expressions with commands like grep(), gsub(), strsplit(), and others documented on the help pages. It's just a matter of constructing an algorithm that does the job. In your case, for example (though please note there are probably many different, completely reasonable approaches in R): x <- scan("logfilename", what="", sep="\n") should give you a vector of character strings, one line per element. Now, lines containing GET seem to identify interesting lines, so x <- x[grep("GET", x)] should trim it to only the interesting lines. If you want information from other lines, you'll have to treat them separately. Next, you might try y <- strsplit(x, " ") which splits each line on spaces, returning a list (one component per line) of vectors based on the split. Try it. If it looks good, you might check lapply(y, length) to see if all lines contain the same number of records. If so, you can then get quickly into a matrix: z <- matrix(unlist(y), ncol=K, byrow=TRUE) where K is the common length you just observed. If you think this is cool, great! If not, well... hire a programmer, or if you're lucky Microsoft or Apache have tools to help you with this. There might be something in the Perl/Python world. Or maybe there's a package in R designed just for this, but I encourage students to develop the raw skills... Jay -- John W. Emerson (Jay) Associate Professor of Statistics Department of Statistics Yale University http://www.stat.yale.edu/~jay
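The pipeline sketched above can be run end-to-end on a tiny in-memory stand-in for the log file (the three sample lines below are invented, and the scan() step is replaced by a literal vector):

```r
# Three fake web-log lines standing in for scan("logfilename", ...).
x <- c('1.2.3.4 - - "GET /index.html" 200',
       '5.6.7.8 - - "POST /form" 302',
       '9.9.9.9 - - "GET /about.html" 200')

x <- x[grep("GET", x)]            # keep only the interesting lines
y <- strsplit(x, " ")             # split each line on spaces
stopifnot(all(lengths(y) == 6))   # same field count on every line?
z <- matrix(unlist(y), ncol = 6, byrow = TRUE)  # one row per log line
stopifnot(nrow(z) == 2, z[1, 6] == "200")
```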
Re: [R] kmeans.big.matrix
This sort of question is ideal to send directly to the maintainer. We've removed kmeans.big.matrix for the time being and will place it in a new package, bigmemoryAnalytics. bigmemory itself is the core building block and tool, and we don't want to pollute it with lots of extras. Allan's point is right: big data packages (like bigmemory and ff) can't be used directly with R functions (like lm). And because of R's design you can't extract subsets with more than 2^31-1 elements, even though the big.matrix can be as large as you need (with filebacking). I hope that helps. Jay -- John W. Emerson (Jay) Associate Professor of Statistics Department of Statistics Yale University http://www.stat.yale.edu/~jay
Re: [R] Building a big.matrix using foreach
Michael, If you have a big.matrix, you just want to iterate over the rows. I'm not in R and am just making this up on the fly (from a bar in Beijing, if you believe that): foreach(i=1:nrow(x), .combine=c) %dopar% f(x[i,]) should work, essentially applying the function f() to the rows of x. But perhaps I misunderstand you. Please feel free to email me or Mike (michael.k...@yale.edu) directly with questions about bigmemory; we are very interested in applications of it to real problems. Note that the package foreach uses package iterators, and is very flexible, in case you need more general iteration in parallel. Regards, Jay Original message: Hi there! I have become a big fan of the 'foreach' package, allowing me to do a lot of stuff in parallel. For example, evaluating the function f on all elements in a vector x is easily accomplished: foreach(i=1:length(x), .combine=c) %dopar% f(x[i]) Here the .combine=c option tells foreach to combine output using the c() function, that is, to return it as a vector. Today I discovered the 'bigmemory' package, and I would like to construct a big.matrix in a parallel fashion row by row. To use foreach I see no other way than to come up with a substitute for c in the .combine option. I have checked out the big.matrix manual, but I can't find a function suitable for just that. Actually, I wouldn't even know how to do it for a usual matrix. Any clues? Thanks! -- Michael Knudsen micknud...@gmail.com http://lifeofknudsen.blogspot.com/ -- John W. Emerson (Jay) Associate Professor of Statistics Department of Statistics Yale University http://www.stat.yale.edu/~jay
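For comparison with the foreach row-iteration above: on an ordinary (in-RAM) matrix, the sequential equivalent is apply() over rows, which is exactly what the %dopar% loop parallelizes:

```r
x <- matrix(1:20, nrow = 4)   # small stand-in for a big.matrix
f <- function(row) sum(row)   # any per-row function
out <- apply(x, 1, f)         # sequential analogue of the foreach loop
stopifnot(identical(out, c(45L, 50L, 55L, 60L)))
```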
[R] [R-pkgs] Major bigmemory revision released.
The re-engineered bigmemory package is now available (Version 3.5 and above) on CRAN. We strongly recommend you cease using the older versions at this point. bigmemory now offers completely platform-independent support for the big.matrix class in shared memory and, optionally, as filebacked matrices for larger-than-RAM applications. We're working on updating the package vignette, and a draft is available upon request (just send me an email if you're interested). The user interface is largely unchanged. Feedback, bug reports, etc... are welcome. Jay Emerson Michael Kane -- John W. Emerson (Jay) Assistant Professor of Statistics Department of Statistics Yale University http://www.stat.yale.edu/~jay
Re: [R] Using very large matrix
Steve et al., The old version is still on CRAN, but I strongly encourage anyone interested to email me directly and I'll make the new version available. In fact, I wouldn't mind just pulling the old version off of CRAN, but of course that's not a great idea. !-) Jay On Mon, Mar 2, 2009 at 8:47 AM, steve_fried...@nps.gov wrote: I'm very interested in the bigmemory package for Windows 32-bit environments. Who do I need to contact to request the Beta version? Thanks Steve Steve Friedman Ph.D. Spatial Statistical Analyst Everglades and Dry Tortugas National Park 950 N Krome Ave (3rd Floor) Homestead, Florida 33034 steve_fried...@nps.gov Office (305) 224-4282 Fax (305) 224-4147 On 03/02/2009 10:46 AM GMT, Corrado ct...@york.ac.uk wrote to john.emer...@yale.edu and Tony Breyal tony.bre...@googlemail.com, cc r-help@r-project.org, Subject: Re: [R] Using very large matrix: Thanks a lot! Unfortunately, the R package I have to use for my research was only released for 32-bit R on 32-bit MS Windows, and only closed source. I normally use 64-bit R on 64-bit Linux :) I tried to use the bigmemory on CRAN with 32-bit Windows, but I had some serious problems. Best, On Thursday 26 February 2009 15:43:11 Jay Emerson wrote: Corrado, Package bigmemory has undergone a major re-engineering and will be available soon (available now in Beta version upon request). The version currently on CRAN is probably of limited use unless you're in Linux. bigmemory may be useful to you for data management, at the very least, where x <- filebacked.big.matrix(8, 8, init=n, type="double") would accomplish what you want using filebacking (disk space) to hold the object. But even this requires 64-bit R (Linux or Mac, or perhaps a Beta version of Windows 64-bit R that REvolution Computing is working on). Subsequent operations (e.g. extraction of a small portion for analysis) are then easy enough: y <- x[1,] would give you the first row of x as an object y in R.
Note that x is not itself an R matrix, and most existing R analytics can't work on x directly (and would max out the RAM if they tried, anyway). Feel free to email me for more information (and this invitation applies to anyone who is interested in this). Cheers, Jay #Dear friends, # #I have to use a very large matrix. Something of the sort of #matrix(8,8,n) where n is something numeric of the sort 0.xx # #I have not found a way of doing it. I keep getting the error # #Error in matrix(nrow = 8, ncol = 8, 0.2) : too many elements specified # #Any suggestions? I have searched the mailing list, but to no avail. # #Best, #-- #Corrado Topi # #Global Climate Change Biodiversity Indicators #Area 18, Department of Biology #University of York, York, YO10 5YW, UK #Phone: + 44 (0) 1904 328645, E-mail: ct...@york.ac.uk -- Corrado Topi Global Climate Change Biodiversity Indicators Area 18, Department of Biology University of York, York, YO10 5YW, UK Phone: + 44 (0) 1904 328645, E-mail: ct...@york.ac.uk -- John W. Emerson (Jay) Assistant Professor of Statistics Department of Statistics Yale University http://www.stat.yale.edu/~jay
Re: [R] R package building
I agree with others that the packaging system is generally easy to use, and between the Writing R Extensions documentation and other scattered sources (including these lists) there shouldn't be many obstacles. Using package.skeleton() is a great way to get started: I'd recommend just having one data object and one new function in the session for starters. You can build up from there. I've only run into time-consuming walls on more advanced, obscure issues. For example: the Suggests: field in DESCRIPTION generated quite some debate back in 2005, but until I found that thread in the email lists I didn't understand the issue. For completeness, I'll round out this discussion, hoping I'm correct. In essence, I think the choice of the word Suggests: was intended for the package user, not for the developer. The user isn't required to have a suggested package in order to load and use the desired package. But the developer is required (in the R CMD check) to have the suggested package in order to avoid warnings or failures. This does, actually, make sense, because we assume a developer would want/need to check features that involve the suggested package. In a few isolated cases (I think I had one of them), this caused a problem, where a desired suggested package isn't distributed by CRAN on all platforms, so I would risk getting into trouble with R CMD check on the platform without the suggested package. But this is pretty obscure, and the issue was obviously well-debated in the past. The addition of a line or two about this in Writing R Extensions would be friendly (the current content is correct and minimally sufficient, I believe). Maybe I should draft this and submit it to the group. Secondly, I would advise a newbie to the packaging system to avoid S4 at first. Ultimately, I think it's pretty cool.
But, for example, documentation on proper documentation (to handle the man pages correctly) has puzzled me, and even though I can create a package with S4 that passes R CMD check cleanly, I'm not convinced I've got it quite right. If someone has recently created more documentation or a 'white pages' on this, please do spread the word. Thanks to all who have worked -- and continue to work -- on the system! Jay Subject: [R] R package building In a few days I'll give a talk on R package development and my personal experience at the 3rd Free / Libre / Open Source Software (FLOSS) Conference, which will take place on May 27th-28th, 2008, at the National Technical University of Athens, in Greece. I would appreciate it if you could share your thoughts with me: what are today's obstacles to R package building, according to your opinion and personal experience? Thanks, -- Angelos I. Markos Scientific Associate, Dpt of Exact Sciences, TEI of Thessaloniki, GR I'm not an outlier; I just haven't found my distribution yet -- John W. Emerson (Jay) Assistant Professor of Statistics Director of Graduate Studies Department of Statistics Yale University http://www.stat.yale.edu/~jay
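The package.skeleton() starting point recommended above, as a runnable sketch (the package name demoPkg and the single function are invented for illustration):

```r
f <- function(x) x + 1                      # one new function, as suggested
pkgdir <- file.path(tempdir(), "skel-demo") # throwaway location
dir.create(pkgdir, showWarnings = FALSE)
package.skeleton(name = "demoPkg", list = "f", path = pkgdir)

# The generated tree includes a DESCRIPTION file plus R/ and man/ stubs
# to fill in before running R CMD check.
stopifnot(file.exists(file.path(pkgdir, "demoPkg", "DESCRIPTION")))
```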
Re: [R] R on a computer cluster
Gabriele, In addition to the suggestions from Markus (below), there is NetWorkSpaces (package nws). I have used both nws and snow together with a package I'm developing (bigmemoRy), which allocates matrices to shared memory (helping avoid the bottleneck Markus alluded to for processors on the same computer). Both seem quite easy to use, essentially needing only one command to initiate the cluster and then one command to do something like apply() in parallel. It takes a little planning of your application, but an embarrassingly parallel problem should be painless to implement. Jay

Hi, the performance you can obtain depends strongly on your application. If you talk about a cluster, you should think about several computers, not only one computer with several processors.

If you have several computers, you first have to decide on a communication protocol for parallel computing: MPI, PVM, ... Then you have to install it on your computers. I think you should use MPI and one of its implementations: OpenMPI, LAM/MPI. Then there are several R packages for using the communication protocols: Rmpi, snow, Rpvm, ...

If you have one computer with several processors, you can do the same things. But then you have only shared memory (a bottleneck) and there is not too much improvement in performance. R itself does not yet use multiple processors. There is a first, experimental R package using OpenMP for multi-threading: pnmath (http://www.stat.uiowa.edu/~luke/R/experimental/)

Some useful links:
http://www.stats.uwo.ca/faculty/yu/Rmpi/
http://ace.acadiau.ca/math/ACMMaC/Rmpi/
http://www.open-mpi.org/
http://www.personal.leeds.ac.uk/~bgy1mm/MPITutorial/MPIHome.html

Best regards Markus

[EMAIL PROTECTED] wrote: Dear all, I usually run R on my laptop with Windows XP Professional. Now I really want to run R on a computer cluster (4 processors) with SUSE Linux Enterprise ver. 10. But I am new to computer clusters.
Should I modify my functions in order to take advantage of the greater performance and availability compared with my laptop? Is there an R manual on parallel computation with multiple processors? Any suggestion of a basic tutorial on this topic? Thank you.

-- John W. Emerson (Jay) Assistant Professor of Statistics Director of Graduate Studies Department of Statistics Yale University http://www.stat.yale.edu/~jay REvolution Computing, Statistical Consultant
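[Editorial note: the snow usage Jay describes above -- one command to initiate the cluster, one to apply in parallel -- can be sketched roughly as follows, assuming the snow package is installed; the worker count and toy computation are illustrative.]

```r
library(snow)  # assumes the snow package is installed

## Start a 4-worker socket cluster (no MPI/PVM setup needed for SOCK).
cl <- makeCluster(4, type = "SOCK")

## One parallel apply: each element of 1:8 is handled by some worker.
res <- parSapply(cl, 1:8, function(i) i^2)

## Shut the workers down when done.
stopCluster(cl)
```

For a cluster of several machines, the spec passed to makeCluster() would name the hosts, and an MPI-based type would use Rmpi underneath; the calling code stays essentially the same.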
Re: [R] Memory problem?
Elena, Page 23 of the R Installation Guide provides some memory guidelines that you might find helpful. There are a few things you could try using R, at least to get up and running:

- Look at fewer tumors at a time using standard R, as you have been.
- Look at the ff package, which leaves the data in flat files with memory-mapped pages.
- It may be that the filehash package does something similar using a database (I'm less familiar with this).
- Wait for the upcoming bigmemoRy package, which is designed to place large objects like this in RAM (using C++) but gives you close-to-seamless interaction with them from R. Caveat below.

With any of these options, you are still very much restricted by the type of analysis you are attempting. Almost any existing procedure (e.g. a Cox model) would need a regular R object (probably impossible), and you are back to square one. An exception to this is Thomas Lumley's biglm package, which processes the data in chunks. We need more tools like these. Ultimately, you'll need to find some method of analysis that is pretty smart memory-wise, and this may not be easy. Best of luck, Jay

- Original message: I am trying to run a Cox model for the prediction of relapse of 80 cancer tumors, taking into account the expression of 17,000 genes. The data are large and I receive an error: "Cannot allocate vector of 2.4 Mb." I increased memory.limit() to 4000 (the largest supported by my computer), but I still receive the error because of other big variables that I have in the workspace. Does anyone know how to overcome this problem? Many thanks in advance, Eleni

-- John W. Emerson (Jay) Assistant Professor of Statistics Director of Graduate Studies Department of Statistics Yale University http://www.stat.yale.edu/~jay Statistical Consultant, REvolution Computing
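[Editorial note: a rough sketch of the chunked fitting that biglm performs, mentioned above. The data here are simulated for illustration; in practice each chunk would be read from disk, so the full data set never has to be in memory at once.]

```r
library(biglm)  # assumes Thomas Lumley's biglm package is installed

## Fit the linear model on a first chunk of data...
chunk1 <- data.frame(y = rnorm(1000), x1 = rnorm(1000), x2 = rnorm(1000))
fit <- biglm(y ~ x1 + x2, data = chunk1)

## ...then update the fit with each further chunk; biglm folds the new
## rows into its running sufficient statistics rather than storing them.
chunk2 <- data.frame(y = rnorm(1000), x1 = rnorm(1000), x2 = rnorm(1000))
fit <- update(fit, chunk2)

coef(fit)  # coefficients as if fit on all 2000 rows at once
```

This is the "smart memory-wise" pattern the reply recommends: the analysis is reformulated so only a bounded summary of the data is ever held in RAM.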