Re: [R] Memory limit for Windows 64bit build of R

2012-08-06 Thread Jay Emerson
Alan,

More RAM will definitely help.  But if you have an object needing more than
2^31-1 (roughly 2 billion) elements, you'll hit a wall regardless.  This
could be particularly limiting for matrices.  It is less limiting for
data.frame objects (where each column can have up to 2 billion elements).
But many R analytics use matrices under the hood, so you may not know up
front where you could hit a limit.
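
A quick back-of-the-envelope illustration (the 50,000-square matrix is just
a made-up example):

.Machine$integer.max                   # 2147483647, i.e. 2^31 - 1
50000 * 50000                          # 2.5e9 elements for a square matrix
50000 * 50000 > .Machine$integer.max   # TRUE: too many elements for one
                                       # object, even though 2.5e9 * 8 bytes
                                       # = 20 GB might conceivably fit in RAM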

Jay

 Original message 
I have a Windows Server 2008 R2 Enterprise machine with 64-bit R installed,
running on 2 x quad-core Intel Xeon 5500 processors with 24GB DDR3 1066 MHz
RAM.  I am seeking to analyse very large data sets (perhaps as much as
10GB) without the additional coding overhead of a package such as
bigmemory.

My question is this - if we were to increase the RAM on the machine to
(say) 128GB, would this become a possibility?  I have read the
documentation on memory limits and it seems so, but would like some
additional confirmation before investing in any extra RAM.
-


-- 
John W. Emerson (Jay)
Associate Professor of Statistics, Adjunct, and Acting Director of Graduate
Studies
Department of Statistics
Yale University
http://www.stat.yale.edu/~jay



Re: [R] bigmemory

2012-05-11 Thread Jay Emerson
To answer your first question about read.big.matrix(): we don't know what
your acc3.dat file is, but it doesn't appear to have been detected as a
standard file (like a CSV file), or perhaps it doesn't even exist (at least
not in your current directory).

Next:

 In addition, I am planning to do a multiple imputation with MICE package
 using the data read by bigmemory package.
 So usually, the multiple imputation code is like this:
  imp=mice(data.frame,m=50,seed=1234,print=F)
 the data.frame is required. How can I change the big.matrix class
 generated by bigmemory package to a data.frame?

Please read the help files for bigmemory -- only matrix-like objects are
supported.  However, the more serious problem is that you can't expect to
run just any R function on a big.matrix (or on an ff object, if you check
out ff for some nice features).  In particular, for large data sets you
would likely use up all of your RAM (other reasons are more subtle and
important, but out of place in this reply).
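
For instance, a minimal sketch (made-up names and sizes) of pulling a
manageable subset of a big.matrix back into an ordinary data.frame -- which
only works if the subset itself fits comfortably in RAM:

library(bigmemory)
x <- big.matrix(nrow = 1000, ncol = 5, type = "double", init = 0)
sub <- x[1:100, ]          # single-bracket extraction returns a regular R matrix
df <- as.data.frame(sub)   # now usable by functions expecting a data.frame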

Jay

-- 
John W. Emerson (Jay)
Associate Professor of Statistics
Department of Statistics
Yale University
http://www.stat.yale.edu/~jay



Re: [R] bigmemory

2012-05-11 Thread Jay Emerson
R internally uses 32-bit integers for indexing (though this may change).
For this and other reasons these external objects with specialized purposes
(larger-than-RAM, shared memory) simply can't behave exactly as R objects.
Best case, some R functions will work.  Others would simply break.  Others
would perhaps work if the problem is small enough, but would choke in the
creation of temporary objects in memory.

I understand your sentiment, but it isn't that easy.  If you are
interested, however, we do provide examples of authoring functions in C++
which can work interchangeably on both matrix and big.matrix objects.

Jay

 Hi Jay,

 I have a question about your reply.

 You mentioned that "the more serious problem is that you can't expect to
 run just any R function on a big.matrix (or on an ff object, if you check
 out ff for some nice features)."

 I am confused why the packages could not communicate with each other. I
 understand that for some programming or statistical reasons a package may
 need its own class so that a specific algorithm can be implemented.
 However, for R as a statistical programming environment, one of its
 advantages is the abundance of packages built on a common structure. If
 different packages generate different kinds of objects that cannot be
 recognized and used for further analysis by other packages, then each
 package would appear to be similar to normal standalone software, e.g.,
 SAS, MATLAB... and this could reduce R's overall ability to handle
 complicated analysis situations.

 This is just a general thought.

 Thank you very much.

 --
 ya

 --
John W. Emerson (Jay)
Associate Professor of Statistics
Department of Statistics
Yale University
http://www.stat.yale.edu/~jay



[R] bigmemory on Solaris

2011-12-01 Thread Jay Emerson
At one point we might have gotten something working (an older version?) on
Solaris x86, but we were never successful on Solaris SPARC, as far as I
remember -- it isn't a platform we can test and support.  We believe there
are problems with BOOST library compatibilities.

We'll try (again) to clear up the other warnings in the logs, though.  !-)

We should also revisit the possibility of a CRAN BOOST library for use by a
small group of packages (like bigmemory) which might make patches to BOOST
easier to track and maintain.  This might improve things in the long run.

Jay

-- 
John W. Emerson (Jay)
Associate Professor of Statistics
Department of Statistics
Yale University
http://www.stat.yale.edu/~jay



Re: [R] Foreach (doMC)

2011-10-20 Thread Jay Emerson
Jannis,

I'm not completely sure I understand your first point, but maybe someone
from REvolution will weigh in.  Nobody is forcing anyone to purchase any
products, and there are attractive alternatives such as CRAN R and
RStudio (to name two).  This issue has arisen many times on the various
lists and you are welcome to search the archives and read many very
intelligent, thoughtful opinions.

As for foreach, etc.: if you have fairly focused questions (preferably
with a reproducible example if there is a problem) and you have done some
reading on the available examples of its use, then you might try joining
the r-sig-...@r-project.org group.  Clearly there are far more users of
core R, and hence mainstream questions on r-help are likely to be answered
more quickly (on average) than specialized questions.

Regards,

Jay

On Thu, Oct 20, 2011 at 4:27 PM, Jannis bt_jan...@yahoo.de wrote:
 Dear list members, dear Jay,

 Well, I personally do not care about Revolution Analytics selling their
 products, as this is also part of the idea of many open source licences.
 Especially as Revolution provides their packages to the community, and it
 is everybody's personal choice to buy their special R version.

 I was just wondering about this issue because usually most questions on
 r-help are answered pretty soon and by many different people, and I had
 the impression that this is not the case for posts regarding the
 foreach/doMC/doSMP etc. packages. This may, however, also be due to the
 probably limited use of these packages for most users, who do not need
 these high performance computing things. Or it was just my personal
 perception or pure chance.

 Thanks, however, to the authors of such packages! They were of great help
 to me on several occasions and I have deep respect for everybody devoting
 their time to open source software!

 Jannis



 On 10/19/2011 01:26 PM, Jay Emerson wrote:

 P.S. Is there any particular reason why there are so seldom answers to
 posts regarding foreach and all these doMC/doSMP packages ?  Do so few
 people use these packages or does this have anything to do with the
 commercial origin of these packages?

 Jannis,

 An interesting question.  I'm a huge fan of foreach and the parallel
 backends, and have used foreach in some of my packages.  It leaves the
 choice of backend to the user, rather than forcing some environment.
 If you like multicore, great -- the package doesn't care.  Someone
 else may use doSNOW.  No problem.

 To answer your question, foreach was originally written by (primarily,
 at least) Steve Weston, previously of REvolution Computing.  It, along
 with some of the parallel backends (perhaps all at this point, I'm out
 of touch) are available open-source.  Hence, I'd argue that the
 commercial origin is a moot point -- it doesn't matter, it will
 always be available, and it's really useful.  Steve is no longer with
 REvolution, however, and I can't speak for the responsiveness/interest
 of current REvolution folks on this point.  Scanning R-help daily for
 things relating to my own packages is something I try to do, but it
 doesn't always happen.

 I would like to think foreach is widely used -- it does have a growing
 list of reverse depends/suggests.  And was updated as recently as last
 May, I just noticed.
 http://cran.r-project.org/web/packages/foreach/index.html

 Jay






-- 
John W. Emerson (Jay)
Associate Professor of Statistics
Department of Statistics
Yale University
http://www.stat.yale.edu/~jay



Re: [R] Foreach (doMC)

2011-10-19 Thread Jay Emerson
 P.S. Is there any particular reason why there are so seldom answers to posts 
 regarding foreach and all these doMC/doSMP packages ?  Do so few people use 
 these packages or does this have anything to do with the commercial origin of 
 these packages?

Jannis,

An interesting question.  I'm a huge fan of foreach and the parallel
backends, and have used foreach in some of my packages.  It leaves the
choice of backend to the user, rather than forcing some environment.
If you like multicore, great -- the package doesn't care.  Someone
else may use doSNOW.  No problem.

To answer your question, foreach was originally written by (primarily,
at least) Steve Weston, previously of REvolution Computing.  It, along
with some of the parallel backends (perhaps all at this point, I'm out
of touch) are available open-source.  Hence, I'd argue that the
commercial origin is a moot point -- it doesn't matter, it will
always be available, and it's really useful.  Steve is no longer with
REvolution, however, and I can't speak for the responsiveness/interest
of current REvolution folks on this point.  Scanning R-help daily for
things relating to my own packages is something I try to do, but it
doesn't always happen.

I would like to think foreach is widely used -- it does have a growing
list of reverse depends/suggests.  And was updated as recently as last
May, I just noticed.
http://cran.r-project.org/web/packages/foreach/index.html

Jay

-- 
John W. Emerson (Jay)
Associate Professor of Statistics
Department of Statistics
Yale University
http://www.stat.yale.edu/~jay



Re: [R] efficient coding with foreach and bigmemory

2011-09-30 Thread Jay Emerson
First, we strongly recommend 64-bit R.  Otherwise, you may not be able
to scale up as far as you would like.

Second, as I think you realize, with big objects you may have to do
things in chunks.  I generally recommend working a column at a time
rather than in blocks of rows if possible (better performance,
particularly if the filebacking is used because the matrices exceed
RAM), and you may find that an alternative data organization can really
pay off.  Keep an open mind.

Third, you really need to avoid this runif(1,...) usage.  It can't
possibly be efficient.  If a single call to runif() doesn't work,
break it into chunks, certainly, but going down to chunks of size 1
just can't make any sense.
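
For example, a sketch (made-up sizes, assuming a big.matrix m with 1e8 rows)
of filling a column in chunks of one million rather than with 1e8 separate
calls to runif(1):

chunk <- 1e6
for (start in seq(1, 1e8, by = chunk)) {
  m[start:(start + chunk - 1), 1] <- runif(chunk)   # one call per chunk
}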

Fourth, although you aren't there yet, once you get to the point where
you are trying to do things in parallel with foreach and bigmemory, you
*may* need to place the following inside your foreach loop to make use
of the shared memory properly:

mdesc <- describe(m)
foreach(...) %dopar% {
  require(bigmemory)
  m <- attach.big.matrix(mdesc)
  # ... now operate on m ...
}

I say *may* because the backend doMC (not available on Windows) does
not require this, but the other backends do; otherwise, the workers
will not be able to properly address the shared-memory or filebacked
big.matrix.  Some documentation on bigmemory.org may help, and feel
free to email us directly.

Jay


-- 
John W. Emerson (Jay)
Associate Professor of Statistics
Department of Statistics
Yale University
http://www.stat.yale.edu/~jay



Re: [R] Exception while using NeweyWest function with doMC

2011-08-29 Thread Jay Emerson
Simon,

Though we're pleased to see another use of bigmemory, it really isn't clear
that it is gaining you anything in your example; anything like
as.big.matrix(matrix(...)) still consumes full RAM for both the inner
matrix() and the new big.matrix -- is the filebacking really necessary?  It
also doesn't appear that you are making use of shared memory, so I'm unsure
what the gains are.  However, I don't have any particular insight into the
subsequent problem with NeweyWest (which doesn't seem to be using the
big.matrix objects).

Jay

--
Message: 32
Date: Sat, 27 Aug 2011 21:37:55 +0200
From: Simon Zehnder simon.zehn...@googlemail.com
To: r-help@r-project.org
Subject: [R] Exception while using NeweyWest function with doMC
Message-ID:
   cagqvrp_gk+t0owbv1ste-y0zafmi9s_zwqrxyxugsui18ms...@mail.gmail.com
Content-Type: text/plain

Dear R users,

I am using R right now for a simulation of a model that needs a lot of
memory. Therefore I use the *bigmemory* package and - to make it faster -
the *doMC* package. See my code posted on http://pastebin.com/dFRGdNrG

 snip 
-

-- 
John W. Emerson (Jay)
Associate Professor of Statistics
Department of Statistics
Yale University
http://www.stat.yale.edu/~jay



Re: [R] Installation of bigmemory fails

2011-06-25 Thread Jay Emerson
Premal,

Package authors generally welcome direct emails.

We've been away from this project since the release of 2.13.0 and I only just
noticed the build errors.  These generally occur because of some (usually
small and solvable) problem with compilers and the BOOST libraries.  We'll
look at it and see what we can do.  Please email us if you don't hear back
in the next week or so.

Thanks,

Jay


---
Hello All,
I tried to install the bigmemory package from a CRAN mirror site and
received the following output while installing. Any idea what's going
on and how to fix it? The system details are provided below.

- begin error messages ---
* installing *source* package 'bigmemory' ...
 checking for Sun Studio compiler...no
 checking for Darwin...yes
** libs
g++45 -I/usr/local/lib/R/include -I../inst/include -fpic  -O2
-fno-strict-aliasing -pipe -Wl,-rpath=/usr/local/lib/gcc45 -c B\
igMatrix.cpp -o BigMatrix.o
g++45 -I/usr/local/lib/R/include -I../inst/include -fpic  -O2
-fno-strict-aliasing -pipe -Wl,-rpath=/usr/local/lib/gcc45 -c S\
haredCounter.cpp -o SharedCounter.o
g++45 -I/usr/local/lib/R/include -I../inst/include -fpic  -O2
-fno-strict-aliasing -pipe -Wl,-rpath=/usr/local/lib/gcc45 -c b\
igmemory.cpp -o bigmemory.o
bigmemory.cpp: In function 'bool TooManyRIndices(index_type)':
bigmemory.cpp:40:27: error: 'powl' was not declared in this scope
*** Error code 1

Stop in /tmp/Rtmpxwe3p4/R.INSTALL4f539336/bigmemory/src.
ERROR: compilation failed for package 'bigmemory'
* removing '/usr/local/lib/R/library/bigmemory'

The downloaded packages are in
   '/tmp/RtmpMZCOVp/downloaded_packages'
Updating HTML index of packages in '.Library'
Making packages.html  ... done
Warning message:
In install.packages(bigmemory) :
 installation of package 'bigmemory' had non-zero exit status
- end error messages -
It's a 64-bit FreeBSD 7.2 system running R version 2.13.0.
Thanks,
Premal

-- 
John W. Emerson (Jay)
Associate Professor of Statistics
Department of Statistics
Yale University
http://www.stat.yale.edu/~jay



Re: [R] Kolmogorov-smirnov test

2011-02-28 Thread Jay Emerson
Taylor Arnold and I have developed a package ks.test (available on R-Forge
in beta version) that modifies stats::ks.test to handle discrete null
distributions
for one-sample tests.  We also have a draft of a paper we could provide (email
us).  The package uses methodology of Conover (1972) and Gleser (1985) to
provide exact p-values.  It also corrects an algorithmic problem with
stats::ks.test
in the calculation of the test statistic.  This is not a bug, per se,
because it was
never intended to be used this way.  We will submit this new function for
inclusion in package stats once we're done testing.

So, for example:
# With the default ks.test (ouch):
> stats::ks.test(c(0,1), ecdf(c(0,1)))

One-sample Kolmogorov-Smirnov test

data:  c(0, 1)
D = 0.5, p-value = 0.5
alternative hypothesis: two-sided

# With our new function (what you would want in this toy example):
> ks.test::ks.test(c(0,1), ecdf(c(0,1)))

One-sample Kolmogorov-Smirnov test

data:  c(0, 1)
D = 0, p-value = 1
alternative hypothesis: two-sided



Original Message:

Date: Mon, 28 Feb 2011 21:31:26 +1100
From: Glen Barnett glnbr...@gmail.com
To: tsippel tsip...@gmail.com
Cc: r-help@r-project.org
Subject: Re: [R] Kolmogorov-smirnov test
Message-ID:
   aanlktikcjigrgjuotkozqfxfqatin6arzjvt_appi...@mail.gmail.com
Content-Type: text/plain; charset=ISO-8859-1

It's designed for continuous distributions. See the first sentence here:

http://en.wikipedia.org/wiki/Kolmogorov%E2%80%93Smirnov_test

K-S is conservative on discrete distributions

On Sat, Feb 19, 2011 at 1:52 PM, tsippel tsip...@gmail.com wrote:
 Is the Kolmogorov-Smirnov test valid on both continuous and discrete data?
 I don't think so, and the example below helped me understand why.

 A suggestion on testing the discrete data would be appreciated.

 Thanks,

-- 
John W. Emerson (Jay)
Associate Professor of Statistics
Department of Statistics
Yale University
http://www.stat.yale.edu/~jay



Re: [R] lm without intercept

2011-02-18 Thread Jay Emerson
No, but this is a cute problem: the definition of R^2 changes without the
intercept, because the empty model used for calculating the total sums of
squares always predicts 0 (so the total sums of squares are the sums of
squares of the observations themselves, without centering around the
sample mean).
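
A quick toy illustration (made-up data) of how dropping the intercept can
inflate R^2:

set.seed(1)
x <- 1:10
y <- 5 + 0.1 * x + rnorm(10)
summary(lm(y ~ x))$r.squared       # total SS centered around mean(y)
summary(lm(y ~ x - 1))$r.squared   # total SS about 0: typically much larger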

Your interpretation of the p-value for the intercept in the first model is
also backwards: 0.9535 is extremely weak evidence against the hypothesis
that the intercept is 0.  That is, the intercept might be near zero, but it
could also be something very different.  With a standard error of 229, your
95% confidence interval for the intercept (if you trusted it based on other
things) would have a margin of error of well over 400.  If you told me that
an intercept of, say, 350 or 400 were consistent with your knowledge of the
problem, I wouldn't blink.

This is a very small data set: if you sent R commands such as:

x <- c(x1, x2, ..., xn)
y <- c(y1, y2, ..., yn)

you might even get some more interesting feedback.  One of the many
good intro stats textbooks might
also be helpful as you get up to speed.

Jay
-
Original post:

Message: 135
Date: Fri, 18 Feb 2011 11:49:41 +0100
From: Jan jrheinlaen...@gmx.de
To: R-help@r-project.org list r-help@r-project.org
Subject: [R] lm without intercept
Message-ID: 1298026181.2847.19.camel@jan-laptop
Content-Type: text/plain; charset=UTF-8

Hi,

I am not a statistics expert, so I have this question. A linear model
gives me the following summary:

Call:
lm(formula = N ~ N_alt)

Residuals:
   Min  1Q  Median  3Q Max
-110.30  -35.80  -22.77   38.07  122.76

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept)  13.5177   229.0764   0.059   0.9535
N_alt         0.2832     0.1501   1.886   0.0739 .
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 56.77 on 20 degrees of freedom
 (16 observations deleted due to missingness)
Multiple R-squared: 0.151, Adjusted R-squared: 0.1086
F-statistic: 3.558 on 1 and 20 DF,  p-value: 0.07386

The regression is not very good (high p-value, low R-squared).
The Pr value for the intercept seems to indicate that it is zero with a
very high probability (95.35%). So I repeat the regression forcing the
intercept to zero:

Call:
lm(formula = N ~ N_alt - 1)

Residuals:
   Min  1Q  Median  3Q Max
-110.11  -36.35  -22.13   38.59  123.23

Coefficients:
      Estimate Std. Error t value Pr(>|t|)
N_alt 0.292046   0.007742   37.72   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 55.41 on 21 degrees of freedom
 (16 observations deleted due to missingness)
Multiple R-squared: 0.9855, Adjusted R-squared: 0.9848
F-statistic:  1423 on 1 and 21 DF,  p-value: < 2.2e-16

1. Is my interpretation correct?
2. Is it possible that just by forcing the intercept to become zero, a
bad regression becomes an extremely good one?
3. Why doesn't lm suggest a value of zero (or near zero) by itself if
the regression is so much better with it?

Please excuse my ignorance.

Jan Rheinländer


-- 
John W. Emerson (Jay)
Associate Professor of Statistics
Department of Statistics
Yale University
http://www.stat.yale.edu/~jay



Re: [R] [Fwd: adding more columns in big.matrix object of bigmemory package]

2010-12-17 Thread Jay Emerson
For good reasons (having to do with avoiding copies of massive things)
we leave such merging to the user: create a new filebacking of the
proper size, and fill it (likely a column at a time, assuming you have
enough RAM to support that).
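
A minimal sketch (hypothetical names and file names), assuming two existing
big.matrix objects a and b with the same number of rows:

library(bigmemory)
z <- filebacked.big.matrix(nrow(a), ncol(a) + ncol(b), type = "double",
                           backingfile = "merged.bin",
                           descriptorfile = "merged.desc")
for (j in 1:ncol(a)) z[, j] <- a[, j]             # copy a, column by column
for (j in 1:ncol(b)) z[, ncol(a) + j] <- b[, j]   # then append b's columns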

Jay

On Fri, Dec 17, 2010 at 2:16 AM, utkarshsinghal
utkarsh.sing...@global-analytics.com wrote:

 Hi,

 With reference to the mail below, I have large datasets, coming from various 
 different sources, which I can read into filebacked big.matrix using library 
 bigmemory. I want to merge them all into one 'big.matrix' object. (Later, I 
 want to run regression using library 'biglm').

 I have been trying unsuccessfully to do this for quite some time now. Can
 you please suggest some way? Am I missing some already available function?

 Even a functionality of the following will work for me:

 Just appending more columns in an existing big.matrix object (not merging).
 If the individual datasets are small enough to be read in usual R, just the 
 combined dataset is huge.

 Any thoughts are welcome.

 Thanks,
 Utkarsh


  Original Message 
 Subject: adding more columns in big.matrix object of bigmemory package
 Date: Thu, 16 Dec 2010 18:29:38 +0530
 From: utkarshsinghal utkarsh.sing...@global-analytics.com
 To: r help r-h...@stat.math.ethz.ch

 Hi all,

 Is there any way I can add more columns to an existing filebacked
 big.matrix object?

 In general, I want a way to modify an existing big.matrix object, i.e., add 
 rows/columns, rename colnames, etc.
 I tried the following:

  library(bigmemory)
 x = read.big.matrix("test.csv", header=T, type="double", shared=T,
                     backingfile="test.backup", descriptorfile="test.desc")
 x[, "v4"] = new
 Error in mmap(j, colnames(x)) :
   Couldn't find a match to one of the arguments.
 (The above functionality is presently there in usual data.frames in R.)


 Thanks in advance,
 Utkarsh




--
John W. Emerson (Jay)
Associate Professor of Statistics
Department of Statistics
Yale University
http://www.stat.yale.edu/~jay



Re: [R] big data and lmer

2010-10-22 Thread Jay Emerson
Though bigmemory, ff, and other big data solutions (databases, etc...)
can help easily manage massive data, their data objects are not
natively compatible with all the advanced functionality of R.
Exceptions include lm and glm (both ff and bigmemory support this via
Lumley's biglm package), kmeans, and perhaps a few other things.  In
many cases, it's just a matter of someone deciding to port a
tool/analysis of interest to one of these different object types -- we
welcome collaborators and would be happy to offer advice if you want
to adapt something for bigmemory structures!

Jay

-- 
John W. Emerson (Jay)
Associate Professor of Statistics
Department of Statistics
Yale University
http://www.stat.yale.edu/~jay



Re: [R] merging and working with big data sets

2010-10-12 Thread Jay Emerson
I can't speak for ff and filehash, but bigmemory's data structure
doesn't allow clever merges (for genuinely good reasons).  However,
it is still probably less painful (and faster) than the other options,
though we don't implement it: we leave it to the user because the details
may vary depending on the example and the code is trivial.

- Allocate an empty new filebacked big.matrix of the proper size.
- Fill it in chunks (typically a column at a time if you can afford
the RAM overhead, or a portion of a column at a time).   Column
operations are more efficient than row operations (again, because of
the internals of the data structure).
- Because you'll be using filebackings, RAM limitations won't matter
other than the overhead of copying each chunk.

I should note: if you used separated=TRUE, each column would have a
separate binary file, and a smart cbind() would be possible simply
by manipulating the descriptor file.  Again, not something we advise
or formally provide, but it wouldn't be hard.

Jay

-- 
John W. Emerson (Jay)
Associate Professor of Statistics
Department of Statistics
Yale University
http://www.stat.yale.edu/~jay



Re: [R] bigmemory doubt

2010-09-08 Thread Jay Emerson
By far the easiest way to achieve this would be to use the bigmemory
C++ structures in your program itself.  However, if you do something
on your own (but fundamentally have a column-major matrix in shared
memory), it should be possible to play around with the pointer with
R/bigmemory to accomplish this, yes.  Feel free to email us directly
for advice.

Jay


 Message: 153
 Date: Wed, 8 Sep 2010 10:52:19 +0530 (IST)
 From: raje...@cse.iitm.ac.in raje...@cse.iitm.ac.in
 To: r-help  r-help@r-project.org
 Subject: [R] bigmemory doubt
 Message-ID:
   1204692515.13855.1283923339865.javamail.r...@mail.cse.iitm.ac.in
 Content-Type: text/plain

 Hi,

 Is it possible for me to read data from shared memory created by a VC++
 program into R using bigmemory?

-- 
John W. Emerson (Jay)
Associate Professor of Statistics
Department of Statistics
Yale University
http://www.stat.yale.edu/~jay



Re: [R] Bigmemory: Error Running Example

2010-08-11 Thread Jay Emerson
It seems very likely you are working on a 32-bit version of R, but it's a
little surprising still that you would have a problem with any single year.
Please tell us the operating system and version of R.  Did you preprocess
the airline CSV file using the utilities provided on bigmemory.org?  If you
didn't, then anything character will be converted to NA.  Is your R
environment empty, or did you have other objects in memory?

It might help to just do some tests yourself:

x <- big.matrix(nrow=100, ncol=10, ... other options ...)

Make sure it works, then increase the size until you get a failure.  This
sort of exercise is extremely helpful in situations like this.

Jay


Subject: [R] Bigmemory: Error Running Example
Message-ID:
   
   aanlktint+xsxiuyainbcstmbdkedtawb--wfccgnr...@mail.gmail.com

Content-Type: text/plain

Hi,

I am trying to run the bigmemory example provided on the
http://www.bigmemory.org/

The example runs on the airline data and generates a summary of the csv
files:

library(bigmemory)
library(biganalytics)
x <- read.big.matrix("2005.csv", type="integer", header=TRUE,
                     backingfile="airline.bin",
                     descriptorfile="airline.desc",
                     extraCols="Age")
summary(x)


This runs fine for the provided csv for year 1987 (size=121MB). However,
for big files like the one for year 2005 (size=639MB), it gives the
following errors:

Error in filebacked.big.matrix(nrow = nrow, ncol = ncol, type = type,  :
 Problem creating filebacked matrix.

Error: object 'x' not found
Error in summary(x) :
 error in evaluating the argument 'object' in selecting a method for
function 'summary'

Here is the output from running the memory.limit() :-
[1] 2047

Here is the output from running the memory.profile() :-

  NULL  symbolpairlist closure environment promise
 19381  3255706477 7443710
  language special builtinchar logical integer
121940 1781600   1506895188981
double complex   character ... anylist
  7983  17   47593   0   04073
 expressionbytecode externalptr weakref raw  S4
 2   0 618 117 1191838


Could anyone who has previously worked with bigmemory throw some light on
this?  Were you able to run the examples successfully?

Thanks in advance.

Harsh Yadav

-- 
John W. Emerson (Jay)
Associate Professor of Statistics
Department of Statistics
Yale University
http://www.stat.yale.edu/~jay



Re: [R] (help) This is an R workspace memory processing question

2010-06-23 Thread Jay Emerson
You should look at packages like ff, bigmemory, RMySQL, and so on.  However,
you should really consider moving to a different platform for large-data
work (Linux, Mac, or Windows 7 64-bit).

Jay


-

This is an R workspace memory processing question.


  Is there a method in R for processing 10GB of data in 500MB units?


  my work environment :

  R version : 2.11.1
  OS : WinXP Pro sp3

-- 
John W. Emerson (Jay)
Associate Professor of Statistics
Department of Statistics
Yale University
http://www.stat.yale.edu/~jay



Re: [R] Parallel computing on Windows (foreach) (Sergey Goriatchev)

2010-06-16 Thread Jay Emerson
foreach (or virtually anything you might use for concurrent programming)
only really makes sense if the work the clients are doing is substantial
enough to overwhelm the communication overhead.  And there are many ways to
accomplish the same task more or less efficiently (for example, doing blocks
of tasks in chunks rather than passing each one as an individual job).
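
A sketch of the difference (hypothetical task and sizes):

library(foreach)
x <- 1:1e6
# fine-grained: one tiny task per element, so communication overhead dominates
# res <- foreach(i = x, .combine = c) %dopar% sqrt(i)
# chunked: a handful of substantial tasks, so the overhead is negligible
chunks <- split(x, cut(seq_along(x), 4))
res <- foreach(ch = chunks, .combine = c) %dopar% sqrt(ch)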

But more to the point, doSNOW works just fine on an SMP, no problem, it
doesn't require a cluster.

Jay


example code omitted

Not only is the sequential foreach much slower than the simple
for-loop (at least in this particular instance), but I am not quite
sure how to make foreach run in parallel. Where would I get this
parallel backend? I looked at doMC and doRedis, but these do not run on
Windows, as far as I understand. And doSNOW is something to use when
you have a cluster, while I have a simple dual-core PC.

It is not really clear to me how to make parallel computing work.
Please help.

Regards,
Sergey

-- 
John W. Emerson (Jay)
Associate Professor of Statistics
Department of Statistics
Yale University
http://www.stat.yale.edu/~jay



[R] [R-pkgs] Bayesian change point package bcp 2.2.0 available

2010-05-17 Thread Jay Emerson
Version 2.2.0 of package bcp is now available.  It replaces the
suggests of NetWorkSpaces (previously used for optional parallel MCMC)
with the dependency on package foreach, giving greater flexibility and
supporting a wider range of parallel backends (see doSNOW, doMC,
etc...).

For those unfamiliar with foreach (thanks to Steve Weston for this
contribution), it's a beautiful and highly portable looping construct
which can run sequentially or in parallel based on the user's actions
(rather than the programmer's choices).  We think other package
authors might want to consider taking advantage of it for tasks that
might be computationally intensive and could be easily done in
parallel.  Some vignettes are available at
http://cran.r-project.org/web/packages/foreach/index.html.

Jay Emerson & Chandra Erdman

(Apologies, the first version of this announcement was not plain-text.)

--
John W. Emerson (Jay)
Associate Professor of Statistics
Department of Statistics
Yale University
http://www.stat.yale.edu/~jay



[R] [R-pkgs] bigmemory 4.2.3

2010-05-17 Thread Jay Emerson
The long-promised revision to bigmemory has arrived, with package
4.2.3 now on CRAN.  The mutexes (locks) have been extracted and will
be available through package synchronicity (on R-Forge, soon to appear
on CRAN).  Initial versions of packages biganalytics and bigtabulate
are on CRAN, and new versions which resolve the warnings and have
streamlined CRAN-friendly configurations will appear shortly.  Package
bigalgebra will remain on R-Forge for the time being as the
user-interface is developed and the configuration possibilities
expand.

For more information, please feel free to email us or visit
http://www.bigmemory.org/.

Jay Emerson & Mike Kane

-- 
John W. Emerson (Jay)
Associate Professor of Statistics
Department of Statistics
Yale University
http://www.stat.yale.edu/~jay



[R] [R-pkgs] Bayesian change point package bcp 2.2.0 available

2010-05-10 Thread Jay Emerson
Version 2.2.0 of package bcp is now available.  It replaces the suggests of
NetWorkSpaces (previously used for optional parallel MCMC) with the
dependency on package foreach, giving greater flexibility and supporting a
wider range of parallel backends (see doSNOW, doMC, etc...).

For those unfamiliar with foreach (thanks to Steve Weston for this
contribution), it's a beautiful and highly portable looping construct which
can run sequentially or in parallel based on the user's actions (rather than
the programmer's choices).  We think other package authors might want to
consider taking advantage of it for tasks that might be computationally
intensive and could be easily done in parallel.  Some vignettes are
available at http://cran.r-project.org/web/packages/foreach/index.html.

Jay Emerson & Chandra Erdman

-- 
John W. Emerson (Jay)
Associate Professor of Statistics
Department of Statistics
Yale University
http://www.stat.yale.edu/~jay



Re: [R] bigmemory package woes

2010-04-24 Thread Jay Emerson
Zerdna,

Please note that the CRAN version 3.12 is about to be replaced by a new
cluster of packages now on R-Forge; we consider the new bigmemory >= 4.0
to be stable and recommend you start using it immediately.  Please see
http://www.bigmemory.org.

In your case, two comments:

(1) Your for() loop will generate three identical copies of filebackings
on disk, yes.  Note that when the loop exits, the R object xx will
reference only the 3rd of these, so xx[1,1] <- 1 will modify only the
third filebacking, not the first two.  You'll need to use the separate
descriptor files (probably created automatically for you, but we
recommend naming them specifically using descriptorfile=) with
attach.big.matrix() for whichever of these you really want to be using.
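
For point (1), a sketch of that workflow (hypothetical file names):

xx <- as.big.matrix(r, backingfile = "r1.bin",
                    descriptorfile = "r1.desc", backingpath = "MyDirName")
rm(xx)
x1 <- attach.big.matrix(file.path("MyDirName", "r1.desc"))
x1[1, 1] <- 1   # now you know which filebacking you are modifying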

(2) In the problem with hanging, I believe you have exhausted the shared
resources on your system.  This problem will no longer arise in the >= 4.0
versions, as we're handling mutexes separately rather than automatically.
These shared resource limits are mysterious, depending on the OS as well
as the hardware and other jobs or tasks in existence at any given point in
time.  But again, it shouldn't be a problem with the new version.

The CRAN update should take place early next week, along with some revised
documentation.

Regards,

Jay

---


Message: 125
Date: Fri, 23 Apr 2010 13:51:32 -0800 (PST)
From: zerdna az...@yahoo.com
To: r-help@r-project.org
Subject: [R] bigmemory package woes
Message-ID: 1272059492009-2062996.p...@n4.nabble.com
Content-Type: text/plain; charset=us-ascii


I have pretty big data sizes, like matrices of .5 to 1.5GB, so once I need
to juggle several of them I am in need of disk cache. I am trying to use
the bigmemory package but am getting problems that are hard to understand:
I am getting seg faults and the machine just hanging. I work, by the way,
on Red Hat Linux, 64 bit R version 10.

Simplest problem is just saving matrices. When I do something like

r <- matrix(rnorm(100), nr=10); library(bigmemory)
for(i in 1:3) xx <- as.big.matrix(r, backingfile=paste("r", i, sep="",
collapse=""), backingpath="MyDirName")

it works just fine -- it saves small matrices as three different matrices
on disc. However, when I try it with real size, like

r <- matrix(rnorm(5000), nr=1000)

I either get a seg fault on saving the third big matrix, or a hang
forever.

Am I doing something obviously wrong, or is it an unstable package at the
moment? Could anyone recommend something similar that is reliable in this
case?


-- 
John W. Emerson (Jay)
Associate Professor of Statistics
Department of Statistics
Yale University
http://www.stat.yale.edu/~jay



Re: [R] Huge data sets and RAM problems

2010-04-20 Thread Jay Emerson
Stella,

A few brief words of advice:

1. Work through your code a line at a time, making sure that each line does
what you would expect.  I think some of your later problems are a result of
something early not being as expected.  For example, if the read.delim() is
in fact not giving you what you expect, stop there before moving onwards.
I suspect some funny character(s) or character encodings might be a problem.

2. 32-bit Windows can be limiting.  With 2 GB of RAM, you're probably not
going to be able to work effectively in native R with objects over 200-300
MB, and the error indicates that something (you or a package you're using)
has simply run out of memory.  So...

3. Consider more RAM (and preferably with 64-bit R).  Other solutions might
be possible, such as using a database to handle the data transition into R.
2.5 million rows by 18 columns is apt to be around 360 MB.  Although you
can afford 1 (or a few) copies of this, it doesn't leave you much room for
the memory overhead of working with such an object.
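
The arithmetic behind that 360 MB estimate (8 bytes per double-precision
value):

2.5e6 * 18 * 8 / 1e6   # 360 (MB, counting 10^6 bytes per MB)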

Part of the original message below.

Jay

-

Message: 80
Date: Mon, 19 Apr 2010 22:07:03 +0200
From: Stella Pachidi stella.pach...@gmail.com
To: r-h...@stat.math.ethz.ch
Subject: [R]  Huge data sets and RAM problems
Message-ID:
   g2j133363581004191307t2a48c1bfqd9d57cf0d6c62...@mail.gmail.com
Content-Type: text/plain; charset=ISO-8859-1

Dear all,



I am using R 2.10.1 in a laptop with Windows 7 - 32bit system, 2GB RAM
and CPU Intel Core Duo 2GHz.

.

Finally, another problem I have is when I perform association mining
on the data set using the package arules: I turn the data frame into a
transactions table and then run the apriori algorithm. When I set the
support too low in order to find the rules I need, the vector of
rules becomes too big and I get problems with the memory, such as:
Error: cannot allocate vector of size 923.1 Mb
In addition: Warning messages:
1: In items(x) : Reached total allocation of 153Mb: see help(memory.size)

Could you please help me with how I could allocate more RAM? Or, do
you think there is a way to process the data by loading them into a
document instead of loading all into RAM? Do you know how I could
manage to read all my data set?

I would really appreciate your help.

Kind regards,
Stella Pachidi


-- 
John W. Emerson (Jay)
Associate Professor of Statistics
Department of Statistics
Yale University
http://www.stat.yale.edu/~jay



Re: [R] large dataset

2010-03-27 Thread Jay Emerson
A little more information would help, such as the number of columns?  I
imagine it must be large, because 100,000 rows isn't overwhelming.  Second,
does the read.csv() fail, or does it work but only after a long time?  And
third, how much RAM do you have available?

R Core provides some guidelines in the Installation and Administration
documentation suggesting that a single object around 10% of your RAM is
reasonable, but beyond that things can become challenging, particularly
once you start working with your data.

There is a wide range of packages to help with large data sets.  For
example, RMySQL supports MySQL databases.  At the other end of the
spectrum, there are possibilities discussed on a nice page by Dirk
Eddelbuettel which you might look at:

http://cran.r-project.org/web/views/HighPerformanceComputing.html

Jay

-- 
John W. Emerson (Jay)
Associate Professor of Statistics
Department of Statistics
Yale University
http://www.stat.yale.edu/~jay

(original message below)
--

Message: 128
Date: Sat, 27 Mar 2010 10:19:33 +0100
From: n\.vial...@libero\.it n.via...@libero.it
To: r-help r-help@r-project.org
Subject: [R] large dataset
Message-ID: kzxokl$991aa2d6c95c3bd9f464c3b32b78b...@libero.it
Content-Type: text/plain; charset=iso-8859-1

Hi, I have a question:
as I'm not able to import a csv file which contains a big dataset (100,000
records), does someone know how many records R can handle without giving
problems?
What I'm facing when I try to import the file is that R generates more
than 100,000 records and is very slow...
thanks a lot!!!



Re: [R] Mosaic plots

2010-03-23 Thread Jay Emerson
As pointed out by others, vcd supports mosaic plots on top of the grid
engine (which is extremely helpful for those of us who love playing around
with grid).  The standard mosaicplot() function is directly available (it
isn't clear if you knew this).  The proper display of names is a real
challenge faced by all of us with these plots, so  you should try each
version.  I'm not sure what you intend to do with a legend, but if you want
the ability to customize and hack code, I suggest you look at grid and a
modification to vcd's version to suit your purposes.

Jay





 Subject: [R] Mosaic Plots
 Message-ID: 1269256874432-1677468.p...@n4.nabble.com
 Content-Type: text/plain; charset=us-ascii


 Hello Everyone

 I want to plot mosaic plots. I have tried them using the iplots package
 (using imosaic). The problem is the names don't get aligned properly; is
 there a way to align the names and provide a legend in mosaic plots using
 R?

 I would also like to know of any other packages with which I can plot
 mosaic plots.


 Thank you in advance
 Sunita
 --


 --
 John W. Emerson (Jay)
 Associate Professor of Statistics
 Department of Statistics
 Yale University
 http://www.stat.yale.edu/~jay




-- 
John W. Emerson (Jay)
Associate Professor of Statistics
Department of Statistics
Yale University
http://www.stat.yale.edu/~jay



Re: [R] question about bigmemory: releasing RAM from a big.matrix that isn't used anymore

2010-02-06 Thread Jay Emerson
 See inline for responses.  But people are always welcome to contact
 us directly.

Hi all,

I'm on a Linux server with 48Gb RAM. I did the following:

x <-
big.matrix(nrow=2,ncol=50,type='short',init=0,dimnames=list(1:2,1:50))
# Gets around the 2^31 issue - yeah!

 We strongly discourage use of dimnames.

in Unix, when I hit the top command, I see R is taking up about 18Gb
RAM, even though the object x is 0 bytes in R. That's fine: that's how
bigmemory is supposed to work I guess. My question is how do I return
that RAM to the system once I don't want to use x any more? E.g.,

rm(x)

then top in Unix, I expect that my RAM footprint is back ~0, but it
remains at 18Gb. How do I return RAM to the system?

 It can take a while for the OS to free up memory, even after a gc().
 But it's available for re-use; if you want to be really sure, have a look
 in /dev/shm to make sure the shared memory segments have been deleted.
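
 A sketch of what that check might look like (assuming a Linux shell):

 rm(x)
 gc()                        # encourage R to release the memory
 system("ls -l /dev/shm")    # the shared-memory segment files should be gone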

Thanks,

Matt

-- 
John W. Emerson (Jay)
Associate Professor of Statistics
Department of Statistics
Yale University
http://www.stat.yale.edu/~jay



Re: [R] Multicore package: sharing/modifying variable accross processes

2009-10-31 Thread Jay Emerson
Renaud,

Package bigmemory can help you with shared-memory matrices, either in RAM or
filebacked.  Mutex support currently exists as part of the package, although
for various reasons will soon be abstracted from the package and provided
via a new package, synchronicity.

bigmemory works beautifully with multicore.  Feel free to email us with
questions, and we appreciate feedback.
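
A minimal sketch (made-up sizes) of the shared-memory pattern; mclapply was
provided by multicore in that era and lives in the parallel package today,
and each worker here writes to a disjoint part of the matrix, so no mutex
is needed:

library(bigmemory)
library(parallel)                  # for mclapply (multicore's descendant)
x <- big.matrix(nrow = 4, ncol = 1, type = "double", init = 0)
desc <- describe(x)
junk <- mclapply(1:4, function(i) {
  y <- attach.big.matrix(desc)     # each worker attaches the shared matrix
  y[i, 1] <- i^2                   # disjoint writes: no lock required
  NULL
})
x[, 1]                             # 1 4 9 16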

Jay



Original message:

Hi,

I want to parallelize some computations when it's possible on multicore
machines.
Each computation produces a big objects that I don't want to store if
not necessary: in the end only the object that best fits my data have to
be returned. In non-parallel mode, a single gloabl object is updated if
the current computation gets a better result than the best previously found.
My plan was to use package multicore. But there is obviously an issue of
concurrent access to the global result variable.
Is there a way to implement something like a lock/mutex to ensure make
the procedure thread safe?
Maybe something already exist to deal with such things?
It looks like package multicore run the different processes in different
environments with copy-on-change of everything when forking. Anybody has
experimented working with a shared environment with package multicore?



-- 
John W. Emerson (Jay)
Associate Professor of Statistics
Department of Statistics
Yale University
http://www.stat.yale.edu/~jay



Re: [R] Estimation in a changepoint regression with R

2009-10-16 Thread Jay Emerson
Package bcp does Bayesian changepoint analysis, though not in the general
regression framework.  The most recent reference is Bioinformatics 24(19),
2143-2148, doi:10.1093/bioinformatics/btn404; slightly older is JSS 23(3).
Both reference some alternatives you might want to consider (including
strucchange, among others).


Jay



Message: 4
Date: Thu, 15 Oct 2009 03:56:22 -0700 (PDT)
From: FMH kagba2...@yahoo.com
Subject: [R] Estimation in a changepoint regression with R
To: r-help@r-project.org
Message-ID: 365399.56401...@web38303.mail.mud.yahoo.com
Content-Type: text/plain; charset=iso-8859-1

Dear All,

I'm trying to do the estimation in a changepoint regression problem
via R, but have never found any suitable function which might help me
to do this.

Could someone give me a hand on this matter?

Thank you.

--
John W. Emerson (Jay)
Associate Professor of Statistics
Department of Statistics
Yale University
http://www.stat.yale.edu/~jay



Re: [R] reading web log file into R

2009-09-23 Thread Jay Emerson
Sebastian,

There is rarely a completely free lunch, but fortunately for us R has
some wonderful tools to make this possible.  R supports regular
expressions with commands like grep(), gsub(), strsplit(), and others
documented on the help pages.  It's just a matter of constructing an
algorithm that does the job.  In your case, for example (though please
note there are probably many different, completely reasonable approaches
in R):

x <- scan(logfilename, what="", sep="\n")

should give you a vector of character strings, one line per element.  Now, lines
containing GET seem to identify interesting lines, so

x <- x[grep("GET", x)]

should trim it to only the interesting lines.  If you want information
from other lines, you'll
have to treat them separately.  Next, you might try

y <- strsplit(x, " ")

which splits each line on spaces, returning a list (one component per
line) of vectors based on the split.  Try it.  If it looks good, you
might check

lapply(y, length)

to see if all lines contain the same number of records.  If so, you
can then get quickly into
a matrix,

z <- matrix(unlist(y), ncol=K, byrow=TRUE)

where K is the common length you just observed.  If you think this is
cool, great!  If not, well...
hire a programmer, or if you're lucky Microsoft or Apache have tools
to help you with this.
There might be something in the Perl/Python world.  Or maybe there's a
package in R designed
just for this, but I encourage students to develop the raw skills...

Jay



-- 
John W. Emerson (Jay)
Associate Professor of Statistics
Department of Statistics
Yale University
http://www.stat.yale.edu/~jay



Re: [R] kmeans.big.matrix

2009-07-22 Thread Jay Emerson
This sort of question is ideal to send directly to the maintainer.

We've removed kmeans.big.matrix for the time being and will place it in a
new package, bigmemoryAnalytics.  bigmemory itself is the core building
block and tool, and we don't want to pollute it with lots of extras.

Allan's point is right: big data packages (like bigmemory and ff) can't
be used directly with R functions (like lm).  And because of R's design you
can't extract subsets with more than 2^31-1 elements, even though the
big.matrix can be as large as you need (with filebacking).

I hope that helps.

Jay

-- 
John W. Emerson (Jay)
Associate Professor of Statistics
Department of Statistics
Yale University
http://www.stat.yale.edu/~jay



Re: [R] Building a big.matrix using foreach

2009-07-19 Thread Jay Emerson
Michael,

If you have a big.matrix, you just want to iterate over the rows.  I'm not
in R and am just making this up on the fly (from a bar in Beijing, if you
believe that):

foreach(i=1:nrow(x), .combine=c) %dopar% f(x[i,])

should work, essentially applying the function f() to the rows of x.  But
perhaps I misunderstand you.  Please feel free to email me or Mike
(michael.k...@yale.edu) directly with questions about bigmemory; we are
very interested in applications of it to real problems.

Note that the package foreach uses package iterators, and is very flexible,
in case you need more general iteration in parallel.
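
A sketch of the row-by-row construction (made-up sizes; the matrix is
filled in place rather than combined from returned pieces, so .combine
isn't needed):

library(bigmemory)
library(foreach)
x <- big.matrix(nrow = 100, ncol = 10, type = "double", init = 0)
xdesc <- describe(x)
junk <- foreach(i = 1:100) %dopar% {
  require(bigmemory)
  m <- attach.big.matrix(xdesc)   # attach within each worker
  m[i, ] <- rnorm(10)             # write row i in place
  NULL
}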

Regards,

Jay



Original message:
Hi there!
I have become a big fan of the 'foreach' package, allowing me to do a
lot of stuff in parallel. For example, evaluating the function f on
all elements in a vector x is easily accomplished:
foreach(i=1:length(x),.combine=c) %dopar% f(x[i])
Here the .combine=c option tells foreach to combine output using the
c() function, that is, to return it as a vector.
Today I discovered the 'bigmemory' package, and I would like to
construct a big.matrix in a parallel fashion, row by row. To use foreach
I see no other way than to come up with a substitute for c in the
.combine option. I have checked out the big.matrix manual, but I can't
find a function suitable for just that.
Actually, I wouldn't even know how to do it for a usual matrix. Any clues?
Thanks!
--
Michael Knudsen
micknud...@gmail.com
http://lifeofknudsen.blogspot.com/



-- 
John W. Emerson (Jay)
Associate Professor of Statistics
Department of Statistics
Yale University
http://www.stat.yale.edu/~jay



[R] [R-pkgs] Major bigmemory revision released.

2009-04-17 Thread Jay Emerson
The re-engineered bigmemory package is now available (Version 3.5
and above) on CRAN.  We strongly recommend you cease using
the older versions at this point.

bigmemory now offers completely platform-independent support for
the big.matrix class in shared memory and, optionally, as filebacked
matrices for larger-than-RAM applications.  We're working on updating
the package vignette, and a draft is available upon request (just send
me an email if you're interested).  The user interface is largely unchanged.

Feedback, bug reports, etc... are welcome.

Jay Emerson & Michael Kane

-- 
John W. Emerson (Jay)
Assistant Professor of Statistics
Department of Statistics
Yale University
http://www.stat.yale.edu/~jay

[[alternative HTML version deleted]]

___
R-packages mailing list
r-packa...@r-project.org
https://stat.ethz.ch/mailman/listinfo/r-packages

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Using very large matrix

2009-03-02 Thread Jay Emerson
Steve et.al.,

The old version is still on CRAN, but I strongly encourage anyone
interested to email me directly and I'll make the new version available.
In fact, I wouldn't mind just pulling the old version off CRAN, but of course
that's not a great idea.  ;-)

Jay


On Mon, Mar 2, 2009 at 8:47 AM,  steve_fried...@nps.gov wrote:
 I'm very interested in the bigmemory package for Windows 32-bit
 environments.  Who do I need to contact to request the Beta version?

 Thanks
 Steve

 Steve Friedman Ph. D.
 Spatial Statistical Analyst
 Everglades and Dry Tortugas National Park
 950 N Krome Ave (3rd Floor)
 Homestead, Florida 33034

 steve_fried...@nps.gov
 Office (305) 224 - 4282
 Fax     (305) 224 - 4147



From: Corrado ct...@york.ac.uk (sent by r-help-boun...@r-project.org)
To: john.emer...@yale.edu, Tony Breyal tony.bre...@googlemail.com
Cc: r-help@r-project.org
Subject: Re: [R] Using very large matrix
Date: 03/02/2009 10:46 AM GMT

 Thanks a lot!

 Unfortunately, the R package I have to use for my research was only
 released for 32 bit R on 32 bit MS Windows, and it is closed source.  I
 normally use 64 bit R on 64 bit Linux.  :)

 I tried to use the bigmemory package on CRAN with 32 bit Windows, but I
 had some serious problems.

 Best,

 On Thursday 26 February 2009 15:43:11 Jay Emerson wrote:
 Corrado,

 Package bigmemory has undergone a major re-engineering and will be
 available soon (available now in Beta version upon request).  The version
 currently on CRAN
 is probably of limited use unless you're on Linux.

 bigmemory may be useful to you for data management, at the very least,
 where

 x <- filebacked.big.matrix(8, 8, init=n, type="double")

 would accomplish what you want using filebacking (disk space) to hold
 the object.
 But even this requires 64-bit R (Linux or Mac, or perhaps a Beta
 version of Windows 64-bit
 R that REvolution Computing is working on).

 Subsequent operations (e.g. extraction of a small portion for analysis)
 are
 then easy enough:

 y <- x[1,]

 would give you the first row of x as an object y in R.  Note that x is
 not itself an R matrix,
 and most existing R analytics can't work on x directly (and would max
 out the RAM if they
 tried, anyway).

 Feel free to email me for more information (and this invitation
 applies to anyone who is
 interested in this).

 Cheers,

 Jay

 #Dear friends,
 #
 #I have to use a very large matrix. Something of the sort of
 #matrix(8,8,n)  where n is something numeric of the sort 0.xx
 #
 #I have not found a way of doing it. I keep getting the error
 #
 #Error in matrix(nrow = 8, ncol = 8, 0.2) : too many elements specified
 #
 #Any suggestions? I have searched the mailing list, but to no avail.
 #
 #Best,
 #--
 #Corrado Topi
 #
 #Global Climate Change & Biodiversity Indicators
 #Area 18,Department of Biology
 #University of York, York, YO10 5YW, UK
 #Phone: + 44 (0) 1904 328645, E-mail: ct...@york.ac.uk


 --
 Corrado Topi

 Global Climate Change & Biodiversity Indicators
 Area 18,Department of Biology
 University of York, York, YO10 5YW, UK
 Phone: + 44 (0) 1904 328645, E-mail: ct...@york.ac.uk

 __
 R-help@r-project.org mailing list
 https://stat.ethz.ch/mailman/listinfo/r-help
 PLEASE do read the posting guide
 http://www.R-project.org/posting-guide.html
 and provide commented, minimal, self-contained, reproducible code.






-- 
John W. Emerson (Jay)
Assistant Professor of Statistics
Department of Statistics
Yale University
http://www.stat.yale.edu/~jay

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Using very large matrix

2009-02-26 Thread Jay Emerson
Corrado,

Package bigmemory has undergone a major re-engineering and will be available
soon (available now in Beta version upon request).  The version
currently on CRAN
is probably of limited use unless you're on Linux.

bigmemory may be useful to you for data management, at the very least, where

x <- filebacked.big.matrix(8, 8, init=n, type="double")

would accomplish what you want using filebacking (disk space) to hold
the object.
But even this requires 64-bit R (Linux or Mac, or perhaps a Beta
version of Windows 64-bit
R that REvolution Computing is working on).

Subsequent operations (e.g. extraction of a small portion for analysis) are then
easy enough:

y <- x[1,]

would give you the first row of x as an object y in R.  Note that x is
not itself an R matrix,
and most existing R analytics can't work on x directly (and would max
out the RAM if they
tried, anyway).
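
A sketch of both steps (not from the original message; the backing file
names are arbitrary, and the companion package biganalytics, which
appeared later, supplies summary statistics that operate on a big.matrix
directly):

library(bigmemory)
library(biganalytics)

x <- filebacked.big.matrix(1e5, 10, type = "double", init = 0.2,
                           backingfile = "x.bin",
                           descriptorfile = "x.desc")
y <- x[1, ]    # one row extracted as an ordinary R vector
colmean(x)     # column means computed without copying x into R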

Feel free to email me for more information (and this invitation
applies to anyone who is
interested in this).

Cheers,

Jay

#Dear friends,
#
#I have to use a very large matrix. Something of the sort of
#matrix(8,8,n)  where n is something numeric of the sort 0.xx
#
#I have not found a way of doing it. I keep getting the error
#
#Error in matrix(nrow = 8, ncol = 8, 0.2) : too many elements specified
#
#Any suggestions? I have searched the mailing list, but to no avail.
#
#Best,
#--
#Corrado Topi
#
#Global Climate Change & Biodiversity Indicators
#Area 18,Department of Biology
#University of York, York, YO10 5YW, UK
#Phone: + 44 (0) 1904 328645, E-mail: ct...@york.ac.uk



-- 
John W. Emerson (Jay)
Assistant Professor of Statistics
Department of Statistics
Yale University
http://www.stat.yale.edu/~jay

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] R package building

2008-05-17 Thread Jay Emerson
I agree with others that the packaging system is generally easy to
use, and between the Writing R Extensions documentation and other
scattered sources (including these lists) there shouldn't be many
obstacles.  Using package.skeleton() is a great way to get started:
I'd recommend just having one data object and one new function in the
session for starters.  You can build up from there.
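
For instance (a sketch; the function, data, and package names are
hypothetical):

# one new function and one data object in the session...
myfun  <- function(x) mean(x, na.rm = TRUE)
mydata <- data.frame(id = 1:5, value = rnorm(5))

# ...then generate a skeleton to edit and build up from
package.skeleton(name = "myPkg", list = c("myfun", "mydata"))
# afterwards, at the shell:
#   R CMD build myPkg
#   R CMD check myPkg_1.0.tar.gz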

I've only run into time-consuming walls on more advanced, obscure
issues.  For example: the Suggests: field in DESCRIPTION generated
quite some debate back in 2005, but until I found that thread in the
email lists I didn't understand the issue.  For completeness, I'll
round out this discussion, hoping I'm correct.  In essence, I think
the choice of the word Suggests: was intended for the package user,
not for the developer.  The user isn't required to have a suggested
package in order to load and use the desired package.  But the
developer is required (when running R CMD check) to have the suggested
package installed in order to avoid warnings or failures.  This does, actually,
make sense, because we assume a developer would want/need to check
features that involve the suggested package.  In a few isolated cases
(I think I had one of them), this caused a problem, where a desired
suggested package isn't distributed by CRAN on all platforms, so I
would risk getting into trouble with R CMD check on the platform
without the suggested package.  But this is pretty obscure, and the
issue was obviously well-debated in the past.  The addition of a line
or two about this in Writing R Extensions would be friendly (the
current content is correct and minimally sufficient, I believe).  Maybe I
should draft this and submit it to the group.

Secondly, I would advise a newbie to the packaging system to avoid S4
at first.  Ultimately, I think it's pretty cool.  But, for example,
guidance on proper documentation (to handle the S4 man pages
correctly) has puzzled me, and even though I can create a package with
S4 that passes R CMD check cleanly, I'm not convinced I've got it
quite right.  If someone has recently created more documentation or a
'white paper' on this, please do spread the word.
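
For what it's worth, the bare bones of an S4 class and method (a generic
sketch with made-up names); the corresponding man page then needs aliases
such as \alias{width,Interval-method}:

setClass("Interval", representation(lo = "numeric", hi = "numeric"))
setGeneric("width", function(x) standardGeneric("width"))
setMethod("width", "Interval", function(x) x@hi - x@lo)

i <- new("Interval", lo = 1, hi = 3)
width(i)   # 2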

Thanks to all who have worked -- and continue to work -- on the system!

Jay

Subject: [R] R package building

In a few days I'll give a talk on R package development and my
personal experience, at the 3rd Free / Libre / Open Source Software
(FLOSS) Conference, which will take place on May 27th & 28th 2008 at
the National Technical University of Athens, in Greece.

I would appreciate it if you could share your thoughts with me: what
are today's obstacles to R package building, in your opinion and
personal experience?

Thanks,
--
Angelos I. Markos
Scientific Associate, Dpt of Exact Sciences, TEI of Thessaloniki, GR
I'm not an outlier; I just haven't found my distribution yet



-- 
John W. Emerson (Jay)
Assistant Professor of Statistics
Director of Graduate Studies
Department of Statistics
Yale University
http://www.stat.yale.edu/~jay

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] R on a computer cluster

2008-02-17 Thread Jay Emerson
Gabriele,

In addition to the suggestions from Markus (below), there is
NetWorkSpaces (package nws).  I have used both nws and snow together
with a package I'm developing (bigmemoRy) which allocates matrices to
shared memory (helping avoid the bottleneck Markus alluded to for
processors on the same computer).  Both seem quite easy to use,
essentially only needing one command to initiate the cluster and
then one command to do something like apply() in parallel.  It takes a
little planning of your application, but the painfully obvious
parallel problem should be painless to implement.
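
To make "one command plus one command" concrete, a snow sketch
(illustrative only, not from the original message; four socket workers on
one machine, applying sum() over the rows of a matrix):

library(snow)

cl  <- makeCluster(4, type = "SOCK")   # one command to initiate the cluster
M   <- matrix(rnorm(1e6), nrow = 1000)
res <- parApply(cl, M, 1, sum)         # one command: apply() in parallel
stopCluster(cl)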

Jay




Hi,

your required performance is strongly depending on your application.
If you talk about a cluster, you should think about several computers.
Not only one computer with several processors.

If you have several computers, first of all you have to decide on a
communication protocol for parallel computing: MPI, PVM, ...
Then you have to install it on your computers. I think you should use
MPI and one of its implementations: Open MPI, LAM/MPI.
Then there are several R packages for using the communication protocols:
Rmpi, snow, Rpvm, ...

If you have one computer with several processors, you can do the same
things. But then you have only shared memory (a bottleneck) and there is
not too much improvement in performance. R is not yet implemented for
multiple processors. There is a first, experimental R package using
OpenMP for multithreading: pnmath
(http://www.stat.uiowa.edu/~luke/R/experimental/)

Some useful links:
http://www.stats.uwo.ca/faculty/yu/Rmpi/
http://ace.acadiau.ca/math/ACMMaC/Rmpi/
http://www.open-mpi.org/
http://www.personal.leeds.ac.uk/~bgy1mm/MPITutorial/MPIHome.html

Best regards
Markus

[EMAIL PROTECTED] wrote:
 Dear all,

 I usually run R on my laptop with Windows XP Professional.
 Now I really want to run R on a computer cluster (4 processors) with
 Suse Linux Enterprise ver. 10.  But I am new to computer clusters.

 Should I modify my functions in order to exploit the greater
 performance and availability compared with my laptop?

 Is there any R manual on parallel computation on multiple processors?
 Any suggestions for a basic tutorial on this topic?

 Thank you.



-- 
John W. Emerson (Jay)
Assistant Professor of Statistics
Director of Graduate Studies
Department of Statistics
Yale University
http://www.stat.yale.edu/~jay
REvolution Computing, Statistical Consultant

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Memory problem?

2008-01-31 Thread Jay Emerson
Elena,

Page 23 of the R Installation Guide provides some memory guidelines
that you might find helpful.

There are a few things you could try using R, at least to get up and running:

- Look at fewer tumors at a time using standard R as you have been.
- Look at the ff package, which leaves the data in flat files with
memory mapped pages.
- It may be that package filehash does something similar using a
database (I'm less familiar with this).
- Wait for the upcoming bigmemoRy package, which is designed
to place large objects like this in RAM (using C++) but gives you a
close-to-seamless interaction with it from R.  Caveat below.

With any of these options, you are still very much restricted by the
type of analysis you are attempting.  Almost any existing procedure
(e.g. a Cox model) would need a regular R object (probably impossible)
and you are back to square one.  An exception to this is Thomas
Lumley's biglm package, which processes the data in chunks.  We need
more tools like these.  Ultimately, you'll need to find some method of
analysis that is pretty smart memory-wise, and this may not be easy.
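
As an illustration of the chunked approach with biglm (a sketch only; the
file name, column names, and chunk size are made up):

library(biglm)

cols  <- c("relapse", "gene1", "gene2")
chunk <- read.csv("expression.csv", nrows = 5000)
fit   <- biglm(relapse ~ gene1 + gene2, data = chunk)

# feed the remaining rows through in chunks; memory use stays flat
chunk <- read.csv("expression.csv", skip = 5001, nrows = 5000,
                  header = FALSE, col.names = cols)
fit   <- update(fit, chunk)
coef(fit)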

Best of luck,

Jay

-
Original message:

I am trying to run a Cox model for the prediction of relapse of 80 cancer
tumors, taking into account the expression of 17000 genes. The data are
large and I get an error:
"cannot allocate vector of size 2.4 Mb". I increased memory.limit to 4000
(which is the largest supported by my computer) but I still get the
error because of other big variables that I have in the workspace. Does
anyone know how to overcome this problem?

Many thanks in advance,
Eleni


-- 
John W. Emerson (Jay)
Assistant Professor of Statistics
Director of Graduate Studies
Department of Statistics
Yale University
http://www.stat.yale.edu/~jay
Statistical Consultant, REvolution Computing

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.