> 
> Mike,
> 
> this is all nice, but AFAICS the first part misses the point that there is no 
> 64-bit integer type in the API so there is simply no alternative at the 
> moment. You just said that you don't like it, but you failed to provide a 
> solution ... As for the second part, the idea is not new and is noble, but 
> AFAIK no one so far was able to draft any good proposal as of what the API 
> would look like. It would be very desirable if someone did, though. (BTW your 
> link is useless - linking google searches is pointless as the results vary by 
> request location, user setting etc.).

I guess in reverse order: the google link was intended as a convenience for
those interested, since I could not find a specific link and didn't expect much
spam in the results ( "it's all good" ). So the results may not be predictable,
but, much like floating point, they should be close enough for the curious
analyst.
I'm not trying to provide a solution until I understand the problem. 

There are many issues with "big data" and I'll try to explain my concerns but 
they require
talking about them in a bit of an integrated way to see how they relate  and to 
see if my
understandings are correct about R ( before I dig into it, want to look for the 
right things). 


The 32-bit int still has a cardinality in the multi-gigs, and there are issues
of index range versus memory size. A typical data frame may point to thousands
of rows with many columns of mixed type, none holding less than 4 bytes of
content. So simply in terms of using up physical memory, I would not think the
32-bit index is the limitation; certainly a square array already amounts to a
64-bit pointer to a given element (two 32-bit indices, LOL). An arbitrary-size
frame, up to the limits of the indexing, could easily exceed physical memory,
but as I understand it R can bomb at that point, or even with VM have speed
issues.
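
Just to put rough numbers on that (back-of-the-envelope only, assuming a single
numeric column filled out to the current index limit):

    max_index  <- 2^31 - 1            # largest index a 32-bit signed int can hold
    bytes_each <- 8                   # storage for one double
    max_index * bytes_each / 2^30     # roughly 16 GiB for one column alone

so physical memory tends to run out well before the index does.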

Simply being able to select the storage order could be a big deal depending on
the access pattern: by rows, by columns, bit-reversed, etc. This could prevent
VM thrashing well before you hit a 32-bit API limit, and could be transparent
beyond adding a new constructor method. And if you have many large operands,
you may want to tell a given data.frame subclass to ONLY keep so much in
physical memory at a time. Resource contention and starvation, everything
fighting for food (data), can be a bottleneck.


data.frame( storage="bit_reversed", physical_mem_limit="some absolute or 
relative thing here").




In any case, you could imagine adding something like a paging method to a
32-bit API, for example, that would be transparent to small data sets, although
I'd have to give it some thought. This would only make sense in cases where
accesses tend to occur in blocks, but that could cover a lot of situations.
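
A minimal sketch of that page-mapping idea, assuming rows have been spilled to
per-page .rds files (the file naming, page size, and get_row helper are all
invented here, not anything R provides):

    page_size  <- 1e6L                          # rows kept per page
    page_cache <- new.env()                     # pages already loaded from disk

    get_row <- function(i) {
      page <- (i - 1L) %/% page_size + 1L       # which page holds row i
      off  <- (i - 1L) %%  page_size + 1L       # offset within that page
      key  <- as.character(page)
      if (!exists(key, envir = page_cache))     # touch disk only on first access
        assign(key, readRDS(sprintf("page_%d.rds", page)), envir = page_cache)
      get(key, envir = page_cache)[off, ]
    }

Small data sets fit in a single page, so access stays transparent; block-wise
access patterns keep the same page hot instead of thrashing.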

I guess I can look at the bigmemory and related packages for some idea of what
is going on here.

For purely sequential access I guess I was looking for some kind of streaming
data source, and then anything related to size should be well contained.
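
Something along those lines can already be approximated with plain connections;
a rough sketch (the file name, chunk size, and the "update running sums" step
are placeholders):

    con  <- file("big.csv", open = "r")
    cols <- strsplit(readLines(con, n = 1L), ",")[[1]]   # header row
    chunk_rows <- 10000L
    repeat {
      chunk <- tryCatch(read.csv(con, header = FALSE, nrows = chunk_rows,
                                 col.names = cols),
                        error = function(e) NULL)        # NULL once input runs out
      if (is.null(chunk) || nrow(chunk) == 0L) break
      ## ... update running sums, fit by block, etc. ...
      if (nrow(chunk) < chunk_rows) break                # short chunk: end of file
    }
    close(con)

Only one chunk is ever resident, so the 32-bit index and physical memory limits
never come into play for purely sequential work.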








> 
> Cheers,
> Simon
> 
> 
> On Jun 21, 2011, at 6:33 AM, Mike Marchywka wrote:
> 
> > Thanks,
> > 
> > http://cran.r-project.org/doc/manuals/R-ints.html#Future-directions
> > 
> > Normally I'd take more time to digest these things before commenting but
> > a few things struck me right away. First, use of floating point or double 
> > as a replacement for int strikes me as "going the wrong way" as often
> > to get predictable performance you try to tell the compiler you have
> > ints rather than any floating type for which it is free to "round."  This
> > is even ignoring any performance issue. The other thing is that scaling
> > should not just be an issue of "make everything bigger" as the growth in
> > both data needs and computer resources is not uniform. 
> > 
> > I guess my first thought to these constraints and resource issues
> > is to consider a paged dataframe depending upon the point at which
> > the 32-bit int constraint is imposed. A random access data struct 
> > does not always get accessed randomly, and often it is purely sequential.
> > Further down the road, it would be nice if algorithms were implemented in a
> > block mode or could communicate their access patterns to the ds or
> > at least tell it to prefetch things that should be needed soon. 
> > 
> > I guess I'm thinking mostly along the lines of things I've seen from Intel
> > such as ( first things I could find on their site as I have not looked in 
> > detail
> > in quite a while),
> > 
> > 
> > http://www.google.com/search?hl=en&source=hp&q=site%3Aintel.com+performance+optimization
> > 
> > as once you get around thrashing virtual memory, you'd like to preserve the
> > lower level memory cache hit rates too etc. These are probably not just 
> > niceties, 
> > at least with VM, as personally I've seen impl related speed issues make 
> > simple analyses impractical.
> > 
> >> Subject: RE: arbitrary size data frame or other structs, curious about 
> >> issues involved.
> >> From: jayemer...@gmail.com
> >> To: marchy...@hotmail.com; r-devel@r-project.org
> >> 
> >> Mike,
> >> 
> >> 
> >> Neither bigmemory nor ff are "drop in" solutions -- though useful,
> >> they are primarily for data storage and management and allowing
> >> convenient access to subsets of the data.  Direct analysis of the full
> >> objects via most R functions is not possible.  There are many issues
> >> that could be discussed here (and have, previously), including the use
> >> of 32-bit integer indexing.  There is a nice section "Future
> >> Directions" in the R Internals manual that you might want to look at.
> >> 
> >> Jay
> >> 
> >> 
> >> -------------------------------------  Original message:
> >> 
> >> We keep getting questions on r-help about memory limits  and
> >> I was curious to know what issues are involved in making
> >> common classes like dataframe work with disk and intelligent
> >> swapping? That is, sure you can always rely on OS for VM
> >> but in theory it should be possible to make a data structure
> >> that somehow knows what pieces you will access next and
> >> can keep those somewhere fast. Now of course algorithms
> >> "should" act locally and be block oriented but in any case
> >> could communicate with data structures on upcoming
> >> access patterns, see a few ms into the future and have the
> >> right stuff prefetched.
> >> 
> >> I think things like "bigmemory" exist but perhaps one
> >> issue was that this could not just drop in for data.frame
> >> or does it already solve all the problems?
> >> 
> >> Is memory management just a non-issue or is there something
> >> that needs to be done  to make large data structures work well?
> >> 
> >> 
> >> -- 
> >> John W. Emerson (Jay)
> >> Associate Professor of Statistics
> >> Department of Statistics
> >> Yale University
> >> http://www.stat.yale.edu/~jay
> 
                                          