RE: [julia-stats] DataFrame and Memory Limitations

David Anthoff Thu, 29 Sep 2016 15:04:28 -0700

Microsoft at some point had DryadLINQ, which allowed one to run LINQ queries on 
a distributed cluster. Given that Query.jl is modeled very much after LINQ I’m 
sure one could write a query provider for Query.jl that did something similar, 
i.e. maybe a front-end for Dagger.jl that translates Query.jl queries into 
dagger computations. Having said that, I have not looked into any of these in 
detail, so what I just wrote might be completely off. Would certainly be a fun 
project for someone, though!
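
To make the "query provider" idea a bit more concrete, here is a minimal sketch in Python. It is not the Query.jl, Dagger.jl, or DryadLINQ API; the `LazyQuery` class and its `where`/`select`/`execute` methods are hypothetical names invented for illustration. The point is only that a query can be recorded as a list of steps and then replayed independently on each chunk of data, which is the part a distributed back end could farm out to workers.

```python
from typing import Callable, Iterable, Iterator, List


class LazyQuery:
    """Hypothetical lazy query: records steps, runs them chunk by chunk."""

    def __init__(self, chunks: Iterable[List[dict]], steps=None):
        self._chunks = chunks
        self._steps = steps or []

    def where(self, predicate: Callable[[dict], bool]) -> "LazyQuery":
        # Nothing is computed here; the step is only recorded.
        return LazyQuery(self._chunks, self._steps + [("where", predicate)])

    def select(self, projection: Callable[[dict], object]) -> "LazyQuery":
        return LazyQuery(self._chunks, self._steps + [("select", projection)])

    def _run_chunk(self, chunk: List[dict]) -> Iterator[object]:
        # Replay the recorded steps on a single chunk. In a distributed
        # setting, this is the unit of work one would send to a worker.
        rows = iter(chunk)
        for kind, fn in self._steps:
            rows = filter(fn, rows) if kind == "where" else map(fn, rows)
        return rows

    def execute(self) -> Iterator[object]:
        # Only one chunk needs to be held in memory at a time.
        for chunk in self._chunks:
            yield from self._run_chunk(chunk)


# Two small in-memory chunks stand in for partitions of a large table.
chunks = [
    [{"id": 1, "x": 10}, {"id": 2, "x": 25}],
    [{"id": 3, "x": 40}],
]
query = LazyQuery(chunks).where(lambda r: r["x"] > 20).select(lambda r: r["id"])
result = list(query.execute())
```

Because the steps are stored rather than executed immediately, the same query description could in principle be handed to a different back end without changing the user-facing code.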


 

From: [email protected] [mailto:[email protected]] On 
Behalf Of dalinkman
Sent: Thursday, September 29, 2016 7:53 AM
To: julia-stats <[email protected]>
Subject: Re: [julia-stats] DataFrame and Memory Limitations

 

What about using a tuple of distributed vectors/arrays as a table subclass, or 
using Dagger.jl for an out-of-core lazy array?

Then it can be loaded into a distributed array for linear algebra. 

On Thursday, September 29, 2016 at 4:33:21 AM UTC-4, Milan Bouchet-Valat wrote:

We're not completely there yet, but with Query.jl and 
StructuredQueries.jl, combined with JuliaDB/JuliaData packages, one 
should be able to work on larger-than-memory data sets as efficiently as 
(or more efficiently than) e.g. SAS. The high-level API is the same 
whether you work on a DataFrame or on an external database. 

There's also OnlineStats.jl for computing statistics without loading 
the full data set in memory at once. 
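
As an illustration of what "computing statistics without loading the full data set" means, here is a minimal sketch in Python of a running mean and variance using Welford's algorithm. This is not the OnlineStats.jl API; the `RunningStats` class and the `stream_values` helper are hypothetical names, and the nested lists stand in for chunks read from disk.

```python
from dataclasses import dataclass


@dataclass
class RunningStats:
    """Running mean and sample variance via Welford's algorithm."""
    n: int = 0
    mean: float = 0.0
    m2: float = 0.0  # sum of squared deviations from the current mean

    def update(self, x: float) -> None:
        # Each observation is folded into the summary and then discarded.
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    @property
    def variance(self) -> float:
        # Sample variance; undefined for fewer than two observations.
        return self.m2 / (self.n - 1) if self.n > 1 else float("nan")


def stream_values(chunks):
    """Yield values one at a time from an iterable of chunks (hypothetical source)."""
    for chunk in chunks:
        yield from chunk


stats = RunningStats()
for value in stream_values([[1.0, 2.0], [3.0, 4.0], [5.0]]):
    stats.update(value)
```

Memory use is constant regardless of how many values are streamed, since only the count, the mean, and the sum of squared deviations are kept.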


Regards 


On Wednesday, September 28, 2016 at 15:48 -0700, Juan wrote: 
> Yes, but you can only do simple things such as summaries, or use functions 
> implemented in those special packages. So far you can do linear regression, 
> but you can't do more complex things such as mixed-effects regression, or use 
> Stan or any other generic Bayesian package. 
> The same goes for Spark: you can only use predefined functions, which are very 
> simple, or write your own by hand, and it is very difficult to program 
> something like lme4 from scratch. 
> 
> > Hi, I don't know Julia, but in R you don't need to load all data into 
> > memory: just like SAS, you can read off disk. The proprietary Revolution 
> > Analytics R, which I think works with Hortonworks/Cloudera, Hadoop and 
> > Yarn and which I saw demonstrated at the R User Group here in Ottawa, 
> > Canada, as well as Revolution R's other proprietary methods and bigmemory 
> > (http://cran.r-project.org/web/packages/bigmemory/index.html and 
> > http://www.bigmemory.org/), can handle more data. I don't know if there is 
> > a Julia package for Yarn; I know little of Hadoop and Yarn (and am not 
> > really interested in Java), so I suggest you contact someone at 
> > Hortonworks or Revolution R. Here is a discussion on large data: 
> > https://groups.google.com/forum/#!topic/julia-stats/eqYT85_vUlg 
> > Regards, 
> > Ramesh 
> > 
> > 
> > On Tue, Aug 5, 2014 at 10:42 AM, Michael Smith <[email protected]> wrote: 
> > > All, 
> > > 
> > > Are there currently any solutions in Julia to handle larger-than-memory 
> > > datasets in a way similar to how you work with a DataFrame? 
> > > 
> > > The reason I'm asking is that R has the limitation that you need to fit 
> > > all your data into memory. On the other hand, SAS (while being quite 
> > > different) does not have this limitation. 
> > > 
> > > In the age of "big data" this can be quite an advantage. 
> > > 
> > > Of course, you can "patch" this situation, e.g. in R you can use the ff 
> > > or bigmemory packages, or use SQL. 
> > > 
> > > But my point is that it is bolted on, and you need to spend extra mental 
> > > loops switching between, say, data.frame and ff, instead of focusing on 
> > > your data problem at hand. This is a clear advantage of SAS, where you 
> > > don't have to do that. So I'm wondering how this is handled in Julia. 
> > > 
> > > Thanks, 
> > > 
> > > M 
> > > 
> > > P.S.: I do not intend to start a flame war, e.g. whether R or SAS or 
> > > Julia is better. I'm just interested to find out whether such a solution 
> > > exists in Julia (I haven't found any, but maybe I overlooked something). 
> > > And if no such solution exists, given that Julia is still young, 
> > > evolving, and malleable (in a positive sense), it might make sense to 
> > > think about it. 
> > > 
> > > -- 
> > > You received this message because you are subscribed to the Google Groups 
> > > "julia-stats" group. 
> > > To unsubscribe from this group and stop receiving emails from it, 
> > > send an email to [email protected]. 
> > > For more options, visit https://groups.google.com/d/optout. 
> > > 
> > 
> > 

