RE: [julia-stats] DataFrame and Memory Limitations

David Anthoff Thu, 29 Sep 2016 14:59:58 -0700

Yes, at least in theory it should be possible to e.g. load a very large CSV 
file with CSV.jl, transform it with Query.jl and then feed it into 
OnlineStats.jl. I think the architecture of all three packages should be such 
that this could work with a dataset that is larger than memory. In practice I 
don't think anyone has tried, and I'm sure we would run into things that need 
fixing, but I can't think of any basic design decision in any of these 
packages that would prevent this kind of thing in principle.
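
To make the read / transform / accumulate idea concrete, here is a minimal sketch in Python, used purely as a neutral illustration. It does not use the CSV.jl, Query.jl or OnlineStats.jl APIs. The names `RunningStats`, `read_rows` and `transform`, the column names and the sample data are all hypothetical. The accumulator uses Welford's online algorithm, which is one common way to compute a running mean and variance; it is not a claim about how any of those packages is implemented.

```python
import csv
import io
import math


class RunningStats:
    """Welford's online algorithm: mean and variance in O(1) memory."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self._m2 = 0.0  # sum of squared deviations from the current mean

    def fit(self, x):
        """Update the running statistics with one observation."""
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self._m2 += delta * (x - self.mean)

    @property
    def variance(self):
        # Sample variance; undefined for fewer than two observations.
        return self._m2 / (self.n - 1) if self.n > 1 else math.nan


def read_rows(handle):
    """Yield one parsed row at a time instead of materialising the table."""
    yield from csv.DictReader(handle)


def transform(rows):
    """Lazy filter/map step, standing in for a query over the stream."""
    for row in rows:
        if row["group"] == "a":
            yield float(row["value"]) * 2.0


# A tiny in-memory stand-in for a CSV file that would not fit in memory.
source = io.StringIO("group,value\na,1.0\nb,10.0\na,2.0\na,3.0\n")

stats = RunningStats()
for x in transform(read_rows(source)):
    stats.fit(x)
```

Because every stage is a generator, only one row is held at a time, so memory use does not depend on the number of rows. Replacing `source` with an open file handle would leave the rest of the pipeline unchanged.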


There is a general question of the core interop type for these things. Right 
now things like regression packages mostly expect a DataFrame. But we could 
imagine a world where these packages expected a more generic type. I think 
right now there are a bunch of potential options out there: both DataStreams 
and Query define their own streaming interfaces for tabular data (in the case 
of Query it is just a normal Julia iterator that returns NamedTuple elements). 
DataStreams in addition defines a column based interface that might be much 
faster when the dataset actually fits into memory (pure speculation on my end). 
I think there are also a bunch of attempts out there to define something like 
an abstract table structure, but I'm not sure to what extent they would enable 
a streaming data story.
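
The difference between a row-iterator interface and a column-based one can be sketched roughly as follows. This is again Python for illustration only, and it does not reproduce the DataStreams or Query interfaces. `Row`, `stream_rows`, `column_mean` and `column_chunks` are hypothetical names. The point is that a consumer written against "any iterable of named tuples" works the same on an in-memory table and on a stream, and that a column-oriented view can be layered on top in batches.

```python
from collections import namedtuple
from itertools import islice

Row = namedtuple("Row", ["x", "y"])


def stream_rows(n):
    """Hypothetical streaming source: rows are produced one at a time."""
    for i in range(n):
        yield Row(x=i, y=2.0 * i)


def column_mean(rows, name):
    """Consumer that relies only on 'an iterable of named tuples'.

    Assumes at least one row; an empty input raises ZeroDivisionError.
    """
    total, count = 0.0, 0
    for row in rows:
        total += getattr(row, name)
        count += 1
    return total / count


def column_chunks(rows, size):
    """Column-oriented view: yield batches as a dict of column lists."""
    it = iter(rows)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield {f: [getattr(r, f) for r in batch] for f in batch[0]._fields}


in_memory = list(stream_rows(5))                     # table that fits in memory
mean_from_table = column_mean(in_memory, "y")
mean_from_stream = column_mean(stream_rows(5), "y")  # never fully materialised
chunks = list(column_chunks(stream_rows(5), 2))
```

A package such as a regression routine that accepted only the row-iterator contract would not need to know which of the two sources it was given.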

> -----Original Message-----
> From: [email protected] [mailto:[email protected]]
> On Behalf Of Milan Bouchet-Valat
> Sent: Thursday, September 29, 2016 1:33 AM
> To: [email protected]
> Subject: Re: [julia-stats] DataFrame and Memory Limitations
> 
> We're not completely there yet, but with Query.jl and StructuredQueries.jl,
> combined with JuliaDB/JuliaData packages, one should be able to work on
> out-of-memory data sets as efficiently as (or more efficiently than) e.g.
> SAS. The high-level API is the same whether you work on a DataFrame or on
> an external database.
> 
> There's also OnlineStats.jl for computing statistics without loading the full
> data set in memory at once.
> 
> 
> Regards
> 
> 
> On Wednesday, 28 September 2016 at 15:48 -0700, Juan wrote:
> > Yes, but you can only do simple things such as summaries, or use the
> > functions implemented in those special packages. So far you can do linear
> > regression, but you can't do more complex things such as mixed-effects
> > regression, or use Stan or any other generic Bayesian package.
> > The same goes for Spark: you can only use predefined functions, very
> > simple ones, or create your own by hand, but it is very difficult to
> > program something like lme4 from scratch.
> >
> > > Hi, I don't know Julia, but in R you don't need to load all data into
> > > memory; just like in SAS, you can read off disk. Both the proprietary
> > > Revolution Analytics R (which I think works with Hortonworks/Cloudera,
> > > Hadoop and Yarn, and which I saw demonstrated at the R user group here
> > > in Ottawa, Canada), along with Revolution R's other proprietary
> > > methods, and bigmemory
> > > http://cran.r-project.org/web/packages/bigmemory/index.html
> > > and http://www.bigmemory.org/ can handle more data. (I don't know
> > > whether there is a Julia package for Yarn. I know little about Hadoop
> > > and Yarn, and am not really interested in Java, so I suggest you
> > > contact someone at Hortonworks or Revolution R.) Here is a discussion
> > > on large data:
> > > https://groups.google.com/forum/#!topic/julia-stats/eqYT85_vUlg
> > > Regards,
> > > Ramesh
> > >
> > >
> > > On Tue, Aug 5, 2014 at 10:42 AM, Michael Smith <[email protected]> wrote:
> > > > All,
> > > >
> > > > Are there currently any solutions in Julia for handling
> > > > larger-than-memory datasets in a way similar to how you work with
> > > > a DataFrame?
> > > >
> > > > The reason I'm asking is that R has the limitation that you need
> > > > to fit all your data into memory. On the other hand, SAS (while
> > > > being quite different) does not have this limitation.
> > > >
> > > > In the age of "big data" this can be quite an advantage.
> > > >
> > > > Of course, you can "patch" this situation, e.g. in R you can use
> > > > the ff or bigmemory packages, or use SQL.
> > > >
> > > > But my point is that it is bolted on, and you need to spend extra
> > > > mental loops switching between, say, data.frame and ff, instead of
> > > > focusing on your data problem at hand. This is a clear advantage
> > > > of SAS, where you don't have to do that. So I'm wondering how this
> > > > is handled in Julia.
> > > >
> > > > Thanks,
> > > >
> > > > M
> > > >
> > > > P.S.: I do not intend to start a flame war, e.g. whether R or SAS
> > > > or Julia is better. I'm just interested to find out whether such a
> > > > solution exists in Julia (I haven't found any, but maybe I
> > > > overlooked something).
> > > > And if no such solution exists, given that Julia is still young,
> > > > evolving, and malleable (in a positive sense), it might make sense
> > > > to think about it.
> > > >
> > > > --
> > > > You received this message because you are subscribed to the Google
> > > > Groups "julia-stats" group.
> > > > To unsubscribe from this group and stop receiving emails from it,
> > > > send an email to [email protected].
> > > > For more options, visit https://groups.google.com/d/optout.
> > > >
> > >
> > >
> 

