Re: [julia-users] Re: Large Data Sets in Julia

Tim Holy Fri, 06 Nov 2015 02:21:30 -0800

Not sure if it's as high-level as you're hoping for, but Julia has great 
support for arrays that are much bigger than memory. See Mmap.mmap and 
SharedArray(filename, T, dims).
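
A minimal sketch of the memory-mapping approach, not a definitive recipe.
It assumes a recent Julia (1.x) with the Mmap standard library; the file
path, element type, and dimensions are made up for illustration. It
creates a file-backed matrix, fills it, then maps it again read-only and
reduces it one column at a time, so only the pages being touched need to
be resident in memory.

```julia
using Mmap

# Hypothetical sizes and a throwaway file, chosen only for illustration.
nrows, ncols = 1_000, 50
path = tempname()

# Create the backing file and map it as a writable Float64 matrix.
# With a file opened "w+", mmap grows the file to the requested size.
io = open(path, "w+")
A = Mmap.mmap(io, Matrix{Float64}, (nrows, ncols))
for j in 1:ncols
    A[:, j] .= j          # column j holds the value j in every row
end
Mmap.sync!(A)             # flush the in-memory changes to disk
close(io)

# Map the existing file again (read-only, since the file already exists)
# and process it column by column instead of loading it all at once.
B = Mmap.mmap(path, Matrix{Float64}, (nrows, ncols))
colsums = [sum(view(B, :, j)) for j in 1:ncols]
```

The same column loop should work unchanged when the file is far larger
than RAM, because the operating system pages data in on demand. In
current Julia the SharedArray constructor is spelled
SharedArray{T}(filename, dims) and lives in the SharedArrays standard
library; the form quoted above is the older spelling.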


--Tim

On Thursday, November 05, 2015 06:33:52 PM André Lage wrote:
> hi Viral,
> 
> Do you have any news on this?
> 
> André Lage.
> 
> On Wednesday, July 3, 2013 at 5:12:06 AM UTC-3, Viral Shah wrote:
> > Hi all,
> > 
> > I am cross-posting my reply to julia-stats and julia-users as there was a
> > separate post on large logistic regressions on julia-users too.
> > 
> > Just as these questions came up, Tanmay and I have been chatting about a
> > general framework for working on problems that are too large to fit in
> > memory, or need parallelism for performance. The idea is simple and based
> > on providing a convenient and generic way to break up a problem into
> > subproblems, each of which can then be scheduled to run anywhere. To start
> > with, we will implement a map and mapreduce using this, and we hope that it
> > should be able to handle large files sequentially, distributed data
> > in-memory, and distributed filesystems within the same framework. Of
> > course, this all sounds too good to be true. We are trying out a simple
> > implementation, and if early results are promising, we can have a detailed
> > discussion on API design and implementation.
> > 
> > Doug, I would love to see if we can use some of this work to parallelize
> > GLM at a higher level than using remotecall and fetch.
> > 
> > -viral
> > 
> > On Tuesday, July 2, 2013 11:10:35 PM UTC+5:30, Douglas Bates wrote:
> >> On Tuesday, July 2, 2013 6:26:33 AM UTC-5, Raj DG wrote:
> >>> Hi all,
> >>> 
> >>> I am a regular user of R and also use it for handling very large data
> >>> sets (~ 50 GB). We have enough RAM to fit all that data into memory for
> >>> processing, so don't really need to do anything additional to chunk,
> >>> etc.
> >>> 
> >>> I wanted to get an idea of whether anyone has, in practice, performed
> >>> analysis on large data sets using Julia. Use cases range from performing
> >>> Cox Regression on ~ 40 million rows and over 10 independent variables to
> >>> simple statistical analysis using T-Tests, etc. Also, how do the timings
> >>> for operations like logistic regressions compare in Julia? Are there any
> >>> libraries/packages that can perform Cox, Poisson (Negative Binomial), and
> >>> other regression types?
> >>> 
> >>> The benchmarks for Julia look promising, but in today's age of "big
> >>> data", it seems that the capability of handling large data is a
> >>> pre-requisite to the future success of any new platform or language.
> >>> Looking forward to your feedback,
> >> 
> >> I think the potential for working with large data sets in Julia is better
> >> than that in R.  Among other things Julia allows for memory-mapped files
> >> and for distributed arrays, both of which have great potential.
> >> 
> >> I have been working with some Biostatisticians on a prototype package for
> >> working with SNP data of the sort generated in genome-wide association
> >> studies.  Current data sizes can be information on tens of thousands of
> >> individuals (rows) for over a million SNP positions (columns).  The nature
> >> of the data is such that each position provides one of four potential
> >> values, including a missing value.  A compact storage format using 2 bits
> >> per position is widely used for such data.  We are able to read and process
> >> such a large array in a few seconds using memory-mapped files in Julia.
> >> 
> >> The amazing thing is that the code is pure Julia.  When I write in R I am
> >> always conscious of the bottlenecks and the need to write C or C++ code for
> >> those places.  I haven't encountered cases where I need to write new code
> >> in a compiled language to speed up a Julia function.  I have interfaced to
> >> existing numerical libraries but have not written fresh code.
> >> 
> >> As John mentioned I have written the GLM package allowing for hooks to
> >> use distributed arrays.  As yet I haven't had a large enough problem to
> >> warrant fleshing out those hooks but I could be persuaded.
