On Monday, November 09, 2015 12:43:17 PM John Brock wrote:
> It looks like SharedArray(filename, T, dims) isn't documented, but
> SharedArray(T, dims; init=false, pids=Int[]) is. What's the difference?
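To make the question concrete, here is a minimal sketch of the two constructors as they were spelled around Julia v0.4, the era of this thread (later releases moved them into the SharedArrays stdlib as `SharedArray{T}(dims)` and `SharedArray{T}(filename, dims)`; the file path below is hypothetical):

```julia
# Plain shared array: backed by anonymous shared memory, visible to the
# participating worker processes; contents start uninitialized unless an
# init function is supplied.
S1 = SharedArray(Float64, (1000, 1000);
                 init = S -> S[localindexes(S)] = 0.0)

# File-backed shared array: the array's storage is the file itself,
# memory-mapped, so it can exceed RAM and writes persist to disk.
S2 = SharedArray("data.bin", Float64, (1000, 1000))
```

The practical difference, then: the keyword form allocates fresh shared memory for parallel work, while the filename form wraps an on-disk file, which is what makes it useful for larger-than-memory data.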
See #14532.

--Tim

On Friday, November 6, 2015 at 2:21:01 AM UTC-8, Tim Holy wrote:
> Not sure if it's as high-level as you're hoping for, but Julia has great
> support for arrays that are much bigger than memory. See Mmap.mmap and
> SharedArray(filename, T, dims).
>
> --Tim

On Thursday, November 05, 2015 06:33:52 PM André Lage wrote:
> Hi Viral,
>
> Do you have any news on this?
>
> André Lage.

On Wednesday, July 3, 2013 at 5:12:06 AM UTC-3, Viral Shah wrote:
> Hi all,
>
> I am cross-posting my reply to julia-stats and julia-users, as there was
> a separate post on large logistic regressions on julia-users too.
>
> Just as these questions came up, Tanmay and I have been chatting about a
> general framework for working on problems that are too large to fit in
> memory, or that need parallelism for performance. The idea is simple:
> provide a convenient, generic way to break a problem into subproblems,
> each of which can then be scheduled to run anywhere. To start with, we
> will implement map and mapreduce on top of this, and we hope it will be
> able to handle large files sequentially, distributed in-memory data, and
> distributed filesystems within the same framework. Of course, this all
> sounds too good to be true. We are trying out a simple implementation,
> and if early results are promising, we can have a detailed discussion on
> API design and implementation.
>
> Doug, I would love to see if we can use some of this work to parallelize
> GLM at a higher level than using remotecall and fetch.
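The chunked map/mapreduce scheme Viral describes can be sketched in a few lines; everything here (the function names, the chunk-by-bytes choice) is hypothetical illustration, not an actual API:

```julia
# Express the computation as a mapreduce over subproblems -- here,
# fixed-size chunks streamed from an IO source. The same f/op pair could
# later be scheduled sequentially, over workers, or over a distributed
# filesystem without changing the user's code.
function chunked_mapreduce(f, op, io::IO, chunkbytes::Int)
    acc = nothing
    while !eof(io)
        chunk = read(io, chunkbytes)            # up to chunkbytes bytes
        v = f(chunk)                            # map: solve the subproblem
        acc = acc === nothing ? v : op(acc, v)  # reduce: combine results
    end
    return acc
end

# Toy usage: sum the bytes of a 5-byte stream, two bytes at a time.
io = IOBuffer(UInt8[1, 2, 3, 4, 5])
total = chunked_mapreduce(chunk -> sum(Int, chunk), +, io, 2)  # == 15
```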
> -viral

On Tuesday, July 2, 2013 11:10:35 PM UTC+5:30, Douglas Bates wrote:
> On Tuesday, July 2, 2013 6:26:33 AM UTC-5, Raj DG wrote:
>> Hi all,
>>
>> I am a regular user of R and also use it for handling very large data
>> sets (~50 GB). We have enough RAM to fit all that data into memory for
>> processing, so we don't really need to do anything additional to chunk,
>> etc.
>>
>> I wanted to get an idea of whether anyone has, in practice, performed
>> analysis on large data sets using Julia. Use cases range from performing
>> Cox regression on ~40 million rows with over 10 independent variables to
>> simple statistical analysis using t-tests, etc. Also, how do the timings
>> for operations like logistic regression compare between R and Julia? Are
>> there any libraries/packages that can perform Cox, Poisson (negative
>> binomial), and other regression types?
>>
>> The benchmarks for Julia look promising, but in today's age of "big
>> data", the capability of handling large data seems to be a prerequisite
>> for the future success of any new platform or language. Looking forward
>> to your feedback.
>
> I think the potential for working with large data sets in Julia is better
> than in R. Among other things, Julia allows for memory-mapped files and
> for distributed arrays, both of which have great potential.
>
> I have been working with some biostatisticians on a prototype package for
> working with SNP data of the sort generated in genome-wide association
> studies. Current data sizes can be information on tens of thousands of
> individuals (rows) at over a million SNP positions (columns). The nature
> of the data is such that each position provides one of four possible
> values, including a missing value. A compact storage format using 2 bits
> per position is widely used for such data. We are able to read and
> process such a large array in a few seconds using memory-mapped files in
> Julia.
>
> The amazing thing is that the code is pure Julia. When I write in R I am
> always conscious of the bottlenecks and of the need to write C or C++
> code for those places. I haven't encountered cases where I need to write
> new code in a compiled language to speed up a Julia function. I have
> interfaced to existing numerical libraries but have not written fresh
> code.
>
> As John mentioned, I have written the GLM package with hooks for using
> distributed arrays. As yet I haven't had a large enough problem to
> warrant fleshing out those hooks, but I could be persuaded.
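The 2-bit-per-genotype layout Doug describes (the convention used by PLINK .bed files, four genotypes packed per byte) decodes with a shift and a mask; this is an illustrative sketch, not the prototype package's actual code, and the file name in the comment is hypothetical:

```julia
# Extract the i-th 2-bit genotype code (i is 0-based) from a packed byte
# vector: byte index is i ÷ 4, then shift by 2 bits per position within
# the byte and mask off the low two bits.
genotype(packed::Vector{UInt8}, i::Int) =
    (packed[(i >> 2) + 1] >> (2 * (i & 3))) & 0x03

# One byte holding the codes 0, 1, 2, 3 (lowest-order pair first).
packed = UInt8[0b11100100]
codes = [genotype(packed, i) for i in 0:3]   # UInt8[0, 1, 2, 3]

# On real data the packed bytes would come straight off disk, e.g.
#   packed = Mmap.mmap(open("geno.bed"), Vector{UInt8}, filesize("geno.bed"))
# so the million-column matrix never has to fit in RAM.
```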
