No, it's for taking an "array" that's already in a file and sharing it among workers via mmap.
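To make the two constructors concrete, here is a minimal sketch (the file path and sizes are made up, and this assumes the Julia 0.4-era `SharedArray` API being discussed in the thread):

```julia
# In-memory SharedArray: backed by anonymous shared memory on the local host.
# The init function runs on each worker to fill its local portion.
A = SharedArray(Float64, (1000, 4); init = S -> S[localindexes(S)] = 1.0)

# File-backed SharedArray: first write a plain binary array to disk...
open("/tmp/mydata.bin", "w+") do io        # hypothetical path
    write(io, rand(Float64, 1000, 4))      # stored column-major, no header
end

# ...then map the existing file into every worker's address space via mmap.
# No data is copied between processes; they all see the same pages.
B = SharedArray("/tmp/mydata.bin", Float64, (1000, 4))
```

The filename variant does not allocate fresh shared memory; it maps whatever bytes are already at that path, which is the behavior described above.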
Docs would be great. I might contribute them, but anyone else is welcome to do so.

--Tim

On Wednesday, November 11, 2015 04:31:58 PM John Brock wrote:
> Thanks, Tomas, but I wasn't referring to the keyword arguments. One
> signature starts with a filename argument, and the other doesn't. What's
> the difference? Is the filename specifying the location at which to create
> a memory-mapped file?
>
> On Wednesday, November 11, 2015 at 3:33:56 AM UTC-8, Tomas Lycken wrote:
>> Everything after the semicolon is keyword arguments, and will dispatch to
>> the same method as if they are left out. Thus, the documentation for
>> SharedArray(T, dims; init=false, pids=[]) is valid for SharedArray(T, dims)
>> too, and the values of init and pids will be the ones given in the
>> signature.
>>
>> // T
>>
>> On Monday, November 9, 2015 at 9:43:17 PM UTC+1, John Brock wrote:
>>> It looks like SharedArray(filename, T, dims) isn't documented, but
>>> SharedArray(T, dims; init=false, pids=Int[]) is. What's the difference?
>>>
>>> On Friday, November 6, 2015 at 2:21:01 AM UTC-8, Tim Holy wrote:
>>>> Not sure if it's as high-level as you're hoping for, but Julia has
>>>> great support for arrays that are much bigger than memory. See
>>>> Mmap.mmap and SharedArray(filename, T, dims).
>>>>
>>>> --Tim
>>>>
>>>> On Thursday, November 05, 2015 06:33:52 PM André Lage wrote:
>>>>> hi Viral,
>>>>>
>>>>> Do you have any news on this?
>>>>>
>>>>> André Lage.
>>>>>
>>>>> On Wednesday, July 3, 2013 at 5:12:06 AM UTC-3, Viral Shah wrote:
>>>>>> Hi all,
>>>>>>
>>>>>> I am cross-posting my reply to julia-stats and julia-users as there
>>>>>> was a separate post on large logistic regressions on julia-users too.
>>>>>> Just as these questions came up, Tanmay and I have been chatting
>>>>>> about a general framework for working on problems that are too large
>>>>>> to fit in memory, or need parallelism for performance. The idea is
>>>>>> simple and based on providing a convenient and generic way to break
>>>>>> up a problem into subproblems, each of which can then be scheduled to
>>>>>> run anywhere. To start with, we will implement a map and mapreduce
>>>>>> using this, and we hope that it should be able to handle large files
>>>>>> sequentially, distributed data in-memory, and distributed filesystems
>>>>>> within the same framework. Of course, this all sounds too good to be
>>>>>> true. We are trying out a simple implementation, and if early results
>>>>>> are promising, we can have a detailed discussion on API design and
>>>>>> implementation.
>>>>>>
>>>>>> Doug, I would love to see if we can use some of this work to
>>>>>> parallelize GLM at a higher level than using remotecall and fetch.
>>>>>>
>>>>>> -viral
>>>>>>
>>>>>> On Tuesday, July 2, 2013 11:10:35 PM UTC+5:30, Douglas Bates wrote:
>>>>>>> On Tuesday, July 2, 2013 6:26:33 AM UTC-5, Raj DG wrote:
>>>>>>>> Hi all,
>>>>>>>>
>>>>>>>> I am a regular user of R and also use it for handling very large
>>>>>>>> data sets (~50 GB). We have enough RAM to fit all that data into
>>>>>>>> memory for processing, so don't really need to do anything
>>>>>>>> additional to chunk, etc.
>>>>>>>>
>>>>>>>> I wanted to get an idea of whether anyone has, in practice,
>>>>>>>> performed analysis on large data sets using Julia.
>>>>>>>> Use cases range from performing Cox regression on ~40 million rows
>>>>>>>> and over 10 independent variables to simple statistical analysis
>>>>>>>> using t-tests, etc. Also, how do the timings for operations like
>>>>>>>> logistic regression compare to Julia? Are there any
>>>>>>>> libraries/packages that can perform Cox, Poisson (negative
>>>>>>>> binomial), and other regression types?
>>>>>>>>
>>>>>>>> The benchmarks for Julia look promising, but in today's age of
>>>>>>>> "big data", it seems that the capability of handling large data is
>>>>>>>> a prerequisite to the future success of any new platform or
>>>>>>>> language.
>>>>>>>>
>>>>>>>> Looking forward to your feedback,
>>>>>>>
>>>>>>> I think the potential for working with large data sets in Julia is
>>>>>>> better than that in R. Among other things Julia allows for
>>>>>>> memory-mapped files and for distributed arrays, both of which have
>>>>>>> great potential.
>>>>>>>
>>>>>>> I have been working with some biostatisticians on a prototype
>>>>>>> package for working with SNP data of the sort generated in
>>>>>>> genome-wide association studies. Current data sizes can be
>>>>>>> information on tens of thousands of individuals (rows) for over a
>>>>>>> million SNP positions (columns). The nature of the data is such
>>>>>>> that each position provides one of four potential values, including
>>>>>>> a missing value. A compact storage format using 2 bits per position
>>>>>>> is widely used for such data. We are able to read and process such
>>>>>>> a large array in a few seconds using memory-mapped files in Julia.
>>>>>>> The amazing thing is that the code is pure Julia. When I write in R
>>>>>>> I am always conscious of the bottlenecks and the need to write C or
>>>>>>> C++ code for those places. I haven't encountered cases where I need
>>>>>>> to write new code in a compiled language to speed up a Julia
>>>>>>> function. I have interfaced to existing numerical libraries but not
>>>>>>> written fresh code.
>>>>>>>
>>>>>>> As John mentioned I have written the GLM package allowing for hooks
>>>>>>> to use distributed arrays. As yet I haven't had a large enough
>>>>>>> problem to warrant fleshing out those hooks, but I could be
>>>>>>> persuaded.
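The 2-bits-per-position layout Doug describes can be illustrated with a small sketch. The file name, dimensions, and genotype coding below are all hypothetical; the actual package he mentions may differ:

```julia
# Hypothetical sketch: mmap a packed 2-bit genotype matrix and decode entries.
# Four positions per byte; the codes 0b00..0b11 stand for the three genotypes
# plus a missing value.
nind, nsnp = 10_000, 1_000_000          # individuals x SNP positions (made up)
bytesper = cld(nind, 4)                 # 4 genotypes stored per byte

io = open("/tmp/genotypes.bin", "r")    # hypothetical packed file
raw = Mmap.mmap(io, Matrix{UInt8}, (bytesper, nsnp))

# Decode the 2-bit code for individual i at SNP position j (both 1-based):
# pick the byte holding individual i, then shift out its 2-bit field.
geno(raw, i, j) = (raw[((i - 1) >> 2) + 1, j] >> (2 * ((i - 1) & 3))) & 0x03
```

Because the matrix is mmapped, "reading" the file is just page-cache traffic, which is why a billion-entry packed array can be scanned in seconds.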

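The "break the problem into subproblems" idea Viral sketches can be roughed out as follows. The function name and file path are invented, and the real framework he and Tanmay were prototyping may look nothing like this:

```julia
# Hypothetical chunked mapreduce over a file too large for memory:
# process one block at a time and combine partial results with `op`.
function chunked_mapreduce(f, op, v0, filename; chunksize = 1_000_000)
    acc = v0
    open(filename) do io
        while !eof(io)
            block = readbytes(io, chunksize)  # reads at most chunksize bytes
            acc = op(acc, f(block))
        end
    end
    return acc
end

# e.g. count zero bytes in a huge binary file, chunk by chunk:
# nzeros = chunked_mapreduce(b -> countnz(b .== 0), +, 0, "/tmp/big.bin")
```

Each `f(block)` call is independent, so the same decomposition could be scheduled on workers (sequential file, in-memory distributed data, or a distributed filesystem) with only the chunk-fetching step changing.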