On Monday, November 09, 2015 12:43:17 PM John Brock wrote:
> It looks like SharedArray(filename, T, dims) isn't documented, but
> SharedArray(T, dims; init=false, pids=Int[]) is. What's the difference?
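To make the question concrete, here is a minimal sketch of the two constructors as they were spelled around Julia v0.4, the era of this thread (later releases moved them into the SharedArrays stdlib as `SharedArray{T}(dims)` and `SharedArray{T}(filename, dims)`; the file path below is hypothetical):

```julia
# Plain shared array: backed by anonymous shared memory, visible to the
# participating worker processes; contents start uninitialized unless an
# init function is supplied.
S1 = SharedArray(Float64, (1000, 1000);
                 init = S -> S[localindexes(S)] = 0.0)

# File-backed shared array: the array's storage is the file itself,
# memory-mapped, so it can exceed RAM and writes persist to disk.
S2 = SharedArray("data.bin", Float64, (1000, 1000))
```

The practical difference, then: the keyword form allocates fresh shared memory for parallel work, while the filename form wraps an on-disk file, which is what makes it useful for larger-than-memory data.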
See #14532.

--Tim

On Friday, November 6, 2015 at 2:21:01 AM UTC-8, Tim Holy wrote:
> Not sure if it's as high-level as you're hoping for, but Julia has great
> support for arrays that are much bigger than memory. See Mmap.mmap and
> SharedArray(filename, T, dims).
>
> --Tim

On Thursday, November 05, 2015 06:33:52 PM André Lage wrote:
> Hi Viral,
>
> Do you have any news on this?
>
> André Lage.

On Wednesday, July 3, 2013 at 5:12:06 AM UTC-3, Viral Shah wrote:
> Hi all,
>
> I am cross-posting my reply to julia-stats and julia-users, as there was
> a separate post on large logistic regressions on julia-users too.
>
> Just as these questions came up, Tanmay and I have been chatting about a
> general framework for working on problems that are too large to fit in
> memory, or that need parallelism for performance. The idea is simple:
> provide a convenient, generic way to break a problem into subproblems,
> each of which can then be scheduled to run anywhere. To start with, we
> will implement map and mapreduce on top of this, and we hope it will be
> able to handle large files sequentially, distributed in-memory data, and
> distributed filesystems within the same framework. Of course, this all
> sounds too good to be true. We are trying out a simple implementation,
> and if early results are promising, we can have a detailed discussion on
> API design and implementation.
>
> Doug, I would love to see if we can use some of this work to parallelize
> GLM at a higher level than using remotecall and fetch.
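The chunked map/mapreduce scheme Viral describes can be sketched in a few lines; everything here (the function names, the chunk-by-bytes choice) is hypothetical illustration, not an actual API:

```julia
# Express the computation as a mapreduce over subproblems -- here,
# fixed-size chunks streamed from an IO source. The same f/op pair could
# later be scheduled sequentially, over workers, or over a distributed
# filesystem without changing the user's code.
function chunked_mapreduce(f, op, io::IO, chunkbytes::Int)
    acc = nothing
    while !eof(io)
        chunk = read(io, chunkbytes)            # up to chunkbytes bytes
        v = f(chunk)                            # map: solve the subproblem
        acc = acc === nothing ? v : op(acc, v)  # reduce: combine results
    end
    return acc
end

# Toy usage: sum the bytes of a 5-byte stream, two bytes at a time.
io = IOBuffer(UInt8[1, 2, 3, 4, 5])
total = chunked_mapreduce(chunk -> sum(Int, chunk), +, io, 2)  # == 15
```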
> -viral

On Tuesday, July 2, 2013 11:10:35 PM UTC+5:30, Douglas Bates wrote:
> On Tuesday, July 2, 2013 6:26:33 AM UTC-5, Raj DG wrote:
>> Hi all,
>>
>> I am a regular user of R and also use it for handling very large data
>> sets (~50 GB). We have enough RAM to fit all that data into memory for
>> processing, so we don't really need to do anything additional to chunk,
>> etc.
>>
>> I wanted to get an idea of whether anyone has, in practice, performed
>> analysis on large data sets using Julia. Use cases range from performing
>> Cox regression on ~40 million rows with over 10 independent variables to
>> simple statistical analysis using t-tests, etc. Also, how do the timings
>> for operations like logistic regression compare between R and Julia? Are
>> there any libraries/packages that can perform Cox, Poisson (negative
>> binomial), and other regression types?
>>
>> The benchmarks for Julia look promising, but in today's age of "big
>> data", the capability of handling large data seems to be a prerequisite
>> for the future success of any new platform or language. Looking forward
>> to your feedback.
>
> I think the potential for working with large data sets in Julia is better
> than in R. Among other things, Julia allows for memory-mapped files and
> for distributed arrays, both of which have great potential.
>
> I have been working with some biostatisticians on a prototype package for
> working with SNP data of the sort generated in genome-wide association
> studies. Current data sizes can be information on tens of thousands of
> individuals (rows) at over a million SNP positions (columns). The nature
> of the data is such that each position provides one of four possible
> values, including a missing value. A compact storage format using 2 bits
> per position is widely used for such data. We are able to read and
> process such a large array in a few seconds using memory-mapped files in
> Julia.
>
> The amazing thing is that the code is pure Julia. When I write in R I am
> always conscious of the bottlenecks and of the need to write C or C++
> code for those places. I haven't encountered cases where I need to write
> new code in a compiled language to speed up a Julia function. I have
> interfaced to existing numerical libraries but have not written fresh
> code.
>
> As John mentioned, I have written the GLM package with hooks for using
> distributed arrays. As yet I haven't had a large enough problem to
> warrant fleshing out those hooks, but I could be persuaded.
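The 2-bit-per-genotype layout Doug describes (the convention used by PLINK .bed files, four genotypes packed per byte) decodes with a shift and a mask; this is an illustrative sketch, not the prototype package's actual code, and the file name in the comment is hypothetical:

```julia
# Extract the i-th 2-bit genotype code (i is 0-based) from a packed byte
# vector: byte index is i ÷ 4, then shift by 2 bits per position within
# the byte and mask off the low two bits.
genotype(packed::Vector{UInt8}, i::Int) =
    (packed[(i >> 2) + 1] >> (2 * (i & 3))) & 0x03

# One byte holding the codes 0, 1, 2, 3 (lowest-order pair first).
packed = UInt8[0b11100100]
codes = [genotype(packed, i) for i in 0:3]   # UInt8[0, 1, 2, 3]

# On real data the packed bytes would come straight off disk, e.g.
#   packed = Mmap.mmap(open("geno.bed"), Vector{UInt8}, filesize("geno.bed"))
# so the million-column matrix never has to fit in RAM.
```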
