Hi Viral, do you have any news on this?
André Lage.

On Wednesday, July 3, 2013 at 5:12:06 AM UTC-3, Viral Shah wrote:
>
> Hi all,
>
> I am cross-posting my reply to julia-stats and julia-users, as there was a
> separate post on large logistic regressions on julia-users too.
>
> Just as these questions came up, Tanmay and I have been chatting about a
> general framework for working on problems that are too large to fit in
> memory, or that need parallelism for performance. The idea is simple and
> based on providing a convenient, generic way to break a problem into
> subproblems, each of which can then be scheduled to run anywhere. To start
> with, we will implement map and mapreduce using this, and we hope it will
> be able to handle large files sequentially, distributed in-memory data,
> and distributed filesystems within the same framework. Of course, this all
> sounds too good to be true. We are trying out a simple implementation, and
> if early results are promising, we can have a detailed discussion on API
> design and implementation.
>
> Doug, I would love to see if we can use some of this work to parallelize
> GLM at a higher level than using remotecall and fetch.
>
> -viral
>
> On Tuesday, July 2, 2013 11:10:35 PM UTC+5:30, Douglas Bates wrote:
>>
>> On Tuesday, July 2, 2013 6:26:33 AM UTC-5, Raj DG wrote:
>>
>>> Hi all,
>>>
>>> I am a regular user of R and also use it for handling very large data
>>> sets (~50 GB). We have enough RAM to fit all that data into memory for
>>> processing, so we don't really need to do anything additional to chunk
>>> it, etc.
>>>
>>> I wanted to get an idea of whether anyone has, in practice, performed
>>> analysis on large data sets using Julia. Use cases range from performing
>>> Cox regression on ~40 million rows with over 10 independent variables to
>>> simple statistical analysis using t-tests, etc. Also, how do the timings
>>> for operations like logistic regression compare in Julia? Are there any
>>> libraries/packages that can perform Cox, Poisson (negative binomial),
>>> and other regression types?
>>>
>>> The benchmarks for Julia look promising, but in today's age of "big
>>> data", the capability of handling large data seems to be a prerequisite
>>> for the future success of any new platform or language. Looking forward
>>> to your feedback.
>>
>> I think the potential for working with large data sets in Julia is better
>> than in R. Among other things, Julia allows for memory-mapped files and
>> for distributed arrays, both of which have great potential.
>>
>> I have been working with some biostatisticians on a prototype package for
>> working with SNP data of the sort generated in genome-wide association
>> studies. Current data sizes can be information on tens of thousands of
>> individuals (rows) at over a million SNP positions (columns). The nature
>> of the data is such that each position provides one of four possible
>> values, including a missing value. A compact storage format using 2 bits
>> per position is widely used for such data. We are able to read and
>> process such a large array in a few seconds using memory-mapped files in
>> Julia. The amazing thing is that the code is pure Julia. When I write in
>> R I am always conscious of the bottlenecks and the need to write C or C++
>> code for those places. I haven't encountered cases where I need to write
>> new code in a compiled language to speed up a Julia function. I have
>> interfaced to existing numerical libraries but not written fresh code.
>>
>> As John mentioned, I have written the GLM package allowing for hooks to
>> use distributed arrays. As yet I haven't had a large enough problem to
>> warrant fleshing out those hooks, but I could be persuaded.
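For anyone following along: the split-into-subproblems idea Viral describes can be sketched in a few lines of plain Julia. Note this is only an illustration of the concept; `chunks` and `chunked_mapreduce` are hypothetical names, not the framework's actual API, and the chunks are processed serially here where the real framework would schedule each one on a worker (e.g. via pmap or remotecall).

```julia
# Split the index range 1:n into k roughly equal subranges.
chunks(n, k) = [(1 + (i - 1) * cld(n, k)):min(i * cld(n, k), n) for i in 1:k]

# Apply `f` to each chunk of `data` and combine the partial results with
# `op`. Each f(view(...)) call is independent, so it could run anywhere.
function chunked_mapreduce(f, op, data; nchunks = 4)
    parts = [f(view(data, r)) for r in chunks(length(data), nchunks)]
    return reduce(op, parts)
end

x = collect(1:1_000_000)
@assert chunked_mapreduce(sum, +, x) == sum(x)
```

Because the chunk results only meet in the final `reduce`, the same skeleton covers large files read sequentially (one chunk per block) and distributed in-memory data (one chunk per worker), which is the appeal of the approach.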
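The 2-bits-per-position format Doug mentions is also easy to work with in pure Julia. A small sketch of the decoding step follows; the packing layout (four codes per byte, lowest-order bit pair first, as in PLINK .bed files) and the `genotype` function are assumptions for illustration, not code from the prototype package.

```julia
using Mmap  # a real GWAS file would be read with Mmap.mmap(io, Vector{UInt8}, nbytes)

# Each byte packs four 2-bit genotype codes (0-3; one code marks "missing").
# Decode code j (1-based) from a packed byte vector, assuming the
# lowest-order bit pair comes first within each byte.
function genotype(packed::AbstractVector{UInt8}, j::Integer)
    byte  = packed[(j - 1) >> 2 + 1]   # which byte holds code j
    shift = 2 * ((j - 1) & 3)          # offset of the bit pair in that byte
    return (byte >> shift) & 0x03      # the 2-bit code
end

# Pack and then decode a small in-memory example.
codes  = UInt8[0, 1, 2, 3, 1, 0]
packed = zeros(UInt8, cld(length(codes), 4))
for (j, c) in enumerate(codes)
    packed[(j - 1) >> 2 + 1] |= c << (2 * ((j - 1) & 3))
end
@assert [genotype(packed, j) for j in 1:length(codes)] == codes
```

Since `genotype` only indexes into a byte vector, pointing `packed` at a memory-mapped file instead of an in-memory array requires no code changes, which is presumably why the pure-Julia version stays fast at GWAS scale.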
