Hi Viral, do you have any news on this?
André Lage.

On Wednesday, July 3, 2013 at 5:12:06 AM UTC-3, Viral Shah wrote:
>
> Hi all,
>
> I am cross-posting my reply to julia-stats and julia-users, as there was a
> separate post on large logistic regressions on julia-users too.
>
> Just as these questions came up, Tanmay and I have been chatting about a
> general framework for working on problems that are too large to fit in
> memory, or that need parallelism for performance. The idea is simple and
> based on providing a convenient, generic way to break a problem into
> subproblems, each of which can then be scheduled to run anywhere. To start
> with, we will implement map and mapreduce using this, and we hope it will
> be able to handle large files sequentially, distributed in-memory data,
> and distributed filesystems within the same framework. Of course, this all
> sounds too good to be true. We are trying out a simple implementation, and
> if early results are promising, we can have a detailed discussion on API
> design and implementation.
>
> Doug, I would love to see if we can use some of this work to parallelize
> GLM at a higher level than using remotecall and fetch.
>
> -viral
>
> On Tuesday, July 2, 2013 11:10:35 PM UTC+5:30, Douglas Bates wrote:
>>
>> On Tuesday, July 2, 2013 6:26:33 AM UTC-5, Raj DG wrote:
>>
>>> Hi all,
>>>
>>> I am a regular user of R and also use it for handling very large data
>>> sets (~50 GB). We have enough RAM to fit all that data into memory for
>>> processing, so we don't really need to do anything additional to chunk
>>> it, etc.
>>>
>>> I wanted to get an idea of whether anyone has, in practice, performed
>>> analysis on large data sets using Julia. Use cases range from performing
>>> Cox regression on ~40 million rows with over 10 independent variables to
>>> simple statistical analysis using t-tests, etc. Also, how do the timings
>>> for operations like logistic regression compare in Julia? Are there any
>>> libraries/packages that can perform Cox, Poisson (negative binomial),
>>> and other regression types?
>>>
>>> The benchmarks for Julia look promising, but in today's age of "big
>>> data", the capability of handling large data seems to be a prerequisite
>>> for the future success of any new platform or language. Looking forward
>>> to your feedback.
>>
>> I think the potential for working with large data sets in Julia is better
>> than in R. Among other things, Julia allows for memory-mapped files and
>> for distributed arrays, both of which have great potential.
>>
>> I have been working with some biostatisticians on a prototype package for
>> working with SNP data of the sort generated in genome-wide association
>> studies. Current data sizes can be information on tens of thousands of
>> individuals (rows) at over a million SNP positions (columns). The nature
>> of the data is such that each position provides one of four possible
>> values, including a missing value. A compact storage format using 2 bits
>> per position is widely used for such data. We are able to read and
>> process such a large array in a few seconds using memory-mapped files in
>> Julia. The amazing thing is that the code is pure Julia. When I write in
>> R I am always conscious of the bottlenecks and the need to write C or C++
>> code for those places. I haven't encountered cases where I need to write
>> new code in a compiled language to speed up a Julia function. I have
>> interfaced to existing numerical libraries but not written fresh code.
>>
>> As John mentioned, I have written the GLM package allowing for hooks to
>> use distributed arrays. As yet I haven't had a large enough problem to
>> warrant fleshing out those hooks, but I could be persuaded.
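For anyone following along: the split-into-subproblems idea Viral describes can be sketched in a few lines of plain Julia. Note this is only an illustration of the concept; `chunks` and `chunked_mapreduce` are hypothetical names, not the framework's actual API, and the chunks are processed serially here where the real framework would schedule each one on a worker (e.g. via pmap or remotecall).

```julia
# Split the index range 1:n into k roughly equal subranges.
chunks(n, k) = [(1 + (i - 1) * cld(n, k)):min(i * cld(n, k), n) for i in 1:k]

# Apply `f` to each chunk of `data` and combine the partial results with
# `op`. Each f(view(...)) call is independent, so it could run anywhere.
function chunked_mapreduce(f, op, data; nchunks = 4)
    parts = [f(view(data, r)) for r in chunks(length(data), nchunks)]
    return reduce(op, parts)
end

x = collect(1:1_000_000)
@assert chunked_mapreduce(sum, +, x) == sum(x)
```

Because the chunk results only meet in the final `reduce`, the same skeleton covers large files read sequentially (one chunk per block) and distributed in-memory data (one chunk per worker), which is the appeal of the approach.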
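The 2-bits-per-position format Doug mentions is also easy to work with in pure Julia. A small sketch of the decoding step follows; the packing layout (four codes per byte, lowest-order bit pair first, as in PLINK .bed files) and the `genotype` function are assumptions for illustration, not code from the prototype package.

```julia
using Mmap  # a real GWAS file would be read with Mmap.mmap(io, Vector{UInt8}, nbytes)

# Each byte packs four 2-bit genotype codes (0-3; one code marks "missing").
# Decode code j (1-based) from a packed byte vector, assuming the
# lowest-order bit pair comes first within each byte.
function genotype(packed::AbstractVector{UInt8}, j::Integer)
    byte  = packed[(j - 1) >> 2 + 1]   # which byte holds code j
    shift = 2 * ((j - 1) & 3)          # offset of the bit pair in that byte
    return (byte >> shift) & 0x03      # the 2-bit code
end

# Pack and then decode a small in-memory example.
codes  = UInt8[0, 1, 2, 3, 1, 0]
packed = zeros(UInt8, cld(length(codes), 4))
for (j, c) in enumerate(codes)
    packed[(j - 1) >> 2 + 1] |= c << (2 * ((j - 1) & 3))
end
@assert [genotype(packed, j) for j in 1:length(codes)] == codes
```

Since `genotype` only indexes into a byte vector, pointing `packed` at a memory-mapped file instead of an in-memory array requires no code changes, which is presumably why the pure-Julia version stays fast at GWAS scale.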
