Re: [julia-users] Re: Large Data Sets in Julia

John Brock Wed, 11 Nov 2015 16:32:14 -0800

Thanks, Tomas, but I wasn't referring to the keyword arguments. One 
signature starts with a filename argument, and the other doesn't. What's 
the difference? Is the filename specifying the location at which to create 
a memory mapped file?


On Wednesday, November 11, 2015 at 3:33:56 AM UTC-8, Tomas Lycken wrote:
>
> Everything after the semicolon is a keyword argument, and a call dispatches 
> to the same method whether or not the keywords are given. Thus, the 
> documentation for SharedArray(T, dims; init=false, pids=[]) is valid for 
> SharedArray(T, dims) too, and the values of init and pids will be the ones 
> given in the signature.
>
> // T
>
> On Monday, November 9, 2015 at 9:43:17 PM UTC+1, John Brock wrote:
>
>> It looks like SharedArray(filename, T, dims) isn't documented, 
>> but SharedArray(T, dims; init=false, pids=Int[]) is. What's the difference? 
>>
>> On Friday, November 6, 2015 at 2:21:01 AM UTC-8, Tim Holy wrote:
>>>
>>> Not sure if it's as high-level as you're hoping for, but julia has great 
>>> support for arrays that are much bigger than memory. See Mmap.mmap and 
>>> SharedArray(filename, T, dims). 
>>>
>>> --Tim 
>>>
>>> On Thursday, November 05, 2015 06:33:52 PM André Lage wrote: 
>>> > hi Viral, 
>>> > 
>>> > Do you have any news on this? 
>>> > 
>>> > André Lage. 
>>> > 
>>> > On Wednesday, July 3, 2013 at 5:12:06 AM UTC-3, Viral Shah wrote: 
>>> > > Hi all, 
>>> > > 
>>> > > I am cross-posting my reply to julia-stats and julia-users as there was a
>>> > > separate post on large logistic regressions on julia-users too.
>>> > >
>>> > > Just as these questions came up, Tanmay and I have been chatting about a
>>> > > general framework for working on problems that are too large to fit in
>>> > > memory, or need parallelism for performance. The idea is simple and based
>>> > > on providing a convenient and generic way to break up a problem into
>>> > > subproblems, each of which can then be scheduled to run anywhere. To start
>>> > > with, we will implement a map and mapreduce using this, and we hope that it
>>> > > should be able to handle large files sequentially, distributed data
>>> > > in-memory, and distributed filesystems within the same framework. Of
>>> > > course, this all sounds too good to be true. We are trying out a simple
>>> > > implementation, and if early results are promising, we can have a detailed
>>> > > discussion on API design and implementation.
>>> > >
>>> > > Doug, I would love to see if we can use some of this work to parallelize
>>> > > GLM at a higher level than using remotecall and fetch.
>>> > > 
>>> > > -viral 
>>> > > 
>>> > > On Tuesday, July 2, 2013 11:10:35 PM UTC+5:30, Douglas Bates wrote: 
>>> > >> On Tuesday, July 2, 2013 6:26:33 AM UTC-5, Raj DG wrote: 
>>> > >>> Hi all, 
>>> > >>> 
>>> > >>> I am a regular user of R and also use it for handling very large data
>>> > >>> sets (~50 GB). We have enough RAM to fit all that data into memory for
>>> > >>> processing, so we don't really need to do anything additional to chunk,
>>> > >>> etc.
>>> > >>>
>>> > >>> I wanted to get an idea of whether anyone has, in practice, performed
>>> > >>> analysis on large data sets using Julia. Use cases range from performing
>>> > >>> Cox regression on ~40 million rows and over 10 independent variables to
>>> > >>> simple statistical analysis using t-tests, etc. Also, how do the timings
>>> > >>> for operations like logistic regressions compare in Julia? Are there any
>>> > >>> libraries/packages that can perform Cox, Poisson (Negative Binomial), and
>>> > >>> other regression types?
>>> > >>>
>>> > >>> The benchmarks for Julia look promising, but in today's age of "big
>>> > >>> data", it seems that the capability of handling large data is a
>>> > >>> prerequisite to the future success of any new platform or language.
>>> > >>> Looking forward to your feedback,
>>> > >> 
>>> > >> I think the potential for working with large data sets in Julia is better
>>> > >> than that in R.  Among other things, Julia allows for memory-mapped files
>>> > >> and for distributed arrays, both of which have great potential.
>>> > >>
>>> > >> I have been working with some biostatisticians on a prototype package for
>>> > >> working with SNP data of the sort generated in genome-wide association
>>> > >> studies.  Current data sizes can be information on tens of thousands of
>>> > >> individuals (rows) for over a million SNP positions (columns).  The nature
>>> > >> of the data is such that each position provides one of four potential
>>> > >> values, including a missing value.  A compact storage format using 2 bits
>>> > >> per position is widely used for such data.  We are able to read and process
>>> > >> such a large array in a few seconds using memory-mapped files in Julia.
>>> > >> 
>>> > >> The amazing thing is that the code is pure Julia.  When I write in R I am
>>> > >> always conscious of the bottlenecks and the need to write C or C++ code for
>>> > >> those places.  I haven't encountered cases where I need to write new code
>>> > >> in a compiled language to speed up a Julia function.  I have interfaced to
>>> > >> existing numerical libraries but have not written fresh code.
>>> > >>
>>> > >> As John mentioned, I have written the GLM package allowing for hooks to
>>> > >> use distributed arrays.  As yet I haven't had a large enough problem to
>>> > >> warrant fleshing out those hooks, but I could be persuaded.
>>>
>>> 
>
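
For readers following the keyword-argument point Tomas makes in the quoted message, here is a minimal sketch. The function `describe` is made up for illustration only; it is not SharedArray and says nothing about how SharedArray is implemented. It only shows that leaving the keywords out gives the defaults from the signature, and that keywords do not create extra methods.

```julia
# Hypothetical function, not part of Julia or of SharedArray.
# Its signature mirrors the shape "positional args; keyword args".
function describe(n::Integer; init=false, pids=Int[])
    return (n, init, pids)
end

# Leaving the keywords out fills in the defaults from the signature...
a = describe(3)
# ...so this call with the defaults spelled out returns the same thing.
b = describe(3; init=false, pids=Int[])

# Only one method is defined; keywords take no part in dispatch.
nmethods = length(methods(describe))
```

A signature with a different positional argument list, such as one that starts with a filename, is a separate method, which is what the question at the top of this message is about.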

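Tim's pointer to Mmap.mmap can be tried out roughly as follows. This is a sketch assuming a current Julia release, where Mmap is loaded with `using Mmap`; the thread dates from 2015, so details may differ from the Julia of that time. The scratch file from tempname() and the array size are arbitrary choices for illustration.

```julia
using Mmap

path = tempname()      # scratch file, chosen only for this illustration
dims = (1000, 100)     # arbitrary size for the example

# Create the file and map it as a matrix; the file is grown to fit.
io = open(path, "w+")
A = Mmap.mmap(io, Matrix{Float64}, dims)
A[1, 1] = 42.0         # writes go to the mapped pages
Mmap.sync!(A)          # ask for the changes to be flushed to disk
close(io)

# Map the same file again, read-only, and read the value back.
B = open(path, "r") do f
    Mmap.mmap(f, Matrix{Float64}, dims)
end
first_value = B[1, 1]
```

The array is backed by the file rather than by ordinary heap memory, so the operating system pages data in and out as it is touched, which is why this approach can handle arrays larger than RAM.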