Re: [julia-users] Re: Implementing mapreduce parallel model (not general multi-threading) ? easy and enough ?

David van Leeuwen Tue, 06 Oct 2015 13:44:25 -0700

See also an earlier discussion 
<https://groups.google.com/d/msg/julia-users/WJBIAYzrZgg/_UPc8vhkCAAJ> on a 
similar topic, for an out-of-core approach.


---david

On Tuesday, October 6, 2015 at 10:29:52 PM UTC+2, Tim Holy wrote:
>
> There's 
>
> https://github.com/JuliaParallel/DistributedArrays.jl 
> https://github.com/JuliaParallel/HDFS.jl 
>
> in case they help. (See the other packages in JuliaParallel, in case you
> have missed that organization.)
>
> --Tim 
>
> On Tuesday, October 06, 2015 12:57:17 PM Andrei Zh wrote: 
> > Yet, calling Julia processes on other machines via ssh doesn't address
> > data locality. In big data systems (say, > 1TB) the main performance
> > concern is not the number of CPUs, but I/O operations and data movement
> > across a cluster, so map-reduce tries to do as much as possible on local
> > data without any movement (map phase) and then combines the results
> > globally (reduce phase). This way a small program is sent to the data
> > nodes instead of huge data being sent to the program's node(s).
> > 
> > As far as I know, Julia doesn't provide any tools for working with huge
> > distributed datasets, which is why I say it doesn't give you Hadoop- (or
> > Spark-, or Google-like) map-reduce. But it's quite easy to add these
> > features of MR too. E.g. one can use Elly.jl to access HDFS (including
> > the location of data blocks) and execute tasks using remotecall() on the
> > Julia worker closest to the data.
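
A minimal sketch of that data-local pattern, assuming current Julia (1.x), where the multiprocessing API lives in the Distributed standard library and remotecall() takes the function first. The partitions, the worker-to-partition pairing, and the local_count helper are hypothetical stand-ins: a real job would ask HDFS where each block lives and pick the worker on that host, which is not shown here.

```julia
using Distributed

# Start two local workers; in a real cluster these would run on the data nodes.
ws = addprocs(2)

# Map phase: runs on the worker that "owns" the partition and returns only
# a small summary (here, how many even numbers the partition contains).
@everywhere function local_count(partition)
    return count(iseven, partition)
end

# Hypothetical partitions. Ranges stand in for data blocks that already
# live next to each worker, so nothing large crosses the network.
partitions = [1:1000, 1001:2000]

# Ship the small function call to each worker and keep a Future per result.
futures = [remotecall(local_count, w, p) for (w, p) in zip(ws, partitions)]

# Reduce phase: fetch the small per-partition results and combine them.
total = sum(map(fetch, futures))
println(total)
```

Only the function call and one integer per partition travel between processes; the open question for a real system is how to build the worker-to-block pairing.
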
> > 
> > On Tuesday, October 6, 2015 at 8:03:57 PM UTC+3, Stefan Karpinski wrote: 
> > > That works fine in a distributed setting if you start Julia workers on 
> > > other machines, so it is actually a legitimate form of map reduce. It 
> > > doesn't do anything for handling machine failures, however, which was 
> > > arguably the major concern of the original MapReduce design. 
> > > 
> > > On Tue, Oct 6, 2015 at 10:24 AM, Andrei Zh wrote:
> > >> Julia supports multiprocessing pretty well, including map-reduce-like
> > >> jobs. E.g. in the following example I add 3 processes to a "workgroup",
> > >> distribute a simulation between them and then reduce the results with
> > >> the (+) operator:
> > >> 
> > >> 
> > >> julia> addprocs(3)
> > >> 3-element Array{Int64,1}:
> > >>  2
> > >>  3
> > >>  4
> > >>
> > >> julia> nheads = @parallel (+) for i=1:200000000
> > >>          Int(rand(Bool))
> > >>        end
> > >> 100008845
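
For readers on current Julia (1.x): @parallel was renamed @distributed and moved into the Distributed standard library. A sketch of the same coin-flip count under that assumption follows; the count_heads wrapper and the smaller iteration count are illustrative choices, not part of the quoted example.

```julia
using Distributed

addprocs(3)  # start three local worker processes

# Count heads in n simulated coin flips. The iterations are split across
# the workers and the partial sums are combined with (+).
function count_heads(n)
    @distributed (+) for i in 1:n
        Int(rand(Bool))
    end
end

nheads = count_heads(200_000)
println(nheads)
```

The result is random, so it will land near n/2 rather than on a fixed value.
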
> > >> 
> > >> You can find the full example, and a lot of other fun, in the official
> > >> documentation on parallel computing:
> > >>
> > >> http://julia.readthedocs.org/en/latest/manual/parallel-computing/
> > >> 
> > >> Note, though, that it's not real (i.e. Hadoop/Spark-like) map-reduce,
> > >> since the original idea of MR concerns distributed systems and
> > >> data-local computations, while here we do everything on the same
> > >> machine. If you are looking for a big data solution, search this forum
> > >> for some (dead or alive) projects for it.
> > >> 
> > >> On Monday, October 5, 2015 at 11:52:21 PM UTC+3, cheng wang wrote: 
> > >>> Hello everyone, 
> > >>> 
> > >>> I am a Julia newbie. I am thrilled by Julia recently. It's an amazing
> > >>> language!
> > >>> 
> > >>> I notice that Julia currently does not have good support for
> > >>> multi-threaded programming.
> > >>> So I am thinking that a Spark-like mapreduce parallel model plus
> > >>> multi-processing may be enough.
> > >>> It is easy to make thread-safe, and it could handle most vector-based
> > >>> computation.
> > >>> 
> > >>> This idea might be too naive. However, I am happy to see your opinions.
> > >>> 
> > >>> Thanks in advance, 
> > >>> Cheng 
>
>
