See also an earlier discussion <https://groups.google.com/d/msg/julia-users/WJBIAYzrZgg/_UPc8vhkCAAJ> on a similar topic, for an out-of-core approach.
---david

On Tuesday, October 6, 2015 at 10:29:52 PM UTC+2, Tim Holy wrote:
>
> There's
>
> https://github.com/JuliaParallel/DistributedArrays.jl
> https://github.com/JuliaParallel/HDFS.jl
>
> in case they help. (See the other packages in JuliaParallel, in case you
> have missed that organization.)
>
> --Tim
>
> On Tuesday, October 06, 2015 12:57:17 PM Andrei Zh wrote:
> > Yet, calling Julia processes on other machines via ssh doesn't address
> > data locality. In big data systems (say, > 1TB) main performance concern
> > is not a number of CPUs, but IO operations and data movement across a
> > cluster, so map reduce tries to do as much as possible on local data
> > without any movement (map phase) and then combine results globally
> > (reduce phase). This way little program is send to data nodes instead of
> > huge data being sent to program's node(s).
> >
> > As far as I know, Julia doesn't provide any tools for working with huge
> > distributed datasets, that's why I say it doesn't give you Hadoop- (or
> > Spark-, or Google-like) map-reduce. But it's quite easy to add these
> > features of MR too. E.g. one can use Elly.jl to access HDFS (including
> > location of data blocks) and execute tasks using remotecall() on a Julia
> > worker which is closest to data.
> >
> > On Tuesday, October 6, 2015 at 8:03:57 PM UTC+3, Stefan Karpinski wrote:
> > > That works fine in a distributed setting if you start Julia workers on
> > > other machines, so it is actually a legitimate form of map reduce. It
> > > doesn't do anything for handling machine failures, however, which was
> > > arguably the major concern of the original MapReduce design.
> > >
> > > On Tue, Oct 6, 2015 at 10:24 AM, Andrei Zh wrote:
> > >> Julia supports multiprocessing pretty well, including map-reduce-like
> > >> jobs. E.g. in the next example I add 3 processes to a "workgroup",
> > >> distribute simulation between them and then reduce results via (+)
> > >> operator:
> > >>
> > >> julia> addprocs(3)
> > >> 3-element Array{Int64,1}:
> > >>  2
> > >>  3
> > >>  4
> > >>
> > >> julia> nheads = @parallel (+) for i=1:200000000
> > >>            Int(rand(Bool))
> > >>        end
> > >> 100008845
> > >>
> > >> You can find full example and a lot of other fun in official
> > >> documentation on parallel computing:
> > >>
> > >> http://julia.readthedocs.org/en/latest/manual/parallel-computing/
> > >>
> > >> Note, though, that it's not real (i.e. Hadoop/Spark-like) map-reduce,
> > >> since original idea of MR concerns distributed systems and data-local
> > >> computations, while here we do everything on the same machine. If you
> > >> are looking for big data solution, search this forum for some (dead or
> > >> alive) projects for it.
> > >>
> > >> On Monday, October 5, 2015 at 11:52:21 PM UTC+3, cheng wang wrote:
> > >>> Hello everyone,
> > >>>
> > >>> I am a Julia newbie. I am thrilled by Julia recently. It's an amazing
> > >>> language!
> > >>>
> > >>> I notice that julia currently does not have good support for
> > >>> multi-threading programming.
> > >>> So I am thinking that a spark-like mapreduce parallel model +
> > >>> multi-process maybe enough.
> > >>> It is easy to be thread-safe and It could solve most vector-based
> > >>> computation.
> > >>>
> > >>> This idea might be too naive. However, I am happy to see your opinions.
> > >>>
> > >>> Thanks in advance,
> > >>> Cheng
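
A note for anyone trying the quoted coin-flip example on a current Julia: it is written in the 2015 (pre-1.0) syntax. Since Julia 1.0 the reducing loop macro is called @distributed and lives in the Distributed standard library, which has to be loaded explicitly. Below is a minimal sketch of the same reduction under those assumptions; the worker count mirrors the quoted example, and the smaller iteration count is chosen here only so it runs quickly.

```julia
using Distributed

# Start 3 local worker processes, as in the quoted example.
addprocs(3)

# Iteration count is an assumption for a quick run; the quoted post used 200000000.
n = 1_000_000

# The range is split across the workers; each worker sums its own coin flips
# and the partial sums are combined with (+) on the calling process.
nheads = @distributed (+) for i in 1:n
    Int(rand(Bool))
end

println("heads: ", nheads, " out of ", n)
```

As in the original, this only spreads CPU work across processes. It does nothing about data locality or worker failure, which are the points raised elsewhere in the thread.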
