In my experience, Hadoop is pretty terrible about minimizing data movement; Spark seems to be significantly better. The only codes that really nail it are carefully handcrafted HPC codes.
On Tue, Oct 6, 2015 at 12:57 PM, Andrei Zh <[email protected]> wrote:

> Yet, calling Julia processes on other machines via ssh doesn't address
> data locality. In big data systems (say, >1TB) the main performance concern
> is not the number of CPUs, but IO operations and data movement across the
> cluster, so map-reduce tries to do as much as possible on local data without
> any movement (map phase) and then combines results globally (reduce phase).
> This way a small program is sent to the data nodes instead of huge data
> being sent to the program's node(s).
>
> As far as I know, Julia doesn't provide any tools for working with huge
> distributed datasets; that's why I say it doesn't give you Hadoop- (or
> Spark-, or Google-like) map-reduce. But it's quite easy to add these
> features of MR too. E.g. one can use Elly.jl to access HDFS (including the
> location of data blocks) and execute tasks using remotecall() on the Julia
> worker which is closest to the data.
>
>
> On Tuesday, October 6, 2015 at 8:03:57 PM UTC+3, Stefan Karpinski wrote:
>>
>> That works fine in a distributed setting if you start Julia workers on
>> other machines, so it is actually a legitimate form of map reduce. It
>> doesn't do anything for handling machine failures, however, which was
>> arguably the major concern of the original MapReduce design.
>>
>> On Tue, Oct 6, 2015 at 10:24 AM, Andrei Zh <[email protected]> wrote:
>>
>>> Julia supports multiprocessing pretty well, including map-reduce-like
>>> jobs. E.g. in the next example I add 3 processes to a "workgroup",
>>> distribute a simulation between them and then reduce the results via the
>>> (+) operator:
>>>
>>> julia> addprocs(3)
>>> 3-element Array{Int64,1}:
>>>  2
>>>  3
>>>  4
>>>
>>> julia> nheads = @parallel (+) for i = 1:200000000
>>>            Int(rand(Bool))
>>>        end
>>> 100008845
>>>
>>> You can find the full example and a lot of other fun in the official
>>> documentation on parallel computing:
>>>
>>> http://julia.readthedocs.org/en/latest/manual/parallel-computing/
>>>
>>> Note, though, that it's not real (i.e. Hadoop/Spark-like) map-reduce,
>>> since the original idea of MR concerns distributed systems and data-local
>>> computation, while here we do everything on the same machine. If you are
>>> looking for a big data solution, search this forum for some (dead or
>>> alive) projects for it.
>>>
>>>
>>> On Monday, October 5, 2015 at 11:52:21 PM UTC+3, cheng wang wrote:
>>>>
>>>> Hello everyone,
>>>>
>>>> I am a Julia newbie and have been thrilled by Julia recently. It's an
>>>> amazing language!
>>>>
>>>> I notice that Julia currently does not have good support for
>>>> multi-threaded programming, so I am thinking that a Spark-like
>>>> map-reduce parallel model plus multiple processes may be enough. It is
>>>> easy to make thread-safe and it could cover most vector-based
>>>> computation.
>>>>
>>>> This idea might be too naive. However, I am happy to see your opinions.
>>>>
>>>> Thanks in advance,
>>>> Cheng
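
For concreteness, here is a minimal sketch of the "ship the computation to the data" pattern Andrei describes above, using only Base Julia primitives (addprocs, @everywhere, @spawnat, fetch). The block-to-worker assignment, the file paths, and the count_lines helper are hypothetical placeholders; a real setup would derive the assignment from the storage layer (e.g. HDFS block locations obtained via Elly.jl) and start the workers on the machines that actually hold those blocks.

    # Start extra worker processes; over ssh this could instead be
    # addprocs(["node1", "node2", "node3"]).
    addprocs(3)

    # Hypothetical block -> worker assignment; assume each worker can read
    # its block as a local file path. In practice this mapping would come
    # from the storage layer's block-location metadata.
    blocks = Dict(2 => "/data/part-00000",
                  3 => "/data/part-00001",
                  4 => "/data/part-00002")

    # Define the map function on all workers.
    @everywhere count_lines(path) = countlines(path)

    # Map phase: each worker processes its own local block, so only the
    # small partial results travel over the network.
    futures = [@spawnat w count_lines(path) for (w, path) in blocks]

    # Reduce phase: combine the partial results on the master process.
    total = reduce(+, map(fetch, futures))
    println("total lines: ", total)

The point of the sketch is only that the map function is sent to where the data lives and the reduce step sees nothing but small partials, which is the part that plain addprocs-over-ssh does not give you for free.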
