> In my experience, Hadoop is pretty terrible about minimizing data
> movement; Spark seems to be significantly better.

If you mean MapReduce (the framework, version 1 or 2), it doesn't move
data anywhere unless you tell it to do so in the reduce phase. You could
be running into another issue with MR1: multiple reads and writes to disk
in multistage jobs, which makes them terrrrribly slow. (Recall that Hadoop
was born to efficiently and reliably download and store millions of web
pages obtained using Nutch, not to write iterative machine learning
algorithms.)
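
To make the map/reduce split concrete, here is a minimal Julia sketch. It is not Hadoop, and it uses the Distributed standard library of current Julia rather than the 2015 API. The chunk ranges and the helper name local_count are made up for illustration; each range merely stands in for a data block that would already sit next to a worker. The per-chunk work runs on the workers, and only one small integer per chunk travels back for the reduce step.

```julia
using Distributed                  # stdlib: addprocs, @everywhere, pmap

addprocs(2)                        # two local workers (illustrative count)

# "Map" step: the work done on whichever worker holds a chunk.
# local_count is a hypothetical helper, defined on every process.
@everywhere local_count(chunk) = count(isodd, chunk)

# Stand-ins for data blocks; in a real system these would be block
# locations, not data shipped from the master.
chunks = [1:1000, 1001:2000, 2001:3000]

# Each worker returns one small partial result per chunk.
partials = pmap(local_count, chunks)

# "Reduce" step: combine the partial results globally.
total = reduce(+, partials)

println(partials, " -> ", total)
```

Note that pmap by itself knows nothing about where data lives; scheduling work onto the worker nearest a block is the part a real MR framework adds.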
> The only codes that really nail it are carefully handcrafted HPC codes.

Could you please elaborate on this? I think I know Spark code quite well,
but I can't connect it to the notion of handcrafted HPC code.

> On Tue, Oct 6, 2015 at 12:57 PM, Andrei Zh <[email protected]> wrote:
>
>> Yet, calling Julia processes on other machines via ssh doesn't address
>> data locality. In big data systems (say, more than 1 TB) the main
>> performance concern is not the number of CPUs, but IO operations and
>> data movement across the cluster, so map reduce tries to do as much as
>> possible on local data without any movement (map phase) and then
>> combines the results globally (reduce phase). This way a small program
>> is sent to the data nodes instead of huge data being sent to the
>> program's node(s).
>>
>> As far as I know, Julia doesn't provide any tools for working with huge
>> distributed datasets, that's why I say it doesn't give you Hadoop- (or
>> Spark-, or Google-like) map-reduce. But it's quite easy to add these
>> features of MR too. E.g. one can use Elly.jl to access HDFS (including
>> the location of data blocks) and execute tasks using remotecall() on
>> the Julia worker which is closest to the data.
>>
>>
>> On Tuesday, October 6, 2015 at 8:03:57 PM UTC+3, Stefan Karpinski wrote:
>>>
>>> That works fine in a distributed setting if you start Julia workers on
>>> other machines, so it is actually a legitimate form of map reduce. It
>>> doesn't do anything for handling machine failures, however, which was
>>> arguably the major concern of the original MapReduce design.
>>>
>>> On Tue, Oct 6, 2015 at 10:24 AM, Andrei Zh <[email protected]> wrote:
>>>
>>>> Julia supports multiprocessing pretty well, including map-reduce-like
>>>> jobs. E.g.
>>>> in the next example I add 3 processes to a "workgroup",
>>>> distribute the simulation between them and then reduce the results
>>>> via the (+) operator:
>>>>
>>>>
>>>> julia> addprocs(3)
>>>> 3-element Array{Int64,1}:
>>>>  2
>>>>  3
>>>>  4
>>>>
>>>>
>>>> julia> nheads = @parallel (+) for i=1:200000000
>>>>          Int(rand(Bool))
>>>>        end
>>>> 100008845
>>>>
>>>> You can find the full example and a lot of other fun in the official
>>>> documentation on parallel computing:
>>>>
>>>> http://julia.readthedocs.org/en/latest/manual/parallel-computing/
>>>>
>>>> Note, though, that it's not real (i.e. Hadoop/Spark-like) map-reduce,
>>>> since the original idea of MR concerns distributed systems and
>>>> data-local computations, while here we do everything on the same
>>>> machine. If you are looking for a big data solution, search this
>>>> forum for some (dead or alive) projects for it.
>>>>
>>>>
>>>>
>>>> On Monday, October 5, 2015 at 11:52:21 PM UTC+3, cheng wang wrote:
>>>>>
>>>>> Hello everyone,
>>>>>
>>>>> I am a Julia newbie. I have been thrilled by Julia recently. It's an
>>>>> amazing language!
>>>>>
>>>>> I notice that Julia currently does not have good support for
>>>>> multi-threaded programming.
>>>>> So I am thinking that a Spark-like mapreduce parallel model +
>>>>> multi-process may be enough.
>>>>> It is easy to make thread-safe and it could cover most vector-based
>>>>> computation.
>>>>>
>>>>> This idea might be too naive. However, I am happy to hear your
>>>>> opinions.
>>>>>
>>>>> Thanks in advance,
>>>>> Cheng
>>>>>
>>>>
>>>
>
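
For anyone trying the quoted coin-flip example on a current Julia: since Julia 1.0 the @parallel macro is named @distributed, and addprocs lives in the Distributed standard library. Below is a minimal sketch of the same reduction under those assumptions. The loop count is smaller than in the quoted session, chosen only so that it finishes quickly.

```julia
using Distributed          # stdlib: provides addprocs and @distributed

addprocs(3)                # start 3 local worker processes

n = 2_000_000              # number of simulated coin flips (arbitrary)

# The iteration range is split across the workers; the value of each
# iteration (0 or 1) is combined with (+), so nheads is the head count.
nheads = @distributed (+) for i in 1:n
    Int(rand(Bool))
end

println("heads: ", nheads, " out of ", n)
```

The result is random, so expect a value near n/2 rather than any particular number. As in the quoted session, all workers here run on the local machine.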
