In my experience, Hadoop is pretty terrible about minimizing data movement;
Spark seems to be significantly better. The only codes that really nail it
are carefully handcrafted HPC codes.

On Tue, Oct 6, 2015 at 12:57 PM, Andrei Zh <[email protected]>
wrote:

> Yet, calling Julia processes on other machines via ssh doesn't address
> data locality. In big data systems (say, > 1 TB) the main performance
> concern is not the number of CPUs, but I/O operations and data movement
> across the cluster, so MapReduce tries to do as much as possible on local
> data without any movement (the map phase) and then combine the results
> globally (the reduce phase). This way a little program is sent to the data
> nodes instead of huge data being sent to the program's node(s).
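> A minimal single-machine sketch of the two phases (it doesn't show data
> locality, just the shape: each worker reduces its own chunk, then the
> small partial results are combined; the chunks here are stand-ins for
> data blocks):
>
>     addprocs(2)
>     chunks = [rand(10^6) for _ in 1:4]   # stand-ins for data blocks
>     partials = pmap(sum, chunks)         # "map" phase: per-chunk reduction
>     total = reduce(+, 0.0, partials)     # "reduce" phase: combine results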
>
> As far as I know, Julia doesn't provide any tools for working with huge
> distributed datasets; that's why I say it doesn't give you Hadoop- (or
> Spark-, or Google-like) MapReduce. But it's quite easy to add these
> features of MR too. E.g. one can use Elly.jl to access HDFS (including
> the locations of data blocks) and execute tasks using remotecall() on the
> Julia worker closest to the data, as in the sketch below.
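> A rough sketch of what that could look like (the block metadata and the
> host-to-worker mapping below are hypothetical placeholders, not Elly.jl's
> actual API, so check the package docs for the real calls):
>
>     using Elly
>
>     # `blocks` is assumed to be a list of (hosts, blk) pairs per HDFS
>     # block, and `worker_by_host` a Dict mapping hostname => worker id;
>     # both are placeholders for whatever Elly.jl actually returns.
>     function map_local(f, blocks, worker_by_host)
>         futures = Any[]
>         for (hosts, blk) in blocks
>             w = get(worker_by_host, first(hosts), workers()[1])
>             push!(futures, remotecall(w, f, blk))   # ship code, not data
>         end
>         [fetch(r) for r in futures]   # partial results, ready to reduce
>     end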
>
>
> On Tuesday, October 6, 2015 at 8:03:57 PM UTC+3, Stefan Karpinski wrote:
>>
>> That works fine in a distributed setting if you start Julia workers on
>> other machines, so it is actually a legitimate form of map-reduce. It
>> doesn't do anything to handle machine failures, however, which was
>> arguably the major concern of the original MapReduce design.
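>> For example (the hostnames are placeholders), SSH-launched workers can
>> be added with:
>>
>>     addprocs(["node1", "node2"])   # one worker per host, started via ssh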
>>
>> On Tue, Oct 6, 2015 at 10:24 AM, Andrei Zh <[email protected]> wrote:
>>
>>> Julia supports multiprocessing pretty well, including map-reduce-like
>>> jobs. E.g. in the following example I add 3 processes to a "workgroup",
>>> distribute a simulation between them and then reduce the results via the
>>> (+) operator:
>>>
>>> julia> addprocs(3)
>>> 3-element Array{Int64,1}:
>>>  2
>>>  3
>>>  4
>>>
>>> julia> nheads = @parallel (+) for i=1:200000000
>>>          Int(rand(Bool))
>>>        end
>>> 100008845
>>>
>>> You can find the full example and a lot of other fun in the official
>>> documentation on parallel computing:
>>>
>>> http://julia.readthedocs.org/en/latest/manual/parallel-computing/
>>>
>>> Note, though, that it's not real (i.e. Hadoop/Spark-like) map-reduce,
>>> since the original idea of MR concerns distributed systems and
>>> data-local computations, while here we do everything on the same
>>> machine. If you are looking for a big data solution, search this forum
>>> for some (dead or alive) projects for it.
>>>
>>>
>>>
>>> On Monday, October 5, 2015 at 11:52:21 PM UTC+3, cheng wang wrote:
>>>>
>>>> Hello everyone,
>>>>
>>>> I am a Julia newbie. I have been thrilled by Julia recently. It's an
>>>> amazing language!
>>>>
>>>> I notice that Julia currently does not have good support for
>>>> multi-threaded programming.
>>>> So I am thinking that a Spark-like MapReduce parallel model + multiple
>>>> processes may be enough.
>>>> It is easy to keep thread-safe, and it could handle most vector-based
>>>> computation.
>>>>
>>>> This idea might be too naive. However, I am happy to hear your opinions.
>>>>
>>>> Thanks in advance,
>>>> Cheng
>>>>
>>>
>>
