On 27 June 2014 16:40, John Lilley <[email protected]> wrote:

> Our software doesn't use MapReduce. It is a pure YARN application that is
> basically a peer to MapReduce. There are a lot of reasons for this
> decision, but the main one is that we have a large code base that already
> executes data transformations in a single-server environment, and we wanted
> to produce a product without rewriting huge swaths of code.


I understand the reasons for that. Using YARN gives you a lot more
control.

Have you seen the slides I've written on this topic?
https://speakerdeck.com/stevel/secrets-of-yarn-application-development


> Given that, our software takes care of many things usually delegated to
> MapReduce, including distributed sort/partition (i.e. "the shuffle").
> However, MapReduce has a special place in the ecosystem, in that it creates
> an auxiliary service to handle the distribution of shuffle data to
> reducers. It doesn't look like third-party apps have an easy time
> installing aux services. The JARs for any such service must be in Hadoop's
> classpath on all nodes at startup, creating both a management issue and a
> trust/security issue. Currently our software places temporary data into
> HDFS for this purpose, but we've found that HDFS has a huge overhead in
> terms of performance and file handles, even at low replication. We desire
> to replace the use of HDFS with a lighter-weight service to manage temp
> files and distribute their data.
>
>
> Is the slider project something that can address our needs?
>
> John Lilley
>
>
Not directly, no.

We keep all our JARs and application packages in HDFS. We then rely on
HDFS to replicate the content, and on YARN to pull down the binaries and
unzip any archives. We can afford the download penalty because we expect
slider-deployed services to be present for an extended period of time.

Keeping everything off the cluster machines simplifies versioning and
maintenance: anyone can run different versions of your application
side-by-side. As you note, MapReduce has a special place, but if you look
at Hive and Tez, I believe they upload everything too.

Vinod's deep dive on resource localization describes what goes on,
including what caching takes place:
http://hortonworks.com/blog/resource-localization-in-yarn-deep-dive/

From that -and this is a guess, I haven't played with it- it seems that if
you mark artifacts as PRIVATE rather than APPLICATION, they can be cached
across instances of applications deployed by the same user; if you mark
them as PUBLIC they can be shared by all users. There are some parameters
in yarn-site.xml to tune the amount of caching.
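For reference, the NodeManager cache tuning I have in mind looks something like this; the property names are from the Hadoop 2.x yarn-default.xml, and the values shown are the shipped defaults, so check against your version:

```xml
<!-- Upper bound on the NodeManager's local resource cache, in MB. -->
<property>
  <name>yarn.nodemanager.localizer.cache.target-size-mb</name>
  <value>10240</value>
</property>
<!-- How often the cache cleanup sweep runs, in milliseconds. -->
<property>
  <name>yarn.nodemanager.localizer.cache.cleanup.interval-ms</name>
  <value>600000</value>
</property>
```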


There's also HDFS cache management, which is targeted at data, not code:
http://hadoop.apache.org/docs/r2.3.0/hadoop-project-dist/hadoop-hdfs/CentralizedCacheManagement.html
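For completeness, pinning a working set into the HDFS cache is done with the `hdfs cacheadmin` CLI, roughly like this (the pool and path names here are invented for illustration):

```shell
# Create a cache pool, then ask HDFS to keep a directory's blocks in memory
hdfs cacheadmin -addPool temp-pool
hdfs cacheadmin -addDirective -path /user/yourapp/tmp -pool temp-pool
# Verify what is being cached
hdfs cacheadmin -listDirectives
```

Again, this caches block data on the datanodes; it doesn't change the file-handle or replication overheads you mentioned.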

Finally, if you really want to bypass HDFS, note that you can give the
localizer paths on other filesystems, as long as they are
Hadoop-compatible, e.g. s3n:// or swift://
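To make that concrete, here is a sketch of what registering a non-HDFS artifact for localization looks like from the application-master side, using the Hadoop 2.x records API. It's a sketch under my assumptions, not tested code; the class name and any path you feed it (e.g. an s3n:// URL) are invented:

```java
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.yarn.api.records.LocalResource;
import org.apache.hadoop.yarn.api.records.LocalResourceType;
import org.apache.hadoop.yarn.api.records.LocalResourceVisibility;
import org.apache.hadoop.yarn.conf.YarnConfiguration;
import org.apache.hadoop.yarn.util.ConverterUtils;

public class LocalizeFromAnywhere {

  /**
   * Build a LocalResource for an artifact on any Hadoop-compatible
   * filesystem -- hdfs://, s3n://, swift://, ... -- for inclusion in a
   * container launch context.
   */
  public static LocalResource resourceFor(Path path) throws Exception {
    FileSystem fs = path.getFileSystem(new YarnConfiguration());
    FileStatus status = fs.getFileStatus(path);
    // PRIVATE: cached on the node and shareable across apps run by the
    // same user. PUBLIC would share with all users; APPLICATION limits
    // the cached copy to this one application.
    return LocalResource.newInstance(
        ConverterUtils.getYarnUrlFromPath(path),
        LocalResourceType.ARCHIVE,           // unpacked on the node
        LocalResourceVisibility.PRIVATE,
        status.getLen(),
        status.getModificationTime());
  }
}
```

You'd then put the returned resource into the localResources map of the ContainerLaunchContext; the size and timestamp must match what the filesystem reports, or localization fails.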
