On 27 June 2014 16:40, John Lilley <[email protected]> wrote:

> Our software doesn't use MapReduce. It is a pure YARN application that is
> basically a peer to MapReduce. There are a lot of reasons for this
> decision, but the main one is that we have a large code base that already
> executes data transformations in a single-server environment, and we wanted
> to produce a product without rewriting huge swaths of code.
I understand the reasons for that. Using YARN gives you a lot more control. Have you seen the slides I've written on this topic?
https://speakerdeck.com/stevel/secrets-of-yarn-application-development

> Given that, our software takes care of many things usually delegated to
> MapReduce, including distributed sort/partition (i.e. "the shuffle").
> However, MapReduce has a special place in the ecosystem, in that it creates
> an auxiliary service to handle the distribution of shuffle data to
> reducers. It doesn't look like third-party apps have an easy time
> installing aux services. The JARs for any such service must be in Hadoop's
> classpath on all nodes at startup, creating both a management issue and a
> trust/security issue. Currently our software places temporary data into
> HDFS for this purpose, but we've found that HDFS has a huge overhead in
> terms of performance and file handles, even at low replication. We desire
> to replace the use of HDFS with a lighter-weight service to manage temp
> files and distribute their data.
>
> Is the slider project something that can address our needs?
>
> John Lilley

Not directly, no. We keep all our JARs and the application packages to install in HDFS. We then rely on HDFS to replicate the content, and on YARN to pull down the binaries and unzip any archives. We can afford the download penalty because we expect Slider-deployed services to be present for an extended period of time. Keeping everything out of the cluster machines simplifies versioning and maintenance: anyone can run different versions of an application side by side. As you note, MapReduce has a special place, but if you look at Hive and Tez, I think they upload everything.
If you look at Vinod's deep dive on resource localization, it describes what goes on, including what caching takes place:
http://hortonworks.com/blog/resource-localization-in-yarn-deep-dive/

From that (and this is a guess, I haven't played with it), if you mark artifacts as PRIVATE rather than APPLICATION, it seems they can be cached across instances of an application deployed by the same user; if you mark them as PUBLIC, they can be shared by all users. There are some parameters in yarn-site.xml to tune the amount of caching.

There's also HDFS centralized cache management, which is targeted at data rather than code:
http://hadoop.apache.org/docs/r2.3.0/hadoop-project-dist/hadoop-hdfs/CentralizedCacheManagement.html

Finally, if you really want to bypass HDFS, note that you can give other paths to the localizer, as long as they are compatible filesystems, e.g. s3n:// or swift://.
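For reference, the yarn-site.xml tuning I mentioned looks roughly like this. The property names are from the Hadoop 2.x NodeManager configuration; the values here are illustrative, not recommendations:

```xml
<!-- Illustrative values; the 2.x defaults are 10240 MB and 600000 ms. -->
<property>
  <name>yarn.nodemanager.localizer.cache.target-size-mb</name>
  <value>20480</value>
</property>
<property>
  <name>yarn.nodemanager.localizer.cache.cleanup.interval-ms</name>
  <value>600000</value>
</property>
```

The target size is per NodeManager and only bounds resources not in use by a running container; cleanup happens on the given interval.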
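The HDFS cache management in that second link is driven from the `hdfs cacheadmin` command line; a quick sketch, where the pool and path names are hypothetical and a running HDFS with caching enabled is assumed:

```
# Create a cache pool, then pin a directory of data into it
hdfs cacheadmin -addPool appData -owner myuser
hdfs cacheadmin -addDirective -path /data/shared/lookup-tables -pool appData
hdfs cacheadmin -listDirectives
```

Again: this pins file data into DataNode memory, so it helps with data you read repeatedly, not with localizing application binaries.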
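In code, the visibility is chosen when you build the LocalResource entries for a container launch context. A minimal sketch against the Hadoop 2.x YARN records API; the HDFS path and link name are hypothetical, and this won't compile without the Hadoop jars on the classpath:

```java
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.yarn.api.records.LocalResource;
import org.apache.hadoop.yarn.api.records.LocalResourceType;
import org.apache.hadoop.yarn.api.records.LocalResourceVisibility;
import org.apache.hadoop.yarn.util.ConverterUtils;

public class LocalizeJar {
  public static Map<String, LocalResource> localResources(Configuration conf)
      throws Exception {
    // Hypothetical path; could equally be an s3n:// or swift:// URL
    // if that filesystem is configured, bypassing HDFS entirely.
    Path jar = new Path("hdfs:///apps/myapp/lib/myapp.jar");
    FileStatus stat = jar.getFileSystem(conf).getFileStatus(jar);

    LocalResource res = LocalResource.newInstance(
        ConverterUtils.getYarnUrlFromPath(jar),
        LocalResourceType.FILE,
        // PUBLIC: cacheable and shared by all users on the node;
        // PRIVATE: cached per user; APPLICATION: per application instance.
        LocalResourceVisibility.PUBLIC,
        stat.getLen(),
        stat.getModificationTime());

    Map<String, LocalResource> resources = new HashMap<String, LocalResource>();
    resources.put("myapp.jar", res); // link name in the container's working dir
    return resources;
  }
}
```

The map goes into `ContainerLaunchContext`, and the NodeManager does the download/caching according to the visibility you set.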
