Our software doesn't use MapReduce. It is a pure YARN application that is
essentially a peer of MapReduce. There are several reasons for this decision,
but the main one is that we have a large code base that already executes data 
transformations in a single-server environment, and we wanted to produce a 
product without rewriting huge swaths of code. Given that, our software takes 
care of many things usually delegated to MapReduce, including distributed 
sort/partition (i.e., "the shuffle"). However, MapReduce has a special place in
the ecosystem, in that it creates an auxiliary service to handle the
distribution of shuffle data to reducers. Third-party applications do not appear
to have an easy way to install auxiliary services. The JARs for any such service must
be in Hadoop's classpath on all nodes at startup, creating both a management 
issue and a trust/security issue. Currently our software places temporary data 
into HDFS for this purpose, but we've found that HDFS imposes significant
overhead in both performance and file handles, even at low replication. We
would like to replace the use of HDFS with a lighter-weight service to manage
temp files and distribute their data.
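
For context, this is roughly how the MapReduce shuffle is registered as an
auxiliary service in yarn-site.xml on each NodeManager. It is a sketch of the
stock Hadoop 2.x configuration, not of our own setup; a third-party service
would need an equivalent entry plus its JARs on the NodeManager classpath,
which is the management burden described above.

```xml
<!-- yarn-site.xml fragment: stock MapReduce shuffle registration (Hadoop 2.x) -->
<property>
  <!-- Comma-separated names of the aux services the NodeManager loads at startup -->
  <name>yarn.nodemanager.aux-services</name>
  <value>mapreduce_shuffle</value>
</property>
<property>
  <!-- Implementation class; it must already be on the NodeManager classpath -->
  <name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>
  <value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>
```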

Is the Slider project something that can address our needs?

John Lilley
