[ https://issues.apache.org/jira/browse/HADOOP-11656?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14349332#comment-14349332 ]
Colin Patrick McCabe edited comment on HADOOP-11656 at 3/5/15 7:32 PM:
-----------------------------------------------------------------------
bq. Andrew wrote: One additional note related to this, we can spend a lot of
time right now distributing 100s of MBs of jar dependencies when launching a
YARN job. Maybe this is ameliorated by the new shared distributed cache, but
I've heard this come up quite a bit as a complaint. If we could meaningfully
slim down our client, it could lead to a nice win.
I'm frustrated that nobody responded to my earlier suggestion that we
de-duplicate jars. This would drastically reduce the size of our install
without rearchitecting anything.
In fact, I was so frustrated that I decided to write a program to do it myself
and measure the delta. Here are the results:
Before:
{code}
du -h /h
249M /h
{code}
After:
{code}
du -h /h
140M /h
{code}
That is a reduction of about 44%. Deduplicating jars seems like a much better
project than splitting into a client jar, if we really care about install size.
And here is the de-duplicator program I wrote (in Go):
{code}
package main

import (
	"flag"
	"fmt"
	"os"
	"path/filepath"
)

// basePathToFullPath maps a file's base name to every path where it was found.
var basePathToFullPath = map[string][]string{}

// visit records each regular file under its base name.
func visit(path string, f os.FileInfo, err error) error {
	if err != nil {
		return err
	}
	// Skip directories and symlinks, so that a second run does not touch
	// the links created by the first.
	if !f.Mode().IsRegular() {
		return nil
	}
	base := filepath.Base(path)
	basePathToFullPath[base] = append(basePathToFullPath[base], path)
	fmt.Printf("%s -> %s\n", base, path)
	return nil
}

func main() {
	flag.Parse()
	if flag.NArg() < 1 {
		fmt.Printf("Usage: %s [path]\n", os.Args[0])
		os.Exit(1)
	}
	root := flag.Arg(0)
	err := filepath.Walk(root, visit)
	if err != nil {
		fmt.Printf("Error while traversing %s: %s\n", root, err.Error())
		os.Exit(1)
	}
	// Files are matched by name only; their contents are not compared.
	for basePath, fullPaths := range basePathToFullPath {
		if len(fullPaths) <= 1 {
			continue
		}
		// Keep the first copy and point every other copy at it.
		absPath, err := filepath.Abs(fullPaths[0])
		if err != nil {
			fmt.Printf("failed to find abspath of %s: %s\n",
				fullPaths[0], err.Error())
			os.Exit(1)
		}
		fmt.Printf("Handling %s\n", basePath)
		for i := 1; i < len(fullPaths); i++ {
			fmt.Printf("rm %s\n", fullPaths[i])
			err = os.Remove(fullPaths[i])
			if err != nil {
				panic(err)
			}
			fmt.Printf("ln -s %s %s\n", absPath, fullPaths[i])
			err = os.Symlink(absPath, fullPaths[i])
			if err != nil {
				panic(err)
			}
		}
	}
}
{code}
I made these measurements against trunk (branch 3.0.0).
> Classpath isolation for downstream clients
> ------------------------------------------
>
> Key: HADOOP-11656
> URL: https://issues.apache.org/jira/browse/HADOOP-11656
> Project: Hadoop Common
> Issue Type: New Feature
> Reporter: Sean Busbey
> Assignee: Sean Busbey
> Labels: classloading, classpath, dependencies, scripts, shell
>
> Currently, Hadoop exposes downstream clients to a variety of third-party
> libraries. As our code base grows and matures, we increase the set of
> libraries we rely on. At the same time, as our user base grows, we increase
> the likelihood that some downstream project will run into a conflict while
> attempting to use a different version of some library we depend on. This has
> already happened with, e.g., Guava several times for HBase, Accumulo, and Spark
> (and I'm sure others).
> While YARN-286 and MAPREDUCE-1700 provided an initial effort, they default to
> off and they don't do anything to help dependency conflicts on the driver
> side or for folks talking to HDFS directly. This should serve as an umbrella
> for changes needed to do things thoroughly on the next major version.
> We should ensure that downstream clients
> 1) can depend on a client artifact for each of HDFS, YARN, and MapReduce that
> doesn't pull in any third-party dependencies
> 2) only see our public API classes (or as close to this as feasible) when
> executing user-provided code, whether client side in a launcher/driver or on
> the cluster in a container or within MR.
> This provides us with a double benefit: users get less grief when they want
> to run substantially ahead of or behind the versions we need, and the project
> is freer to change our own dependency versions because they'll no longer be
> part of our compatibility promises.
> Project-specific task JIRAs to follow after I get some justifying use cases
> written in the comments.