[ https://issues.apache.org/jira/browse/HADOOP-11656?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14349332#comment-14349332 ]

Colin Patrick McCabe edited comment on HADOOP-11656 at 3/5/15 7:32 PM:
-----------------------------------------------------------------------

bq. Andrew wrote: One additional note related to this, we can spend a lot of 
time right now distributing 100s of MBs of jar dependencies when launching a 
YARN job. Maybe this is ameliorated by the new shared distributed cache, but 
I've heard this come up quite a bit as a complaint. If we could meaningfully 
slim down our client, it could lead to a nice win.

I'm frustrated that nobody responded to my earlier suggestion that we 
de-duplicate jars.  It would drastically reduce the size of our install 
without rearchitecting anything.

In fact, I was so frustrated that I decided to write a program to do it myself 
and measure the delta.  Here are the results:

Before:
{code}
du -h /h
249M    /h
{code}

After:
{code}
du -h /h
140M    /h
{code}

Seems like deduplicating jars would be a much better project than splitting 
into a client jar, if we really cared about this.

And here is the de-duplicator program I wrote (in Go):
{code}
package main

import (
        "flag"
        "fmt"
        "os"
        "path/filepath"
)

// basePathToFullPath maps a file's base name to every full path at which it
// appears under the root.
var basePathToFullPath = map[string][]string{}

// visit records each regular file under its base name.
func visit(path string, f os.FileInfo, err error) error {
        if err != nil {
                panic(err)
        }
        if f.IsDir() {
                return nil
        }
        base := filepath.Base(path)
        basePathToFullPath[base] = append(basePathToFullPath[base], path)
        fmt.Printf("%s -> %s\n", base, path)
        return nil
}

func main() {
        flag.Parse()
        if flag.NArg() < 1 {
                fmt.Printf("Usage: %s [path]\n", os.Args[0])
                os.Exit(1)
        }
        root := flag.Arg(0)
        if err := filepath.Walk(root, visit); err != nil {
                fmt.Printf("Error while traversing %s: %s\n", root, err.Error())
                os.Exit(1)
        }
        for base, fullPaths := range basePathToFullPath {
                if len(fullPaths) <= 1 {
                        continue
                }
                // Keep the first copy and symlink the rest to it.  (This
                // assumes files sharing a base name have identical contents.)
                absPath, err := filepath.Abs(fullPaths[0])
                if err != nil {
                        fmt.Printf("failed to find abspath of %s: %s\n", fullPaths[0], err.Error())
                        os.Exit(1)
                }
                fmt.Printf("Handling %s\n", base)
                for _, dup := range fullPaths[1:] {
                        fmt.Printf("rm %s\n", dup)
                        if err := os.Remove(dup); err != nil {
                                panic(err)
                        }
                        fmt.Printf("ln -s %s %s\n", absPath, dup)
                        if err := os.Symlink(absPath, dup); err != nil {
                                panic(err)
                        }
                }
        }
}
{code}
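One caveat: the program treats any two files with the same base name as duplicates. Before symlinking, it would be safer to confirm the contents actually match. A minimal sketch of that check (the helper names here are hypothetical, not part of the program above):

{code}
package main

import (
        "crypto/sha256"
        "fmt"
        "io"
        "os"
        "path/filepath"
)

// fileDigest returns the hex SHA-256 of a file's contents.
func fileDigest(path string) (string, error) {
        f, err := os.Open(path)
        if err != nil {
                return "", err
        }
        defer f.Close()
        h := sha256.New()
        if _, err := io.Copy(h, f); err != nil {
                return "", err
        }
        return fmt.Sprintf("%x", h.Sum(nil)), nil
}

// sameContent reports whether two files have identical bytes, i.e.
// whether replacing one with a symlink to the other is safe.
func sameContent(a, b string) (bool, error) {
        da, err := fileDigest(a)
        if err != nil {
                return false, err
        }
        db, err := fileDigest(b)
        if err != nil {
                return false, err
        }
        return da == db, nil
}

func main() {
        // Demonstrate on two same-named files with identical contents.
        dir, err := os.MkdirTemp("", "dedupe")
        if err != nil {
                panic(err)
        }
        defer os.RemoveAll(dir)
        a := filepath.Join(dir, "a", "guava.jar")
        b := filepath.Join(dir, "b", "guava.jar")
        for _, p := range []string{a, b} {
                if err := os.MkdirAll(filepath.Dir(p), 0o755); err != nil {
                        panic(err)
                }
                if err := os.WriteFile(p, []byte("identical jar bytes"), 0o644); err != nil {
                        panic(err)
                }
        }
        ok, err := sameContent(a, b)
        if err != nil {
                panic(err)
        }
        fmt.Println("safe to link:", ok)
}
{code}

Wiring this into the de-duplicator would mean skipping (or warning about) same-named files whose hashes differ, rather than blindly linking them.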

The measurements were made against trunk (branch 3.0.0).



> Classpath isolation for downstream clients
> ------------------------------------------
>
>                 Key: HADOOP-11656
>                 URL: https://issues.apache.org/jira/browse/HADOOP-11656
>             Project: Hadoop Common
>          Issue Type: New Feature
>            Reporter: Sean Busbey
>            Assignee: Sean Busbey
>              Labels: classloading, classpath, dependencies, scripts, shell
>
> Currently, Hadoop exposes downstream clients to a variety of third party 
> libraries. As our code base grows and matures we increase the set of 
> libraries we rely on. At the same time, as our user base grows we increase 
> the likelihood that some downstream project will run into a conflict while 
> attempting to use a different version of some library we depend on. This has 
> already happened with e.g. Guava several times for HBase, Accumulo, and Spark 
> (and I'm sure others).
> While YARN-286 and MAPREDUCE-1700 provided an initial effort, they default to 
> off and they don't do anything to help dependency conflicts on the driver 
> side or for folks talking to HDFS directly. This should serve as an umbrella 
> for changes needed to do things thoroughly on the next major version.
> We should ensure that downstream clients
> 1) can depend on a client artifact for each of HDFS, YARN, and MapReduce that 
> doesn't pull in any third party dependencies
> 2) only see our public API classes (or as close to this as feasible) when 
> executing user provided code, whether client side in a launcher/driver or on 
> the cluster in a container or within MR.
> This provides us with a double benefit: users get less grief when they want 
> to run substantially ahead or behind the versions we need and the project is 
> freer to change our own dependency versions because they'll no longer be in 
> our compatibility promises.
> Project specific task jiras to follow after I get some justifying use cases 
> written in the comments.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
