[ 
https://issues.apache.org/jira/browse/MESOS-8162?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16235521#comment-16235521
 ] 

Michael Park edited comment on MESOS-8162 at 11/2/17 10:48 AM:
---------------------------------------------------------------

I experimented with cleaning these old binary files out of the repository 
with [BFG|https://rtyley.github.io/bfg-repo-cleaner/].
Documenting some of the steps I took and the results I found:

[mpark/mesos|https://github.com/mpark/mesos] is a fork of 
[apache/mesos|https://github.com/apache/mesos], and cloning the 
{{apache/mesos}} repository
via {{git clone g...@github.com:apache/mesos.git}} results in a *401M* 
directory.

I ran the following commands:
{code}
git clone --mirror g...@github.com:mpark/mesos.git mesos-strip
bfg -b 5M -p master,1.0.0,1.0.1,1.0.2,1.0.3,1.0.4,1.1.0,1.1.1,1.1.2,1.2.0,1.2.1,1.2.2,1.2.x,1.3.0,1.3.1,1.3.x,1.4.0,1.4.x
git push
{code}
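
To see ahead of time which blobs a 5M threshold would catch, something like the following could be used. This is only a sketch: {{blobs_over}} is a hypothetical helper name, the sample object ids and sizes are made up, and treating "5M" as 5 * 1024 * 1024 bytes is my assumption about how BFG reads the suffix.

```shell
#!/bin/sh
# Hypothetical helper: keep only blobs larger than a byte threshold.
# Input lines look like: "<type> <object id> <size in bytes> <path>"
# (paths containing spaces are truncated to their first word).
blobs_over() {
  awk -v limit="$1" '$1 == "blob" && $3 > limit {
    printf "%.1f MB  %s  %s\n", $3 / 1048576, substr($2, 1, 8), $4
  }'
}

# Inside a clone, list every blob over the threshold anywhere in history:
#   git rev-list --objects --all |
#     git cat-file --batch-check='%(objecttype) %(objectname) %(objectsize) %(rest)' |
#     blobs_over $((5 * 1024 * 1024)) | sort -rn

# Canned input so the filter can be tried without a repository
# (object ids and sizes below are made up):
printf '%s\n' \
  'blob 1111111111111111111111111111111111111111 10485760 3rdparty/big.tar.gz' \
  'blob 2222222222222222222222222222222222222222 1024 README.md' \
  'tree 3333333333333333333333333333333333333333 20971520 3rdparty' |
  blobs_over $((5 * 1024 * 1024))
```

Unlike the {{bfg}} run above, this listing does not account for protected refs; it only shows candidates.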

I then cloned the {{mpark/mesos}} repository again via 
{{git clone g...@github.com:mpark/mesos.git}},
and this results in a *243M* directory.
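
For what it's worth, the arithmetic on those two numbers can be sketched as below. {{reduction}} is a hypothetical helper, and the directory names in the commented usage are assumptions rather than paths I actually used.

```shell
#!/bin/sh
# Hypothetical helper: report how much smaller one clone is than another.
# Both arguments are sizes in the same unit (e.g. megabytes).
reduction() {
  awk -v before="$1" -v after="$2" 'BEGIN {
    printf "%d saved (%.1f%%)\n", before - after, 100 * (before - after) / before
  }'
}

# Figures from the clones described above: 401M before, 243M after.
reduction 401 243   # prints "158 saved (39.4%)"

# With both clones on disk (directory names are assumptions):
#   reduction "$(du -sm mesos-before | cut -f1)" "$(du -sm mesos-after | cut -f1)"
```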

I think the biggest risk is that, since we're rewriting history, virtually all 
of the commits get a new commit id.
I'm not sure exactly what problems we would run into, but it feels 
disruptive. On the other hand,
the new commit message contains the old commit id, so it may not be 
much of a problem.
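
A rough sketch of how the old id could be recovered from a rewritten commit follows. It assumes BFG's default behavior of appending a {{Former-commit-id:}} footer to each rewritten commit message; {{former_commit_id}} is a hypothetical helper, and the sample message and hash are placeholders.

```shell
#!/bin/sh
# Hypothetical helper: read a commit message on stdin and print the old
# commit id, assuming it is recorded as a "Former-commit-id:" footer line.
former_commit_id() {
  sed -n 's/^Former-commit-id: *//p'
}

# Placeholder commit message (the hash is made up):
printf '%s\n' \
  'Fixed a typo in the docs.' \
  '' \
  'Former-commit-id: 0123456789abcdef0123456789abcdef01234567' |
  former_commit_id

# In a rewritten clone:
#   git log -1 --format=%B <new-id> | former_commit_id
# and, going the other way, to find the new commit for an old id:
#   git log --all --grep='Former-commit-id: <old-id>'
```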

Among other things, the {{bfg}} command above reports:

{noformat}
Deleted files
-------------

        Filename                   Git id
        -----------------------------------------------------------------
        boost-1.51.0.tar.gz      | e461b8a4 (6.9 MB)
        grpc-1.4.2.tar.gz        | f4dfe636 (6.1 MB)
        hadoop-0.20.205.0.tar.gz | bc605a36 (93.5 MB)
        protobuf-3.2.0.tar.gz    | 3a212180 (6.5 MB), 6e9bfbfa (6.5 MB)
        protobuf-3.3.0.tar.gz    | 98fbec86 (6.7 MB)
        uming.ttc                | 2042560c (20.1 MB), 72dca440 (20.1 MB)
        zookeeper-3.3.1.tar.gz   | c67deed3 (9.5 MB)
        zookeeper-3.3.4.tar.gz   | 09d49240 (12.9 MB)
        zookeeper-3.3.6.tar.gz   | 5588107a (11.3 MB)
        zookeeper-3.4.5.tar.gz   | 1a547fe1 (15.6 MB)
        zookeeper-3.4.8.tar.gz   | a23d68be (21.2 MB)

In total, 34339 object ids were changed.
{noformat}
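
Adding up the sizes in that report is straightforward; a minimal sketch is below. {{total_deleted_mb}} is a hypothetical helper. Note that the result is the sum of the blob sizes as reported by {{bfg}}, which is not necessarily the same as the space saved on disk in a clone.

```shell
#!/bin/sh
# The "Deleted files" rows from the report above.
report='boost-1.51.0.tar.gz      | e461b8a4 (6.9 MB)
grpc-1.4.2.tar.gz        | f4dfe636 (6.1 MB)
hadoop-0.20.205.0.tar.gz | bc605a36 (93.5 MB)
protobuf-3.2.0.tar.gz    | 3a212180 (6.5 MB), 6e9bfbfa (6.5 MB)
protobuf-3.3.0.tar.gz    | 98fbec86 (6.7 MB)
uming.ttc                | 2042560c (20.1 MB), 72dca440 (20.1 MB)
zookeeper-3.3.1.tar.gz   | c67deed3 (9.5 MB)
zookeeper-3.3.4.tar.gz   | 09d49240 (12.9 MB)
zookeeper-3.3.6.tar.gz   | 5588107a (11.3 MB)
zookeeper-3.4.5.tar.gz   | 1a547fe1 (15.6 MB)
zookeeper-3.4.8.tar.gz   | a23d68be (21.2 MB)'

# Hypothetical helper: pull out every "(N MB)" figure on stdin and sum them.
total_deleted_mb() {
  grep -o '([0-9.]* MB)' | tr -d '()' |
    awk '{ sum += $1 } END { printf "%.1f\n", sum }'
}

printf '%s\n' "$report" | total_deleted_mb   # prints 236.9
```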


> Binary data causes bloat in the git repository
> ----------------------------------------------
>
>                 Key: MESOS-8162
>                 URL: https://issues.apache.org/jira/browse/MESOS-8162
>             Project: Mesos
>          Issue Type: Bug
>            Reporter: Michael Park
>
> Since Git doesn't handle binary files all that well, the way in which
> the {{3rdparty}} directory is managed continues to bloat the size of our 
> repository.
> There is a ~100M Hadoop tarball from a long time ago that's still stored, a few
> older versions of ZooKeeper at ~20M each, etc.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
