Hey Mike,

Thank you very much for the suggestion; this worked perfectly. I'll definitely 
be stealing this code for future reference.

Kind regards,
Alan.
________________________________
From: Mike Smith <grimbo...@gmail.com>
Sent: 09 March 2021 11:14
To: Murphy, Alan E <a.mur...@imperial.ac.uk>
Cc: bioc-devel@r-project.org <bioc-devel@r-project.org>
Subject: Re: [Bioc-devel] Removal of large items in git history - BiocCheck warning

I've used something like this approach in the past.  All the normal caveats 
about making sure you've got a backup apply.
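
For the backup, a minimal sketch is a mirror clone taken before any history rewriting. The helper name `backup_repo` and both paths are hypothetical, not part of the original suggestion:

```shell
# Hypothetical helper: copy a repository, including every ref, to a backup
# location before rewriting its history.
# Usage: backup_repo <path-to-repo> <backup-dir>
backup_repo() {
  br_source=$1
  br_dest=$2
  # --mirror copies all refs (branches, tags, notes), not just the current branch.
  git clone --quiet --mirror "$br_source" "$br_dest"
}
```

For example, `backup_repo . ../EWCE-backup.git` run from the repository root would leave a bare copy next to it.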

Find the names of the largest objects in the pack file (the output is not 
necessarily in size order).  In this case they're almost all .rda files.

git rev-list --objects --all | grep -f <(git verify-pack -v .git/objects/pack/*.idx | sort -k 3 -n | cut -f 1 -d " " | tail -15)

e63fb55738f4d6643939863ec7799776d4b161c5 EWCE.html
f67b528ec5e029fbeb45c2ff90d619de0d7ae4c0 articles/EWCE.html
b871cbacac1c1fe1b372a8eca9f7c68122fc4bf4 articles/EWCE_files/figure-html/unnamed-chunk-21-1.png
ae0e4cda88322aaff0b064136c84096d16dc219f reference/ewce.plot-1.png
8946eeb7255c328676a61da71276a29002e34d1f data/all_hgnc.rda
60814dfe9cbf3cb77b846a9fc0270bc7cc00d50c data/all_hgnc_wtEnsembl.rda
d152a56e7290abb06eab1112910a499145dbd3e1 data/all_mgi.rda
7075962fb2ccc78b826c7fc1823d0e3d5e5d7b01 data/all_mgi_wtEnsembl.rda
5d7d0a395c104ad39f105ad85c7a84663e0e6002 data/ensembl_transcript_lengths_GCcontent.rda
100a7fa8df12deb1803a437b442c0897811916df data/mgi_synonym_data.rda
f890d2bbd63b7ecff94e4917b6b7188399659221 data/mouse_to_human_homologs.rda
fddddd7022bc96d24d75cf71d65c097d84bade88 data/tt_alzh.rda
98aba69ade5c09a2100248c963bb5397860ae089 data/tt_alzh_BA36.rda
0f006997c7a45a5647dd5ce21be650d6c197ea29 data/tt_alzh_BA44.rda
67b2d63f55531f85ece47e298213fd25cacdaa01 data/cortex_mrna.rda
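
If the sizes are wanted alongside the names, a variant along the following lines may help. It is only a sketch: `largest_blobs` is a hypothetical helper name, and the sizes it reports are uncompressed object sizes rather than the space each object occupies inside the pack.

```shell
# Hypothetical helper: print the N largest blobs reachable from any ref as
# "<size-in-bytes> <sha> <path>", smallest first (default N = 15).
largest_blobs() {
  lb_count=${1:-15}
  git rev-list --objects --all |
    git cat-file --batch-check='%(objecttype) %(objectsize) %(objectname) %(rest)' |
    awk '$1 == "blob"' |   # keep blobs, drop commits and trees
    cut -d ' ' -f 2- |     # drop the object type column
    sort -n |              # sort numerically by size
    tail -n "$lb_count"
}
```

Running `largest_blobs 15` from the repository root gives a list comparable to the one above, but sorted by size.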

Remove files with the .rda extension from every commit.  I guess you should be 
more careful here if there are rda files you want to retain, but I don't see 
any in the main branch on GitHub.  I get a pretty scary looking warning from 
git, but it seems to have worked out ok for me in the past.

git filter-branch --index-filter 'git rm --cached --ignore-unmatch *.rda' -- --all
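
To be more careful about .rda files that should be kept, one option is to list what the current checkout still tracks before running the filter. This is a sketch only; `tracked_matching` is a hypothetical helper name.

```shell
# Hypothetical helper: list tracked files in the current index that match a
# pathspec. An empty result suggests nothing on the current branch would be lost.
tracked_matching() {
  git ls-files -- "$1"
}
```

For example, `tracked_matching '*.rda'` prints nothing when no .rda files are tracked on the checked-out branch. Note that it does not inspect other branches.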

Apply the removal to the repo.

rm -Rf .git/refs/original
rm -Rf .git/logs/
git gc --aggressive --prune=now
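
The same cleanup can probably be done with git's own ref and reflog commands instead of deleting files under .git by hand. A sketch, with `purge_rewritten_history` as a hypothetical helper name:

```shell
# Hypothetical helper: drop the backups left by filter-branch and prune the
# now-unreachable objects, without touching .git directly.
purge_rewritten_history() {
  # Delete every ref under refs/original (the pre-rewrite backups).
  git for-each-ref --format='delete %(refname)' refs/original |
    git update-ref --stdin
  # Expire all reflog entries so the old commits are no longer referenced.
  git reflog expire --expire=now --all
  # Repack and remove unreachable objects immediately.
  git gc --quiet --aggressive --prune=now
}
```

As with the commands above, this is irreversible, so it is only worth running once a backup exists.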

Check the new size of the pack folder.

du -h .git/objects/pack
3,9M .git/objects/pack
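
As an alternative to du, git can report the pack size itself. A small sketch, where `pack_size` is a hypothetical helper name:

```shell
# Hypothetical helper: print the total size of all pack files as reported by git.
pack_size() {
  # -v prints detailed counters; -H prints sizes in human-readable units.
  git count-objects -vH | awk -F': ' '$1 == "size-pack" { print $2 }'
}
```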

You could probably apply this approach to remove the large .html files too, but 
it looks like they're part of the pkgdown site for your package so I imagine 
you want to keep them.

Mike



On Tue, 9 Mar 2021 at 10:09, Murphy, Alan E <a.mur...@imperial.ac.uk> wrote:
Hi both,

Thank you for your suggestions. Yes, I am still having problems with the size 
of my git history in the EWCE package. To clarify, I have already tried the BFG 
cleaner to no avail even when I set the max limit to 1 MB (see my first email 
for details).

The issue is that a .git/objects/pack/ file is still greater than the allotted 
5MB; it appears to be 8.9MB in size. As mentioned, I have used the BFG cleaner 
and yet the file remains too large. If anyone has suggestions on how else I 
could reduce this size, that would be great.

@Nitesh Turaga <nturaga.b...@gmail.com> how would I go about checking (and 
removing?) hidden files from the .git/objects/pack history?

Kind regards,
Alan.
________________________________
From: stefano <mangiolastef...@gmail.com>
Sent: 08 March 2021 22:18
To: Nitesh Turaga <nturaga.b...@gmail.com>
Cc: Murphy, Alan E <a.mur...@imperial.ac.uk>; bioc-devel@r-project.org <bioc-devel@r-project.org>
Subject: Re: [Bioc-devel] Removal of large items in git history - BiocCheck warning



Hello,

You can use bfg-repo-cleaner.

Have a read of this document, in the section "eliminate big files from repo":

https://docs.google.com/document/d/1jxg7KCMQq3kiCcvodQk9JgtU51LqczOwLit1gHiTP4Q/edit?usp=sharing


Best wishes.

Stefano



Stefano Mangiola | Postdoctoral fellow

Papenfuss Laboratory

The Walter and Eliza Hall Institute of Medical Research

+61 (0)466452544


On Tue, 9 Mar 2021 at 09:11, Nitesh Turaga <nturaga.b...@gmail.com> wrote:
Hi Alan,

Did you manage to solve this?

There seem to be objects in your git repo which are bigger than the size 
allowed by Bioconductor for a software package. Please check hidden files 
as well.

One test you can do is to clone your package from GitHub and see how many MB 
are downloaded to this new location. This is a good test of whether anything 
is still larger than the limit.
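
The clone test can be sketched as follows. `clone_size` is a hypothetical helper name, and the reported figure is the on-disk size of the fresh clone's .git directory rather than the exact number of bytes transferred.

```shell
# Hypothetical helper: clone a repository into a scratch directory, print the
# size of its .git directory, then clean up.
# Usage: clone_size <repository-url>
clone_size() {
  cs_url=$1
  cs_dir=$(mktemp -d)
  git clone --quiet "$cs_url" "$cs_dir/clone"
  # du prints "<size><TAB><path>"; keep only the size column.
  du -sh "$cs_dir/clone/.git" | cut -f 1
  rm -rf "$cs_dir"
}
```

For example: `clone_size https://github.com/NathanSkene/EWCE`.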

Best,

Nitesh

On 3/4/21, 11:19 AM, "Bioc-devel on behalf of Murphy, Alan E" 
<bioc-devel-boun...@r-project.org on behalf of a.mur...@imperial.ac.uk> wrote:

    Hi all,

    I am working on the development of EWCE <https://github.com/NathanSkene/EWCE>
    for submission to Bioconductor. I have removed some large objects from the
    package and moved them to a separate ExperimentHub package; however, after
    their removal, I got a BiocCheck large file warning.

    To deal with the data stored in git history, I followed the instructions to
    use the BFG cleaner with the max size set to 5MB. This appeared to work and
    some things were removed, yet I still get the warning below:

    $warning[1] "The following files are over 5MB in size: '.git/objects/pack/pack-366a7ab7a2ba4e656f3a9f3f1408be7ab9f41303.pack'"

    If I try to rerun the BFG cleaner I get the following output:


    Warning : no large blobs matching criteria found in packfiles - does the repo need to be packed?

    I have tried two different methods of using the BFG cleaner, one from
    BFG <https://rtyley.github.io/bfg-repo-cleaner/> themselves and one from
    Bioconductor <https://bioconductor.org/developers/how-to/git/remove-large-data/>.
    I have also completed all steps in both, including the prune step:


    git reflog expire --expire=now --all && git gc --prune=now --aggressive

    I have even tried reducing the max from 5MB to 1MB but still nothing seems
    to be left even at that size. Does anyone know of another way to sort this
    issue or have any clue what I may be doing wrong?

    Kind regards,
    Alan.

    Alan Murphy
    Bioinformatician
    Neurogenomics lab
    UK Dementia Research Institute
    Imperial College London

        [[alternative HTML version deleted]]

    _______________________________________________
    Bioc-devel@r-project.org mailing list
    https://stat.ethz.ch/mailman/listinfo/bioc-devel