Here are the top ten files based upon size worth considering.

100644 blob efe7111e6ca3c84e9ba6cf7622f0271c03407255 99069136 dictionary lookup/resources/lookup/umls2011ab/umls.backup
100644 blob efe7111e6ca3c84e9ba6cf7622f0271c03407255 99069136 dictionary lookup/src/main/resources/lookup/umls2011ab/umls.backup
100644 blob 7360ce26c5d37cc81dae83be1593468a4917545c 139521759 clearparser-wrapper/resources/dependency/mayo-dep.jar
100644 blob 7360ce26c5d37cc81dae83be1593468a4917545c 139521759 dependency parser/resources/dependency/mayo-dep.jar
100644 blob 7360ce26c5d37cc81dae83be1593468a4917545c 139521759 dependency parser/src/main/resources/dependency/mayo-dep.jar
100644 blob d785a5bcf608372f273600fa524c3c786fbe76f4 238248287 ctakes-3.1.0/ctakes-dependency-parser-res/src/main/resources/org/apache/ctakes/dependency/parser/models/clearparser_models.jar
100644 blob d785a5bcf608372f273600fa524c3c786fbe76f4 238248287 ctakes-dependency-parser-res/src/main/resources/org/apache/ctakes/dependency/parser/models/clearparser_models.jar
100644 blob d785a5bcf608372f273600fa524c3c786fbe76f4 238248287 dependency parser/resources/clearparser_models.jar
100644 blob 89bee2d613aba238824bab97b74df87306483192 410610240 dictionary lookup/resources/lookup/umls2011ab/umls.data
100644 blob 89bee2d613aba238824bab97b74df87306483192 410610240 dictionary lookup/src/main/resources/lookup/umls2011ab/umls.data
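Several entries in that listing share the same blob hash, and git stores identical content only once, so the paths overstate how much space these files take in the object store. A minimal sketch of totalling the unique blobs follows; the function name unique_blob_bytes is made up for illustration, and it assumes input in the "mode type hash size path" layout shown above.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of git): sum the sizes of unique blobs
# found in `git ls-tree -r -l` style output read from stdin.
# Fields are: mode, type, object id, size, then the path, which may
# contain spaces and is therefore never used as a key here.
unique_blob_bytes() {
    awk '
        # Count each object id only the first time it appears; tree
        # entries (size "-") are skipped by the type check.
        $2 == "blob" && !seen[$3]++ { total += $4 }
        END { printf "%.0f\n", total + 0 }
    '
}

# Possible usage inside a clone:
#   git ls-tree -r -l --full-name HEAD | unique_blob_bytes
```

Fed the ten entries above, this counts four distinct blobs totalling 887449422 bytes.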
I came up with the top ten using the following bash command. I'm sure there is an easier way to do this, but each Google search I do on how to get the largest files in a git repo gives half-baked data or scripts that are difficult to follow.

    git branch -a --list | sed 's/*//g' | sed 's/ *//g' | sed 's/remotes//g' | sed 's/^\///g' | sed 's/->origin\/trunk//g' | xargs -I xxx bash -c 'git ls-tree -r -t -l --full-name xxx | sort -n -k 4' | sort -n -k 4 | uniq

I'm not sure it is worthwhile to remove all the resources from the history; removing just the extremely large ones should bring the git repo down to a reasonable size.

IMAT Solutions <http://imatsolutions.com>
Kim Ebert
Software Engineer
Office: 208.971.1509
kim.eb...@imatsolutions.com <mailto:greg.hub...@imatsolutions.com>

On 05/18/2015 12:21 PM, Pei Chen wrote:
> One of the visions behind the *-res projects was to separate out the
> resources from code. In theory, one can filter out all *-res projects
> from their git repo and pull in any version of the resources from
> maven central... I won't have enough bandwidth at the moment to try
> it out or work on the git piece though...
> --Pei
>
> On Thu, May 14, 2015 at 1:56 PM, Kim Ebert
> <kim.eb...@perfectsearchcorp.com
> <mailto:kim.eb...@perfectsearchcorp.com>> wrote:
>
>     I've done some investigation into using / working with the git
>     repo for cTAKES, and I found that it is huge. It doesn't work
>     well with GitHub either, as I keep running into timeouts.
>
>     I would like to make the suggestion that we remove two cTAKES build
>     files and the ctakes-gui-0.0.1.zip file. This takes the repo from
>     about 8 GB down to 1.8 GB. It is likely that the reason the git
>     mirror is failing is due to the large size of the repo. GitHub
>     will also filter out some of these very large files, as GitHub's
>     max file size is 100MB.
>
>     git filter-branch --tree-filter 'rm -rf ctakes-gui-0.0.1.zip' origin/cTAKES-GUI-0.0.1
>     git filter-branch -f --tree-filter 'rm -rf _cTAKES_build_/cTAKES-2.5*.zip' origin/maven-sandbox
>     git filter-branch -f --tree-filter 'rm -rf _cTAKES_build_/cTAKES-2.5*.zip' origin/SHARPn-cTAKES
>
>     # Clean out unreferenced objects from repo
>     git -c gc.reflogExpire=0 -c gc.reflogExpireUnreachable=0 -c gc.rerereresolved=0 \
>         -c gc.rerereunresolved=0 -c gc.pruneExpire=now gc
>
>     It may also be helpful to remove
>     ctakes-dependency-parser-res/src/main/resources/org/apache/ctakes/dependency/parser/models/clearparser_models.jar
>     from the git repo as well. (238,248,287 bytes)
>
>     Thoughts?
>
>     IMAT Solutions <http://imatsolutions.com>
>     Kim Ebert
>     Software Engineer
>     Office: 208.971.1509 <tel:208.971.1509>
>     kim.eb...@imatsolutions.com <mailto:greg.hub...@imatsolutions.com>
>
>     On 05/06/2015 01:17 PM, Steven Bethard wrote:
>> Yes, I ping this issue every couple months, but no luck so far. (They
>> take a look each time I ask, but haven't yet pushed a working git
>> mirror for us.)
>>
>> Steve
>>
>> On Tue, May 5, 2015 at 12:09 PM, Kim Ebert
>> <kim.eb...@perfectsearchcorp.com>
>> <mailto:kim.eb...@perfectsearchcorp.com> wrote:
>>> Ah, looks like the issue is still being looked into.
>>>
>>> https://issues.apache.org/jira/browse/INFRA-8553
>>>
>>> On Mon, May 4, 2015 at 4:54 PM, jay vyas <jayunit100.apa...@gmail.com>
>>> <mailto:jayunit100.apa...@gmail.com>
>>> wrote:
>>>
>>>> Thanks kim.
>>>>
>>>> Can you file an infra issue ?
>>>>
>>>> they will look into it.
>>>>
>>>> I filed one originally
>>>> On May 4, 2015 6:32 PM, "Kim Ebert" <kim.eb...@perfectsearchcorp.com>
>>>> <mailto:kim.eb...@perfectsearchcorp.com>
>>>> wrote:
>>>>
>>>>> It looks like the github hasn't been updated in a while. Any reason?
>>>>>
>>>>> Thanks,
>>>>>
>>>>> Kim
>>>>>
>>>>> On Tue, Feb 17, 2015 at 10:36 AM, Finan, Sean <
>>>>> sean.fi...@childrens.harvard.edu
>>>>> <mailto:sean.fi...@childrens.harvard.edu>> wrote:
>>>>>
>>>>>> Our request is for a read-only mirror. However, if it ever becomes
>>>>>> i/o, I don't know if this will have what you want, but
>>>>>> http://git.apache.org/
>>>>>> Links to documentation (mostly server setup)
>>>>>> http://www.apache.org/dev/git.html and a wiki (check toward middle
>>>>>> and bottom for committer info)
>>>>>> https://wiki.apache.org/general/GitAtApache
>>>>>>
>>>>>> -----Original Message-----
>>>>>> From: Miller, Timothy [mailto:timothy.mil...@childrens.harvard.edu]
>>>>>> Sent: Tuesday, February 17, 2015 12:31 PM
>>>>>> To: dev@ctakes.apache.org <mailto:dev@ctakes.apache.org>
>>>>>> Subject: Re: CTAKES mirroring on github.
>>>>>>
>>>>>> Is there any existing resource to help people who want to use git
>>>>>> understand the right workflow to contribute to ctakes? (i.e. how this
>>>>>> interacts with svn repos).
>>>>>> Tim
>>>>>>
>>>>>> On 02/17/2015 12:23 PM, jay vyas wrote:
>>>>>>> Hi CTakes. Looks like infra finally got onto the JIRA i made for
>>>>>>> this a while back. They are currently working on fixing a couple of
>>>>>>> minor glitches w/ the mirroring (not showing all commits)... but
>>>>>>> there now is a mirror for CTakes on github.
>>>>>>>
>>>>>>> https://urldefense.proofpoint.com/v2/url?u=https-3A__github.com_apache_ctakes&d=BQIBaQ&c=qS4goWBT7poplM69zy_3xhKwEW14JZMSdioCoppxeFU&r=Heup-IbsIg9Q1TPOylpP9FE4GTK-OqdTDRRNQXipowRLRjx0ibQrHEo8uYx6674h&m=4sEI9mOpkTz6K-DjmNU1s8Do1TGA0_10HqJcowKpDxc&s=fNVbyXzpBLSAG6-DIjBZ1vbMp0JGaX90Lcdzg_EFVvM&e=
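
P.S. On the "easier way" to find the largest files mentioned at the top of this message: one possible alternative to walking each branch with ls-tree is to list every object reachable from any ref and ask git for its size. This is only a sketch. The function name blobs_at_least is made up for illustration, and the 100000000 threshold is just one reading of the 100MB GitHub limit quoted in the thread.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of git): read lines of the form
# "<type> <object id> <size> <path>" from stdin and keep only blobs
# whose size is at least $1 bytes, sorted so the largest come last.
blobs_at_least() {
    awk -v min="$1" '$1 == "blob" && $3 + 0 >= min + 0' | sort -k3,3n
}

# Possible usage inside a clone. rev-list walks every object reachable
# from any ref, and cat-file reports each object's type and size;
# %(rest) carries the path that rev-list printed after the object id.
#
#   git rev-list --objects --all |
#     git cat-file --batch-check='%(objecttype) %(objectname) %(objectsize) %(rest)' |
#     blobs_at_least 100000000
```

Because each object id is listed once by rev-list, a blob that appears under several paths or branches shows up only once in the result, with one of its paths.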