Apologies for the ugly e-mail; I am traveling.

On Thursday, December 12, 2019, Francis Tyers <fty...@prompsit.com> wrote:
> On 2019-12-12 09:16, Kevin Brubeck Unhammer wrote:
>>
>> Tino Didriksen <m...@tinodidriksen.com> wrote:
>>
>>> I would like for corpus and other indirect data to go in separate
>>> repositories. Basically, if the data is not used during the build, it
>>> should go elsewhere.
>>
>> What if it's used during `make test`?
>>
>> By the same argument, should we remove scripts that are used during
>> development, but not required for build (stuff that is kept in the dev/
>> subfolder)? If we get too strict on the requirement of "only things
>> necessary for build", people may start just not checking in useful
>> scripts, which to me seems worse. And it's already quite annoying having
>> to check out three repos just to work on one language pair; if
>> development depends on corpora repos, you have not just three, but *six*
>> places where you can forget to git push, or where you have to compare
>> git logs to review changes.
>>
>>> We need corpus data under Apertium's control so that we don't rely on 3rd
>>> parties. However, bundling this data in the languages' and pairs' repos
>>> means that those repos grow unbounded, especially when the data is changed.
>>
>> I agree that "big" data shouldn't be in the regular repos, since it
>> slows down checking them out. But less than a few megabytes of text
>> won't make much difference to a repo with tens of MB of .dix entries.
>>
>>> It also messes up the changelog. I use a script to generate AUTHORS from
>>> the changelog, because nobody keeps that up to date. But this gets muddied
>>> when unnecessary data is in the repo.
>>
>> In general I would want to include annotators as authors, though I can
>> imagine situations where it's not clear-cut, e.g. where the dataset is
>> too large or is not quite relevant for developing the rest of the repo.
>>
>> I think having corpus-xxx and corpus-xxx-yyy repos could be a good
>> thing, but I don't think we should have a hard requirement of moving
>> data over there, especially if the data is useful during testing and
>> development. I do think it makes sense to move larger corpora out, for
>> faster cloning.
>>
>
> I like the idea of not having large corpora in the git repos for
> languages and language pairs.

I feel pretty strongly about this: for the time being, git handles failed
clones by restarting from the beginning. If you have a poor internet
connection, git-clone'ing a massive repo can be nearly impossible, and this
disproportionately harms many of the very communities we want to help.

> I'm not sure if corpora-xxx on GitHub is the right way to go though.
>
> I think it would be better to store them on a web server and either:
>
> 1) Have apertium-xxx/text that has a script that will download the corpus
>    from the server and a gitignore to not have it in the repo.
> 2) Use something like git-annex (this is a bit more involved)

git-annex is essentially designed for exactly our use-case. GitHub and
GitLab natively speak a protocol called "git LFS" which git-annex supports.
So I would be highly supportive of moving in that direction. I would be
happy to help put together a proposal for what that would look like, but
probably not before the end of the month.

Potential problems I can see with such a plan are:

- git-annex has a heavy build dependency set (Haskell)
- git-annex depends on stable hashes of the corpus data
- git-annex packages can be out-of-date outside of Debian

These all have mitigations, which I can describe if there is more interest
in a proposal.

>
> It would be great to e.g. keep updated cleaned versions of Wikipedia dumps,
> and also be able to store non-redistributable stuff.

About this, I don't understand. Are there corpora we are storing which we
can distribute, but others cannot? Surely we are not distributing without a
license?

Cheers,
Nick

>
> I can expand a bit on this proposal if necessary.
>
> Fran
>
>
> _______________________________________________
> Apertium-stuff mailing list
> Apertium-stuff@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/apertium-stuff
>
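
For illustration only, option 1 in the quoted message (a download script
plus a gitignore entry) might look roughly like the sketch below. Nothing
in it comes from the thread: the function names, the ".part" suffix, the
example URL and the use of SHA-256 are all assumptions. It fetches one
corpus file into a directory that would be listed in .gitignore, skips the
download when a verified copy is already present, and rejects data whose
checksum does not match.

```python
import hashlib
import shutil
import urllib.request
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the hex SHA-256 digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def fetch_corpus(url: str, dest: Path, expected_sha256: str) -> bool:
    """Download url to dest unless a verified copy already exists.

    Returns True if a download happened, False if the local copy was reused.
    Raises ValueError if the downloaded data does not match the checksum.
    """
    if dest.exists() and sha256_of(dest) == expected_sha256:
        return False
    dest.parent.mkdir(parents=True, exist_ok=True)
    # Write to a temporary name first so an interrupted download never
    # leaves a truncated file under the final name.
    partial = dest.with_name(dest.name + ".part")
    with urllib.request.urlopen(url) as response, partial.open("wb") as out:
        shutil.copyfileobj(response, out)
    if sha256_of(partial) != expected_sha256:
        partial.unlink()
        raise ValueError(f"checksum mismatch for {url}")
    partial.replace(dest)
    return True


# Hypothetical usage (URL, path and checksum are placeholders):
# fetch_corpus("https://example.org/corpora/xxx.txt",
#              Path("texts/xxx.txt"),
#              "<sha256 of the published file>")
```

The target directory (here the hypothetical texts/) would be added to
.gitignore so the corpus never enters the repo. Pinning a checksum in the
script has the same requirement noted above for git-annex: the published
corpus file has to stay byte-for-byte stable, or the checksum must be
updated along with it.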