Re: [Apertium-stuff] Separate Corpus Repos

Nick Howell Sun, 15 Dec 2019 13:52:33 -0800

Apologies for the ugly e-mail; I am traveling.

On Thursday, December 12, 2019, Francis Tyers <fty...@prompsit.com> wrote:
> On 2019-12-12 09:16, Kevin Brubeck Unhammer wrote:
>>
>> Tino Didriksen <m...@tinodidriksen.com> wrote:
>>
>>> I would like for corpus and other indirect data to go in separate
>>> repositories. Basically, if the data is not used during the build, it
>>> should go elsewhere.
>>
>> What if it's used during `make test`?
>>
>> By the same argument, should we remove scripts that are used during
>> development, but not required for build (stuff that is kept in the dev/
>> subfolder)? If we get too strict on the requirement of "only things
>> necessary for build", people may start just not checking in useful
>> scripts, which to me seems worse. And it's already quite annoying having
>> to check out three repos just to work on one language pair; if
>> development depends on corpora repos, you have not just three, but *six*
>> places where you can forget to git push, or where you have to compare
>> git logs to review changes.
>>
>>> We need corpus data under Apertium's control so that we don't rely on
>>> 3rd parties. However, bundling this data in the languages' and pairs' repos
>>> means that those repos grow unbounded, especially when the data is
>>> changed.
>>
>> I agree that "big" data shouldn't be in the regular repos, since it
>> slows down checking them out. But less than a few megabytes of text
>> won't make much difference to a repo with tens of MB's of .dix entries.
>>
>>> It also messes up the changelog. I use a script to generate AUTHORS from
>>> the changelog, because nobody keeps that up to date. But this gets
>>> muddied when unnecessary data is in the repo.
>>
>> In general I would want to include annotators as authors, though I can
>> imagine situations where it's not clear-cut, e.g. where the dataset is
>> too large or is not quite relevant for developing the rest of the repo.
>>
>> I think having corpus-xxx and corpus-xxx-yyy repos could be a good
>> thing, but I don't think we should have a hard requirement of moving
>> data over there, especially if the data is useful during testing and
>> development. I do think it makes sense to move larger corpora out, for
>> faster cloning.
>>
>
> I like the idea of not having large corpora in the git repos for
> languages and language pairs.


I feel pretty strongly about this: for the time being, git cannot resume an
interrupted clone; it restarts from the beginning. If you have a poor
internet connection, cloning a massive repo can be nearly impossible, and
this disproportionately harms many of the very communities we want to help.

> I'm not sure if corpora-xxx in the github is the right way to go though.
>
> I think it would be better to store them on a web server and either:
>
> 1) Have apertium-xxx/text that has a script that will download the corpus
>     from the server and a gitignore to not have it in the repo.
> 2) Use something like git-annex (this is a bit more involved)
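
To make option 1 concrete, here is a rough sketch of what such a download
script might look like, in Python. Everything specific in it is a
placeholder I made up for illustration: the URL, the file name, the texts/
directory and the fetch_corpus helper do not exist anywhere today.

```python
"""Fetch a corpus into a git-ignored directory (hypothetical sketch)."""
import hashlib
import shutil
import urllib.request
from pathlib import Path

# Placeholder values: no such server or file is known to exist.
CORPUS_URL = "https://corpora.example.org/xxx/xxx.wikipedia.txt.bz2"
EXPECTED_SHA256 = None  # set to a published checksum, if there is one
DEST_DIR = Path("texts")  # assumed to be listed in .gitignore


def sha256_of(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def fetch_corpus(url=CORPUS_URL, dest_dir=DEST_DIR,
                 expected_sha256=EXPECTED_SHA256):
    """Download url into dest_dir unless already present; verify if asked."""
    dest_dir = Path(dest_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)
    target = dest_dir / url.rsplit("/", 1)[-1]
    if not target.exists():
        # Write to a temporary name first so that an interrupted download
        # is never mistaken for a complete corpus on the next run.
        partial = target.with_name(target.name + ".part")
        with urllib.request.urlopen(url) as response, \
                open(partial, "wb") as out:
            shutil.copyfileobj(response, out)
        partial.replace(target)
    if expected_sha256 is not None and sha256_of(target) != expected_sha256:
        raise ValueError(f"checksum mismatch for {target}")
    return target
```

A `make` target could call fetch_corpus() before running tests; the
corpus itself never enters the git history. Note that this sketch
restarts a failed download from scratch, so it does not by itself help
people on poor connections.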

git-annex is designed for essentially this use case. GitHub and GitLab
natively speak a protocol called "git LFS", which git-annex supports.
So I would be highly supportive of moving in that direction.

I would be happy to help put together a proposal for what that would look
like, but probably not before the end of the month. Potential problems I
can see with such a plan are:
- git-annex has a heavy set of build dependencies (it is written in Haskell)
- git-annex depends on stable hashes of the corpus data
- git-annex packages can be out of date outside of Debian

These all have mitigations, which I can describe if there is more interest
in a proposal.
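
To illustrate the stable-hash point: a content-addressed store names each
file by a hash of its bytes, so a corpus that is regenerated with even a
trivial difference becomes a new object. The sketch below is Python, not
git-annex itself; the key layout imitates git-annex's SHA256 keys as I
understand them, and content_key and the sample data are invented for
illustration.

```python
import hashlib


def content_key(data: bytes) -> str:
    """Build a content-derived key: hash name, size in bytes, hex digest.

    The layout imitates git-annex's SHA256 keys as I understand them and
    is for illustration only.
    """
    return f"SHA256-s{len(data)}--{hashlib.sha256(data).hexdigest()}"


# Two hypothetical runs of a dump-cleaning script over the same articles;
# only the generated-on line differs.
run_a = b"# generated 2019-12-14\nArticle text\n"
run_b = b"# generated 2019-12-15\nArticle text\n"

print(content_key(run_a) == content_key(run_a))  # identical bytes, same key
print(content_key(run_a) == content_key(run_b))  # one byte differs, new key
```

So any cleaning pipeline would need to be deterministic (no timestamps,
stable ordering) if we want re-running it to reproduce the same object.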

>
> It would be great to e.g. keep updated cleaned versions of Wikipedia
> dumps, and also be able to store non-redistributable stuff.

About this, I don't understand. Are there corpora we are storing which we
can distribute, but others cannot? Surely we are not distributing without a
license?

Cheers,
Nick

>
> I can expand a bit on this proposal if necessary.
>
> Fran
>
>
> _______________________________________________
> Apertium-stuff mailing list
> Apertium-stuff@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/apertium-stuff
>
