I would like corpus and other indirect data to go in separate
repositories. Basically, if data is not used during the build, it
should go elsewhere.

We need corpus data under Apertium's control so that we don't rely on
third parties. However, bundling this data in the languages' and pairs'
repos means that those repos grow without bound, especially whenever the
data is changed.

It also messes up the changelog. I use a script to generate AUTHORS from
the changelog, because nobody keeps that file up to date. But the result
gets muddied when unnecessary data is in the repo.

On the other hand, I recognize that having some corpus data locally
makes development easier. I just don't think that outweighs the
downsides. Hence, I am asking for opinions.

Concretely, I propose that data for apertium-xxx that isn't used during
the build should go into a repo named corpus-xxx, and similarly data for
apertium-xxx-yyy should go into corpus-xxx-yyy. This would make it easy
for helpers like apertium-get to automatically check out the related
corpus data.
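
To illustrate the naming scheme, here is a minimal Python sketch of how
a helper could derive the corpus repo name from a language or pair repo
name. The function name corpus_repo_for is hypothetical, and this is not
how apertium-get currently works; it only shows that the mapping is
mechanical. It does not check that the name parts are real language
codes, so the caller is assumed to pass only language or pair repos.

```python
PREFIX = "apertium-"


def corpus_repo_for(repo_name):
    """Return the proposed corpus repo name for a language or pair repo.

    Hypothetical helper: apertium-xxx -> corpus-xxx and
    apertium-xxx-yyy -> corpus-xxx-yyy. Returns None for names that do
    not have that shape.
    """
    if not repo_name.startswith(PREFIX):
        return None
    # Whatever follows the prefix should be one code (language) or two (pair).
    parts = repo_name[len(PREFIX):].split("-")
    if len(parts) not in (1, 2) or not all(parts):
        return None
    return "corpus-" + "-".join(parts)


if __name__ == "__main__":
    # xxx and yyy are the placeholders used in the proposal above.
    for name in ("apertium-xxx", "apertium-xxx-yyy"):
        print(name, "->", corpus_repo_for(name))
```

Because the corpus name follows directly from the module name, a helper
would not need any extra lookup table to find the related data.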

-- Tino Didriksen
