I would like for corpus and other indirect data to go in separate repositories. Basically, if the data is not used during the build, it should go elsewhere.
We need corpus data under Apertium's control so that we don't rely on third parties. However, bundling this data in the language and pair repos means that those repos grow without bound, especially when the data changes. It also clutters the changelog: I use a script to generate AUTHORS from the changelog, because nobody keeps that file up to date, and the result gets muddied when unnecessary data is in the repo.

On the other hand, I recognize that having some corpus data locally makes development easier. I just don't think that outweighs the downsides. Hence, asking for opinions.

Concretely, I propose that data for apertium-xxx that isn't used during the build should go into a repo named corpus-xxx, and similarly that data for apertium-xxx-yyy should go into corpus-xxx-yyy. This would make it easy for helpers like apertium-get to automatically check out the related corpus data.

-- Tino Didriksen
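
To illustrate the naming rule in the proposal (apertium-xxx maps to corpus-xxx, apertium-xxx-yyy maps to corpus-xxx-yyy), here is a rough sketch of how a helper could derive the corpus repo name. The function name corpus_repo_for is hypothetical and is not part of apertium-get; this only shows the string mapping, under the assumption that every language or pair repo name starts with "apertium-".

```python
from typing import Optional

# Assumption: language and pair repos are named "apertium-<codes>".
PREFIX = "apertium-"


def corpus_repo_for(repo: str) -> Optional[str]:
    """Return the proposed corpus repo name for a language or pair repo.

    Hypothetical helper, for illustration only. Returns None when the
    name does not follow the "apertium-<codes>" pattern.
    """
    if not repo.startswith(PREFIX) or len(repo) == len(PREFIX):
        return None
    # Keep everything after the prefix, so "xxx" and "xxx-yyy" both work.
    return "corpus-" + repo[len(PREFIX):]


if __name__ == "__main__":
    for name in ("apertium-xxx", "apertium-xxx-yyy", "lttoolbox"):
        print(name, "->", corpus_repo_for(name))
```

Because the mapping is a plain prefix swap, a helper would not need a lookup table to find the corpus repo for a given language or pair.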
_______________________________________________
Apertium-stuff mailing list
Apertium-stuff@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/apertium-stuff