Re: [Apertium-stuff] Separate Corpus Repos

Tommi A Pirinen Wed, 11 Dec 2019 08:40:18 -0800

[A few comments below:]

On Wed, Dec 11, 2019 at 04:26:02PM +0100, Tino Didriksen wrote:


> I would like for corpus and other indirect data to go in separate
> repositories. Basically, if the data is not used during the build, it
> should go elsewhere.

Some corpus data is used heavily for testing and development, as
mentioned below.

If we had more statistical / neural components in Apertium, the corpus
data could also be used to build some of them; I've done this in some
other projects.

> We need corpus data under Apertium's control so that we don't rely on 3rd
> parties. 

This is a point that I strongly agree with. For example, I used to use
http://www.unilang.org/resources.php?category=stories in tutorials to
give people texts, but it returns a 404 at the moment. Similar problems
appear with a lot of other free corpora, e.g. Project Gutenberg is
geoblocked here...

> However, bundling this data in the languages' and pairs' repos
> means that those repos grow unbounded, especially when the data is changed.

This is a problem I agree with, and git is not good for larger texts in
general. Ideally I would propose limiting the bundled "corpora" to a
few carefully cherry-picked texts for testing and development.

> It also messes up the changelog. I use a script to generate AUTHORS from
> the changelog, because nobody keeps that up to date. But this gets muddied
> when unnecessary data is in the repo.

In a few morphological analysers I've been developing recently, I'd
rather attribute the larger contribution to the native informants who've
annotated texts or translations than to my engineering efforts. I think
it's OK to have annotators as authors.

> On the other hand, I recognize that having some corpus data locally is
> easier to develop with. I just don't think that outweighs the downsides.
> Hence, asking for opinions.

I think a lot of development workflows do indeed rely on texts and
scripts kept alongside the language data. I would also use these texts
for automated testing, including CI.

> Concretely, I propose that data for apertium-xxx that isn't used during
> build should go into repo corpus-xxx, and similarly for apertium-xxx-yyy
> into corpus-xxx-yyy. This would make it easy for helpers like apertium-get
> to automatically check out the related corpus data.

I think this would also be a doable option; in the end, testing,
development and CI can be scripted to fetch one extra repo with
relatively little effort.
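
As a rough illustration of how small that scripting effort could be,
here is a minimal sketch that maps a module name to the proposed corpus
repo name and builds a shallow git clone command for it. The base URL,
the function names and the assumption that language codes are two or
three lowercase letters are all mine and purely illustrative; the
corpus-xxx repos are only a proposal in this thread.

```python
import re

# Assumption: corpus repos would live next to the language repos under
# this organisation URL. The corpus-* repos are only proposed so far.
BASE_URL = "https://github.com/apertium"

# Assumption: module names look like apertium-xxx or apertium-xxx-yyy,
# with two- or three-letter lowercase language codes.
MODULE_RE = re.compile(r"apertium-([a-z]{2,3}(?:-[a-z]{2,3})?)")


def corpus_repo_name(module: str) -> str:
    """Map apertium-xxx(-yyy) to the proposed corpus-xxx(-yyy) name."""
    match = MODULE_RE.fullmatch(module)
    if match is None:
        raise ValueError(f"not a language or pair module: {module!r}")
    return "corpus-" + match.group(1)


def clone_command(module: str, base_url: str = BASE_URL) -> list[str]:
    """Build a shallow clone command for the corpus repo of a module."""
    name = corpus_repo_name(module)
    # --depth 1 skips history, so large or frequently changed corpus
    # data does not make every test checkout slow.
    return ["git", "clone", "--depth", "1", f"{base_url}/{name}.git", name]


if __name__ == "__main__":
    # Only print the commands; a CI job could pass each list to
    # subprocess.run(cmd, check=True) instead.
    for module in ("apertium-fin", "apertium-fin-eng"):
        print(" ".join(clone_command(module)))
```

A helper like apertium-get could apply the same name mapping, but how
it would actually do so is not something this sketch claims to know.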

Perhaps one argument for keeping at least a selection of annotated and
hand-disambiguated gold corpora in the same repo as the dictionary and
grammar is that it keeps them closely in sync with each other.

-- 
Doktor Tommi A Pirinen, Computational Linguist,
<https://flammie.github.io/purplemonkeydishwasher/>, Universität
Hamburg, Hamburger Zentrum für Sprachkorpora <http://hzsk.de>. CLARIN-D
Entwickler.  President of ACL SIGUR SIG for Uralic languages
<http://gtweb.uit.no/sigur/>.
I tend to follow inline-posting style in desktop e-mail messages.


_______________________________________________
Apertium-stuff mailing list
Apertium-stuff@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/apertium-stuff
