Hi Andreas,

On 2019-05-21 09:07, Andreas Tille wrote:
> Not sure whether this is sensible to be added to the issue
> tracker.
I always abuse the issue tracker in my personal repository.

> Quoting from your section "Questions Not Easy to Answer"
>
> > 1. Must the dataset for training a Free Model present in our archive?
> > Wikipedia dump is a frequently used free dataset in the computational
> > linguistics field, is uploading wikipedia dump to our Archive sane?
>
> I have no idea about the size of this kind of dump. Recently I've read
> that data sets for other programs tend into the direction of 1GB. In
> Debian Med I'm maintaining metaphlan2-data with 204MB which would be
> even larger if there would not be some method for "data reduction" would
> be used that is considered a bug (#839925) by other DDs.

As pointed out by Mattias Wadenstein in the recent threads (thanks for
the data point), the Wikipedia dump is large enough to challenge the
.deb format itself.

> > 2. Should we re-train the Free Models on buildd? This is crazy. Let's
> > don't do that right now.
>
> If you ask me bothering buildd with this task is insane. However I'm
> positively convinced that we should ship the training data and be able
> to train the models from these.

It's always good if we can do these things purely within our archive.
However, it is sometimes just not easy to enforce: datasets used by
deep learning are generally large (several hundred MB to several TB,
or even larger).
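For a concrete data point on the size issue above, here is a rough
sketch that asks dumps.wikimedia.org how big the current English
Wikipedia articles dump is, without downloading it (the exact file
path is my assumption; adjust for other wikis or dump variants):

    #!/usr/bin/env python3
    # Query the dump mirror for the size of the latest enwiki articles
    # dump via a HEAD request and report the Content-Length.
    import urllib.request

    URL = ("https://dumps.wikimedia.org/enwiki/latest/"
           "enwiki-latest-pages-articles.xml.bz2")

    req = urllib.request.Request(URL, method="HEAD")
    with urllib.request.urlopen(req) as resp:
        size = int(resp.headers["Content-Length"])

    print(URL)
    print(f"~{size / 2**30:.1f} GiB compressed")

That gives a quick sanity check of whether a dataset is even in the
right ballpark for a single .deb before anyone tries to package it.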

