Re: Advise needed: guideline for very big data rpms?

2021-12-27 Thread Marius Schwarz

On 26.12.21 at 20:51, Matthew Miller wrote:


> Marius, are the different language packs updated continually and separately,
> or is there one versioned set of all of them released at intervals? Is it a
> case where everything is regenerated, or are additions incremental? (And do
> they _replace_ or just add?)


The language files are separate for each language. They are not updated
together.


It is more the massive amount of storage space in total that worries me.

The first release would be less than 40 GB; that was just a size the
entire project will easily reach if it keeps growing like it did in the
past.


> It does seem like it'd be nice to have a way to deliver (officially from
> Fedora in a way that can be shipped in Spins and containers) static files
> that don't change, without needing to redownload gigabytes on upgrade. Of
> course, delta RPMs are one way, but need a lot of investment in actually
> working again. Ostree deltas are another -- and maybe upcoming work on
> container deltas could be helpful.


I don't see a way to reduce the update size, as it is mostly one big file:

[marius@eve ~]$ ll /usr/share/pva/vosk-model-de-0.21/
insgesamt 28
drwxr-xr-x. 2 marius marius 4096 21. Aug 2020  am
drwxr-xr-x. 2 marius marius 4096  2. Aug 2020  conf
drwxr-xr-x. 3 marius marius 4096  9. Aug 2020  graph
drwxr-xr-x. 2 marius marius 4096 21. Aug 2020  ivector
-rw-r--r--. 1 marius marius  740 15. Sep 00:21 README
drwxr-xr-x. 2 marius marius 4096  9. Aug 2020  rescore
drwxr-xr-x. 2 marius marius 4096 15. Sep 00:14 rnnlm
[marius@eve ~]$ du -sh  /usr/share/pva/vosk-model-de-0.21/*
100M    /usr/share/pva/vosk-model-de-0.21/am
12K /usr/share/pva/vosk-model-de-0.21/conf
685M    /usr/share/pva/vosk-model-de-0.21/graph
8,2M    /usr/share/pva/vosk-model-de-0.21/ivector
4,0K    /usr/share/pva/vosk-model-de-0.21/README
2,1G    /usr/share/pva/vosk-model-de-0.21/rescore
281M    /usr/share/pva/vosk-model-de-0.21/rnnlm
[marius@eve ~]$ ll /usr/share/pva/vosk-model-de-0.21/rescore/
insgesamt 2171812
-rw-r--r--. 1 marius marius 2115929988 14. Sep 20:58 G.carpa
-rw-r--r--. 1 marius marius  107992138 14. Sep 20:50 G.fst
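
To put the listing above in numbers, here is a minimal Python sketch (not part of the original message) that estimates how much of the unpacked German model is taken up by G.carpa. The sizes are copied from the du and ll output above. Treating du's M and G suffixes as MiB and GiB is an assumption, and the helper name share_of_largest is made up for illustration.

```python
# Sizes copied from the `du -sh` and `ll` output above.
# Assumption: du's "M" means MiB; the German decimal comma (8,2M) becomes 8.2.
MIB = 1024 ** 2

dir_sizes = {
    "am": 100 * MIB,
    "conf": 12 * 1024,
    "graph": 685 * MIB,
    "ivector": int(8.2 * MIB),
    "README": 4 * 1024,
    "rnnlm": 281 * MIB,
}

# The two files inside rescore/, with exact byte counts from `ll`.
rescore_files = {
    "G.carpa": 2_115_929_988,
    "G.fst": 107_992_138,
}


def share_of_largest(dirs, files):
    """Return (name, fraction of total, total bytes) for the largest listed file."""
    total = sum(dirs.values()) + sum(files.values())
    name, size = max(files.items(), key=lambda item: item[1])
    return name, size / total, total


name, share, total = share_of_largest(dir_sizes, rescore_files)
print(f"{name}: {share:.0%} of about {total / 1024**3:.1f} GiB")
```

If a single file makes up most of the model and is regenerated with every model release, splitting the model into smaller packages would do little for the update size.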



> (And... I think it'd be useful in a lot of cases to be able to do dist-git
> -> container without needing to build RPMs as an intermediate step. But...
> that's not a thing we have now.)



As far as I understand the packaging rules, autodownloaders are not welcome,
and for security reasons, I absolutely support this.

We could reduce the problem at the beginning: there are no voice commands
ready for other languages yet, so it does not make sense to have those
language models around. I really hope the project gets a kick-start once the
first people use it. It's quite easy to write a set of commands and get it
running. I suggest a nice feature in the Fedora Magazine about a working
assistant for Fedora.


So at the beginning, we are talking about 2-4 GB for German and English. The
pva itself isn't that storage-hungry, a MB at most. A few vosk deps here and
there: ~100 MB uncompressed, maybe.

For now, I'm reworking the build process to compile against our Fedora libs,
so we can ship the required packages for kaldi & vosk. The required libs
shipped with Fedora are older than the current ones used by the vosk devs,
which is a problem.


With pip as the source for vosk, it works as expected, but the local vosk &
kaldi builds do not work yet :(


best regards,
Marius





___
devel mailing list -- devel@lists.fedoraproject.org
To unsubscribe send an email to devel-le...@lists.fedoraproject.org
Fedora Code of Conduct: 
https://docs.fedoraproject.org/en-US/project/code-of-conduct/
List Guidelines: https://fedoraproject.org/wiki/Mailing_list_guidelines
List Archives: 
https://lists.fedoraproject.org/archives/list/devel@lists.fedoraproject.org
Do not reply to spam on the list, report it: 
https://pagure.io/fedora-infrastructure


Re: Advise needed: guideline for very big data rpms?

2021-12-26 Thread Matthew Miller
On Sat, Dec 25, 2021 at 04:21:04PM -0500, Neal Gompa wrote:
> > > The project could offer 18 language files for a voice recognition
> > > system, which are up to 2.4 GB each unpacked and around 1.6-1.8 GB
> > > each packed, plus 18 small ones of ~50-60 MB each.
[...]
> But it's totally fine to ship such things in Fedora. We've done it
> before and we'll continue doing so. External downloaders are pretty
> much only used when something is unshippable.

In this case, I can *definitely* see the value in having these packaged
directly, because I can imagine IoT and desktop use cases where — even at
that size — voice recognition would be important to have "out-of-the-box"
without needing further Internet access.

Marius, are the different language packs updated continually and separately,
or is there one versioned set of all of them released at intervals? Is it a
case where everything is regenerated, or are additions incremental? (And do
they _replace_ or just add?)

It does seem like it'd be nice to have a way to deliver (officially from
Fedora in a way that can be shipped in Spins and containers) static files
that don't change, without needing to redownload gigabytes on upgrade. Of
course, delta RPMs are one way, but need a lot of investment in actually
working again. Ostree deltas are another — and maybe upcoming work on
container deltas could be helpful.

(And... I think it'd be useful in a lot of cases to be able to do dist-git
-> container without needing to build RPMs as an intermediate step. But...
that's not a thing we have now.)

-- 
Matthew Miller

Fedora Project Leader


Re: Advise needed: guideline for very big data rpms?

2021-12-25 Thread Neal Gompa
On Sat, Dec 25, 2021 at 4:12 PM Fabio Valentini wrote:
>
> On Wed, Dec 22, 2021 at 12:33 PM Marius Schwarz wrote:
> >
> > Hi,
> >
> > for a new project that is hopefully coming to Fedora soon, I would like
> > to know the policy for really big data RPMs.
> >
> > The project could offer 18 language files for a voice recognition
> > system, which are up to 2.4 GB each unpacked and around 1.6-1.8 GB each
> > packed, plus 18 small ones of ~50-60 MB each.
> >
> > So, roughly, we are talking about 40 GB for those language packs for the
> > first release alone, plus a lot more for updates per Fedora version, and
> > those packages grow constantly over time. Of course, users do not need
> > all of them at the same time, but they should be available.
> >
> > Is this a valid scenario for the Fedora Project, or would this be a no-go?
>
> I'm not sure if this is a good idea. For example, storage space in
> koji and especially on mirrors of Fedora repositories is already quite
> constrained, so adding tens of gigabytes to that (for every release +
> for stable/updates/testing repos) would probably explode some things
> :) Would it be possible to modify the software in question to download
> these data files on demand instead?
>

It's not unheard of that we have such RPMs in Koji. Several OSS games
are like that, and we also have data sets packaged in such a manner.

If you want to package something like this, usually it's preferred
that these things get their own source RPMs. If it's released as one
big source release, then you'll need to follow the langpack guidelines
for packaging them up with subpackages for each language. With each
language being at most ~3GB in size, people will *really* only want
the languages they need.

But it's totally fine to ship such things in Fedora. We've done it
before and we'll continue doing so. External downloaders are pretty
much only used when something is unshippable.


-- 
真実はいつも一つ!/ Always, there's only one truth!


Re: Advise needed: guideline for very big data rpms?

2021-12-25 Thread Fabio Valentini
On Wed, Dec 22, 2021 at 12:33 PM Marius Schwarz wrote:
>
> Hi,
>
> for a new project that is hopefully coming to Fedora soon, I would like to
> know the policy for really big data RPMs.
>
> The project could offer 18 language files for a voice recognition
> system, which are up to 2.4 GB each unpacked and around 1.6-1.8 GB each
> packed, plus 18 small ones of ~50-60 MB each.
>
> So, roughly, we are talking about 40 GB for those language packs for the
> first release alone, plus a lot more for updates per Fedora version, and
> those packages grow constantly over time. Of course, users do not need
> all of them at the same time, but they should be available.
>
> Is this a valid scenario for the Fedora Project, or would this be a no-go?

I'm not sure if this is a good idea. For example, storage space in
koji and especially on mirrors of Fedora repositories is already quite
constrained, so adding tens of gigabytes to that (for every release +
for stable/updates/testing repos) would probably explode some things
:) Would it be possible to modify the software in question to download
these data files on demand instead?

Fabio


Advise needed: guideline for very big data rpms?

2021-12-22 Thread Marius Schwarz

Hi,

for a new project that is hopefully coming to Fedora soon, I would like to
know the policy for really big data RPMs.


The project could offer 18 language files for a voice recognition
system, which are up to 2.4 GB each unpacked and around 1.6-1.8 GB each
packed, plus 18 small ones of ~50-60 MB each.

So, roughly, we are talking about 40 GB for those language packs for the
first release alone, plus a lot more for updates per Fedora version, and
those packages grow constantly over time. Of course, users do not need all
of them at the same time, but they should be available.
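
As a rough cross-check of the 40 GB figure, the per-model sizes above can be added up. The following is a minimal Python sketch and not part of the original message; the constant names and the total_range helper are assumptions for illustration, and "GB" is taken at face value.

```python
# Figures taken from the message above; all sizes in GB.
LARGE_MODELS = 18
SMALL_MODELS = 18
LARGE_PACKED_GB = (1.6, 1.8)   # packed size range per large model
LARGE_UNPACKED_GB = 2.4        # stated upper bound per large model, unpacked
SMALL_GB = (0.05, 0.06)        # 50-60 MB per small model


def total_range(count_large, per_large, count_small, per_small):
    """Return the (low, high) total for the given counts and per-model ranges."""
    low = count_large * per_large[0] + count_small * per_small[0]
    high = count_large * per_large[1] + count_small * per_small[1]
    return low, high


packed_low, packed_high = total_range(
    LARGE_MODELS, LARGE_PACKED_GB, SMALL_MODELS, SMALL_GB
)
# Upper bound if every large model is at its maximum unpacked size.
unpacked_high = LARGE_MODELS * LARGE_UNPACKED_GB + SMALL_MODELS * SMALL_GB[1]

print(f"packed:   {packed_low:.1f} to {packed_high:.1f} GB")
print(f"unpacked: up to {unpacked_high:.1f} GB")
```

With these inputs the packed total comes out at roughly 30 to 33 GB and the unpacked upper bound at roughly 44 GB, so "about 40 GB" is a fair round number for one full set of models.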


Is this a valid scenario for the Fedora Project, or would this be a no-go?

Best regards,
Marius Schwarz
