Re: Advice needed: guideline for very big data RPMs?
On 26.12.21 at 20:51, Matthew Miller wrote:
> Marius, are the different language packs updated continually and
> separately, or is there one versioned set of all of them released at
> intervals? Is it a case where everything is regenerated, or are additions
> incremental? (And do they _replace_ or just add?)

The language files are separate for each language. They are not updated together. It is more the massive amount of storage space in total that worries me. The first release would be less than 40 GB; that figure was just a size the entire project will easily reach if it keeps growing as it has in the past.

> It does seem like it'd be nice to have a way to deliver (officially from
> Fedora in a way that can be shipped in Spins and containers) static files
> that don't change, without needing to redownload gigabytes on upgrade. Of
> course, delta RPMs are one way, but need a lot of investment in actually
> working again. Ostree deltas are another, and maybe upcoming work on
> container deltas could be helpful.

I don't see a way to reduce the update size, as it is mostly one big file:

[marius@eve ~]$ ll /usr/share/pva/vosk-model-de-0.21/
total 28
drwxr-xr-x. 2 marius marius 4096 21. Aug 2020  am
drwxr-xr-x. 2 marius marius 4096  2. Aug 2020  conf
drwxr-xr-x. 3 marius marius 4096  9. Aug 2020  graph
drwxr-xr-x. 2 marius marius 4096 21. Aug 2020  ivector
-rw-r--r--. 1 marius marius  740 15. Sep 00:21 README
drwxr-xr-x. 2 marius marius 4096  9. Aug 2020  rescore
drwxr-xr-x. 2 marius marius 4096 15. Sep 00:14 rnnlm

[marius@eve ~]$ du -sh /usr/share/pva/vosk-model-de-0.21/*
100M /usr/share/pva/vosk-model-de-0.21/am
12K  /usr/share/pva/vosk-model-de-0.21/conf
685M /usr/share/pva/vosk-model-de-0.21/graph
8,2M /usr/share/pva/vosk-model-de-0.21/ivector
4,0K /usr/share/pva/vosk-model-de-0.21/README
2,1G /usr/share/pva/vosk-model-de-0.21/rescore
281M /usr/share/pva/vosk-model-de-0.21/rnnlm

[marius@eve ~]$ ll /usr/share/pva/vosk-model-de-0.21/rescore/
total 2171812
-rw-r--r--. 1 marius marius 2115929988 14. Sep 20:58 G.carpa
-rw-r--r--. 1 marius marius  107992138 14. Sep 20:50 G.fst

> (And... I think it'd be useful in a lot of cases to be able to do
> dist-git -> container without needing to build RPMs as an intermediate
> step. But... that's not a thing we have now.)

As far as I understand the packaging rules, autodownloaders are not welcome, and for security reasons I absolutely support this.

We could reduce the problem at the beginning: there are no voice commands ready for other languages yet, so it does not make sense to ship those language models. I really hope the project gets a kick start once the first people use it. It's quite easy to write a set of commands and get it running. I suggest a nice feature article in Fedora Magazine about a working assistant for Fedora.

So at the beginning, we are talking about 2-4 GB for German and English. The pva itself isn't that storage hungry, about a MB at most. A few vosk deps here and there: ~100 MB uncompressed, maybe.

For now, I'm rebuilding the compile process against our Fedora libs, so we can ship the required packages for kaldi & vosk. The required libs shipped with Fedora are older than the current ones used by the vosk devs, which is a problem. With pip as the source for vosk, it works as expected, but the local vosk & kaldi builds do not work yet :(

best regards,
Marius
___
devel mailing list -- devel@lists.fedoraproject.org
To unsubscribe send an email to devel-le...@lists.fedoraproject.org
Fedora Code of Conduct: https://docs.fedoraproject.org/en-US/project/code-of-conduct/
List Guidelines: https://fedoraproject.org/wiki/Mailing_list_guidelines
List Archives: https://lists.fedoraproject.org/archives/list/devel@lists.fedoraproject.org
Do not reply to spam on the list, report it: https://pagure.io/fedora-infrastructure
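As a rough illustration of the "mostly one big file" point in the message above, the following Python sketch computes how much of the rescore/ directory a single file accounts for. The byte counts are copied from the `ll` listing in the message; the helper name `largest_share` is hypothetical and not part of pva, vosk, or any Fedora tooling.

```python
# Sizes in bytes, copied from the `ll .../rescore/` listing in the message.
RESCORE_FILES = {
    "G.carpa": 2115929988,
    "G.fst": 107992138,
}


def largest_share(sizes):
    """Return (name, fraction) for the largest entry in a name -> bytes map."""
    name = max(sizes, key=sizes.get)
    return name, sizes[name] / sum(sizes.values())


name, share = largest_share(RESCORE_FILES)
print(f"{name} makes up {share:.1%} of rescore/")
```

If a single file dominates like this, any update that regenerates that file means downloading nearly the whole model again, regardless of how the rest of the package is split up.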
Re: Advice needed: guideline for very big data RPMs?
On Sat, Dec 25, 2021 at 04:21:04PM -0500, Neal Gompa wrote:
> > > The project could offer 18 language files for a voice recognition
> > > system, which are (unpacked) up to 2.4 GB each and packed up to around
> > > 1.6~1.8 GB each. + 18 small ones ~50-60 MB each.
[...]
> But it's totally fine to ship such things in Fedora. We've done it
> before and we'll continue doing so. External downloaders are pretty
> much only used when something is unshippable.

In this case, I can *definitely* see the value in having these packaged directly, because I can imagine IoT and desktop use cases where, even at that size, voice recognition would be important to have "out-of-the-box" without needing further Internet access.

Marius, are the different language packs updated continually and separately, or is there one versioned set of all of them released at intervals? Is it a case where everything is regenerated, or are additions incremental? (And do they _replace_ or just add?)

It does seem like it'd be nice to have a way to deliver (officially from Fedora in a way that can be shipped in Spins and containers) static files that don't change, without needing to redownload gigabytes on upgrade. Of course, delta RPMs are one way, but need a lot of investment in actually working again. Ostree deltas are another, and maybe upcoming work on container deltas could be helpful.

(And... I think it'd be useful in a lot of cases to be able to do dist-git -> container without needing to build RPMs as an intermediate step. But... that's not a thing we have now.)
--
Matthew Miller
Fedora Project Leader
Re: Advice needed: guideline for very big data RPMs?
On Sat, Dec 25, 2021 at 4:12 PM Fabio Valentini wrote:
>
> On Wed, Dec 22, 2021 at 12:33 PM Marius Schwarz wrote:
> >
> > Hi,
> >
> > for a new project hopefully coming soon to Fedora, I would like to know
> > the policy for really big data RPMs.
> >
> > The project could offer 18 language files for a voice recognition
> > system, which are (unpacked) up to 2.4 GB each and packed up to around
> > 1.6~1.8 GB each.
> > + 18 small ones ~50-60 MB each.
> >
> > So roughly, we are talking about 40 GB just for those language packs
> > just for the first release + a lot more for new updates per Fedora
> > version, and those packages grow constantly over time. Of course, users
> > do not need all of them at the same time, but they should be available.
> >
> > Is this a valid scenario for the Fedora Project or would this be a no-go?
>
> I'm not sure if this is a good idea. For example, storage space in
> koji and especially on mirrors of Fedora repositories is already quite
> constrained, so adding tens of gigabytes to that (for every release +
> for stable/updates/testing repos) would probably explode some things
> :) Would it be possible to modify the software in question to download
> these data files on demand instead?
>

It's not unheard of that we have such RPMs in Koji. Several OSS games are like that, and we also have data sets packaged in such a manner.

If you want to package something like this, usually it's preferred that these things get their own source RPMs. If it's released as one big source release, then you'll need to follow the langpack guidelines for packaging them up with subpackages for each language. With each language being at most ~3GB in size, people will *really* only want the languages they need.

But it's totally fine to ship such things in Fedora. We've done it before and we'll continue doing so. External downloaders are pretty much only used when something is unshippable.

--
真実はいつも一つ!/ Always, there's only one truth!
Re: Advice needed: guideline for very big data RPMs?
On Wed, Dec 22, 2021 at 12:33 PM Marius Schwarz wrote:
>
> Hi,
>
> for a new project hopefully coming soon to Fedora, I would like to know
> the policy for really big data RPMs.
>
> The project could offer 18 language files for a voice recognition
> system, which are (unpacked) up to 2.4 GB each and packed up to around
> 1.6~1.8 GB each.
> + 18 small ones ~50-60 MB each.
>
> So roughly, we are talking about 40 GB just for those language packs
> just for the first release + a lot more for new updates per Fedora
> version, and those packages grow constantly over time. Of course, users
> do not need all of them at the same time, but they should be available.
>
> Is this a valid scenario for the Fedora Project or would this be a no-go?

I'm not sure if this is a good idea. For example, storage space in koji and especially on mirrors of Fedora repositories is already quite constrained, so adding tens of gigabytes to that (for every release + for stable/updates/testing repos) would probably explode some things :)

Would it be possible to modify the software in question to download these data files on demand instead?

Fabio
Advice needed: guideline for very big data RPMs?
Hi,

for a new project hopefully coming soon to Fedora, I would like to know the policy for really big data RPMs.

The project could offer 18 language files for a voice recognition system, which are (unpacked) up to 2.4 GB each and packed up to around 1.6~1.8 GB each.
+ 18 small ones ~50-60 MB each.

So roughly, we are talking about 40 GB just for those language packs just for the first release + a lot more for new updates per Fedora version, and those packages grow constantly over time. Of course, users do not need all of them at the same time, but they should be available.

Is this a valid scenario for the Fedora Project or would this be a no-go?

Best regards,
Marius Schwarz
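For reference, the "about 40 GB" estimate can be checked with a little arithmetic. The Python sketch below is only a back-of-the-envelope calculation from the per-pack figures in the message above; it assumes decimal units (1 GB = 1000 MB) and that all 18 large packs sit at the stated bounds, neither of which the message confirms. The function and constant names are illustrative.

```python
# Figures taken from the message; units assumed to be decimal GB.
LARGE_PACKS = 18
SMALL_PACKS = 18
LARGE_PACKED_GB = (1.6, 1.8)     # packed size range per large pack
SMALL_PACKED_GB = (0.05, 0.06)   # 50-60 MB per small pack
LARGE_UNPACKED_MAX_GB = 2.4      # upper bound per large pack, unpacked


def total_range(n_large, per_large, n_small, per_small):
    """Return (low, high) total size in GB for the given pack counts."""
    low = n_large * per_large[0] + n_small * per_small[0]
    high = n_large * per_large[1] + n_small * per_small[1]
    return low, high


packed_low, packed_high = total_range(
    LARGE_PACKS, LARGE_PACKED_GB, SMALL_PACKS, SMALL_PACKED_GB
)
unpacked_large_max = LARGE_PACKS * LARGE_UNPACKED_MAX_GB

print(f"packed, all packs:     {packed_low:.1f} to {packed_high:.1f} GB")
print(f"unpacked, large packs: up to {unpacked_large_max:.1f} GB")
```

Under these assumptions the packed total comes out at roughly 30 to 33 GB and the unpacked large packs at up to about 43 GB, so 40 GB is a reasonable round figure for a single release before any per-release or per-repository multiplication.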