Hello Nexa,
I'd like to point you to a new thread [0] started by the OSI
community manager, inviting the community to list the issues still
present in draft 0.0.9 of the Open Source AI Definition [1].
The thread is titled "We heard you: let’s focus on substantive
discussion". [2]
It has turned out to be quite interesting, at times surprising,
especially for what it reveals about the working group's
decision-making processes, which are anything but conventional. [3]
For the moment, though, the issues that have emerged are:
- Data transparency: The data used to train an AI system should be
openly available, as it’s essential for understanding and improving
the model.
- Pretraining dataset distribution: The dataset used for pre-training
should also be accessible to ensure transparency and allow for
further development.
- Dataset documentation: The documentation for training datasets should
be thorough and accurate to address potential issues.
- Versioning: To maintain consistency and reproducibility, versioned
data is crucial for training AI systems (see the first sketch after
this list).
- Open licensing: Data used to train Open Source AI systems should be
licensed under an open license.
- Reproducibility: An Open Source AI must be reproducible using the
original training data, scripts, logs, and everything else used by
the original developer.
- Inherent user (in)security: Without access to the whole training
data, it’s possible to plant undetectable backdoors in machine
learning models (see the second sketch after this list).
- Implicit or unspecified formal requirements: If ambiguities in the
OSAID are to be resolved for each candidate AI system through a
formal certificate issued by OSI, such a formal requirement should be
explicitly stated in the OSAID.
- OSI as a single point of failure: Since each new version of every
candidate Open Source AI system worldwide would have to undergo the
certification process again, OSI would become a vulnerable bottleneck
in AI development and the target of unprecedented lobbying from the
industry.
- Open Washing AI: Any definition that a black box could pass would
both damage the credibility of the whole open source ecosystem and
open a huge loophole in European legislation (the AI Act).
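To make the versioning and reproducibility points concrete, here is a
minimal sketch of content-addressed dataset pinning (the
"training_data/" path and the release-note format are hypothetical,
not something the OSAID prescribes):

    # Minimal sketch: pin an exact dataset version by hashing its files.
    import hashlib
    from pathlib import Path

    def dataset_fingerprint(root: str) -> str:
        """SHA-256 digest over every file under `root`, visited in
        sorted order so identical data always yields the same value."""
        digest = hashlib.sha256()
        for path in sorted(Path(root).rglob("*")):
            if path.is_file():
                digest.update(str(path.relative_to(root)).encode())
                digest.update(path.read_bytes())
        return digest.hexdigest()

    # A release note could then state, for example:
    #   trained on dataset v0.0.9, sha256=<fingerprint>
    print(dataset_fingerprint("training_data/"))

Anyone re-training the system can recompute the fingerprint and check
that they hold exactly the data the original developer used.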
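On the (in)security point, here is a toy, deliberately crude
illustration of data poisoning (the undetectable constructions in the
literature are far more sophisticated, and the array shapes below are
made up), just to show the asymmetry: the trigger is obvious to
anyone holding the dataset, and invisible to anyone holding only the
weights.

    # Toy sketch of a poisoned training set (a BadNets-style trigger).
    import numpy as np

    rng = np.random.default_rng(0)
    images = rng.random((1000, 28, 28))       # stand-in "images"
    labels = rng.integers(0, 10, size=1000)   # stand-in labels

    # Stamp a 3x3 white patch on 50 samples and relabel them as class 7.
    poisoned = rng.choice(1000, size=50, replace=False)
    images[poisoned, 0:3, 0:3] = 1.0
    labels[poisoned] = 7

    # A model trained on this data learns "patch => class 7". The patch
    # is trivial to spot by inspecting the dataset, and practically
    # impossible to spot by inspecting the released weights alone.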
All of these issues are extensively documented in the thread or in
the other linked threads; still, if you have noticed further problems,
or if you would like to comment on these, I suggest you raise them as
soon as possible.
Giacomo
PS: As it happens, every one of the issues raised can be addressed by
requiring the availability of the training data, as proposed in the
thread that the same community manager closed after silencing me [4].
[0] https://discuss.opensource.org/t/we-heard-you-lets-focus-on-substantive-discussion/589
[1] https://opensource.org/deepdive/drafts
[2] It really does say "heard", yet some users are still silenced.
[3] https://discuss.opensource.org/t/we-heard-you-lets-focus-on-substantive-discussion/589/9
[4] https://discuss.opensource.org/t/rfc-separating-concerns-between-source-data-and-processing-information/568