Hi Mo,

On Sat, Jun 08, 2019 at 10:07:13PM -0700, Mo Zhou wrote:
> Hi Osamu,
>
> On 2019-06-08 18:43, Osamu Aoki wrote:
> >> This draft is conservative and overkilling, and currently
> >> only focuses on software freedom. That's exactly where we
> >> start, right?
> >
> > OK, but it can't be where we end up.
>
> That's why I used the two words "conservative" and "overkilling".
> In my blueprint, we can actually loosen these restrictions bit
> by bit with further case study.
Yes, we agree here!

> > Before scientific "deep learning" data, we already had practical "deep
> > learning" data in our archive.
>
> Thanks for pointing them out. They are good case studies
> for me to use when revising the DL-Policy.
>
> > Please note that one of the most popular Japanese input methods, mozc,
> > will be kicked out of main as a starter if we start enforcing this new
> > guideline.
>
> I'm in no position to irresponsibly enforce an experimental
> policy without having finished enough case study.

I noticed this since you were thinking deeply enough, but I saw some
danger of other people making decisions too quickly based on the
"labeling". Please check our history on the following GRs:
 https://www.debian.org/vote/2004/vote_003
 https://www.debian.org/vote/2006/vote_004
We are stuck with "Further Discussion" at this moment.

> >> Specifically, I defined 3 types of pre-trained machine
> >> learning models / deep learning models:
> >>
> >>   Free Model, ToxicCandy Model, Non-free Model
> >>
> >> Developers who'd like to touch DL software should be
> >> cautious about the "ToxicCandy" models. Details can be
> >> found in my draft.
> >
> > A label like "ToxicCandy Model" for this situation makes a bad
> > impression on people, and I am afraid people may not make rational
> > decisions. Is this characterization correct and sane? At least,
> > it looks to me like this is changing the status quo of our policy and
> > practice severely. So it is worth evaluating the idea without labeling.
>
> My motivation for the name "ToxicCandy" is pure: to warn developers
> about this special case, as it may lead to very difficult copyright
> or software freedom questions. I admit that this name does not look
> quite friendly. Maybe "SemiFree" looks better?

Although I understand the intent of "SemiFree" or "Tainted" (by Yao), I
don't think these are good choices. We need to draw a line between
FREE (= main) and NON-FREE (= non-free) as an organization.
I think there are 2 FREE model types we are allowing in "main" under the
current practice:

 * Pure Free Model:      trained from purely free pre-training data only
 * Sanitized Free Model: trained from mixed free and non-free
   pre-training data

And we don't allow a Non-Free Model in "main".

The question is when you can call a model "sanitized" (or "distilled")
enough to be clean and qualify for "main" ;-)

> > As long as the "data" comes in a form which allows us to modify it and
> > re-train it to make it better with a set of free software tools, we
> > shouldn't make it non-free, for sure. That is my position, and I
> > think this is how we have operated as the project. We never asked how
> > things were originally made. The touchy question is how easy it
> > should be to modify and re-train, etc.
> >
> > Let's list analogous cases. We allow a photo of something in our
> > archive as wallpaper etc. We don't require the object of the photo or
> > the tool used to make it to be FREE. The Debian logo is one example,
> > which was created with Photoshop as I understand it. Another analogy
> > to consider is how we allow independent copyright and license for
> > dictionary-like data, which must have been produced by processing
> > previously copyrighted (possibly non-free) texts with human brains,
> > and maybe with some script processing. Packages such as opendict,
> > *spell-*, dict-freedict-all, ... are in main. ...
>
> Thank you Osamu. These cases inspired me to find a better
> balance point for DL-Policy. I'll add these cases to the case
> study section, and I'm going to add the following points to DL-Policy:
>
> 1. Free datasets used to train a FreeModel are not required to be
>    uploaded to our main section, for example those Osamu mentioned and
>    the Wikipedia dump. We are not a scientific data archiving
>    organization, and such data will blow up our infra if we upload too
>    much of it.
>
> 2. It's not required to re-train a FreeModel on our infra, because
>    the outcome/cost ratio is impractical.
>    The outcome is nearly zero
>    compared to directly using a pre-trained FreeModel, while the cost
>    is increased carbon dioxide in our atmosphere and wasted developer
>    time. (Deep learning is producing much more carbon dioxide than we
>    thought.)
>
> For classical probabilistic graph models such as MRF or the mentioned
> CRF, the training process might be trivial, but re-training is still
> not required.

... but re-training is highly desirable, in line with the spirit of
free software.

> For SemiFreeModel I still hesitate to make any decision. Once we let
      ^^^^^^^^^^^^^ SanitizedModel
> them enter the main section, there will be many unreproducible
> or hard-to-reproduce but surprisingly "legal" (in terms of DL-Policy)
> files. Maybe this case is to some extent similar to artworks and fonts.
                                         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
                                         YES
> Further study is needed. And it's still not easy to find a balance
> point for SemiFreeModel between usefulness and freedom.
            ^^^^^^^^^^^^^ SanitizedModel

Let's use "SanitizedModel" to be neutral.

We need to have some guiding principle for this sanitization process.
(I don't have an answer now.)

This sanitization mechanism shouldn't be used to include obfuscated
binary-blob equivalents. That would be worse than the FIRMWARE case,
since such a blob runs on the same CPU as the program code.

Although "Further Discussion" was the outcome, choice B in
 https://www.debian.org/vote/2006/vote_004
is worth looking at:

  Strongly recommends that all non-programmatic works distribute the
  form that the copyright holder or upstream developer would actually
  use for modification. Such forms need not be distributed in the
  orig.tar.gz (unless required by license) but should be made available
  on upstream websites and/or using Debian project resources.

Please note this is "Strongly recommends ... should be made
available ..." and not "must be made available ...".

Aside from the Policy/Guideline discussion of FREE vs. NON-FREE, we
also need to address the spirit of the reproducible build.
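As a minimal first step toward that spirit, we could at least record and
verify a checksum of the shipped model file, so a rebuilt or re-trained
model can be compared bit-for-bit against what we distribute. A rough
sketch (the file name is a placeholder, not any actual Debian artifact):

```python
# Sketch only: record/verify a pre-trained model's checksum in the
# spirit of reproducible builds.  "model.bin" is a hypothetical name.
import hashlib

def sha256_of(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file, read in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def verify(path, expected_digest):
    """True if the file on disk matches the recorded digest."""
    return sha256_of(path) == expected_digest
```

This only detects drift; true reproducibility would require the
training pipeline itself to be deterministic, which is a much harder
problem for deep learning models.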
It would be nice to have a checking mechanism for the validity and
health of these MODELs. I know one of the Japanese keyboard input
methods, "Anthy", is suffering a regression in the upcoming release.
The fix was found too late, so I uploaded it to experimental, since it
contained too many changes while the impact was subtle. If we had a
test suite with numerical score outputs, we could have detected such an
upstream regression earlier. It may be unrealistic to aim for an exact
match for such a probabilistic model, but an objectively traceable
measure is very desirable to have.

Osamu
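P.S. A rough sketch of the kind of numerical test suite I have in mind.
The conversion function, test cases, and baseline score below are all
made up for illustration; a real suite would use the input method's
actual conversion API and a recorded reference score:

```python
# Sketch only: a numerical regression check for a probabilistic model
# such as an input-method engine.  "convert" stands in for the engine's
# reading-to-text conversion function.

def accuracy(convert, cases):
    """Fraction of (reading, expected) pairs converted correctly."""
    hits = sum(1 for reading, expected in cases
               if convert(reading) == expected)
    return hits / len(cases)

def check_no_regression(score, baseline, tolerance=0.02):
    """Exact match is unrealistic for a probabilistic model, so only
    fail when the score drops more than `tolerance` below the recorded
    baseline."""
    return score >= baseline - tolerance
```

A score like this, tracked across uploads, would have flagged the Anthy
regression objectively even without an exact-match requirement.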