Hi folks, I hope you'll excuse the entry of a rather junior newcomer, but it seems to me that there might be some misunderstandings about the nature of the models.
In particular, unless I am mistaken, the 'binaries' in question are jars filled with human-readable ASCII files. Developers are therefore free to peruse these files to see how the model is put together, and could even modify them if they wished (though, as mentioned earlier, this would be quite foolish).

The sticking point regarding the training data for the models is almost certainly that the data consists of medical records protected by HIPAA. For example, the Mayo data used for the sentence detector model includes Mayo in-house, programmatically de-identified patient records. This kind of data is generally never released without a data use agreement (DUA). I'm not familiar with any major de-identified clinical record datasets that are available without a DUA, as the liability (and, frankly, the moral concern) deriving from the risk of a third party re-identifying the data is simply too great.

That being said, anybody who uses cTAKES must have a corpus that could be used to train new models, since using cTAKES requires input data. Developers who wish to contribute modifications can therefore test model generation and use on their own data before contributing. A problem remains: how contributed modifications perform when trained on the 'official' non-distributed datasets, which contributors would not be able to test a priori. I imagine there are committers with access to the datasets who could provide feedback, but I suspect this is less a licensing issue and more a consequence of how cTAKES works. All users and developers need to be cognizant of the applicability of the distributed models to their own datasets. I would bet that the models do not perform well on most other institutions' medical record corpora, and that any code contributions involving the models would have a similar issue.
Given this, it seems to me that the models could simply be treated the same way as, say, an image file distributed with code as a placeholder. Users are free to replace the placeholder with something that works for them, and the placeholder is not intended to work for everyone or to make it into any production distribution.

Hopefully at least some of this makes sense and is helpful!

Karthik

--
Karthik Sarma
UCLA Medical Scientist Training Program Class of 20??
Member, UCLA Medical Imaging & Informatics Lab
Member, CA Delegation to the House of Delegates of the American Medical Association
[email protected]
gchat: [email protected]
linkedin: www.linkedin.com/in/ksarma

On Wed, Jan 23, 2013 at 8:20 AM, Benson Margulies <[email protected]> wrote:

> So, nothing derived from those undisclosable sources can be in the
> source package: period.
>
> As for the binaries, I am personally uncomfortable if you cannot even
> create a private download of those sources accessible to community
> members. However, I don't know how to translate my personal discomfort
> into policy. I will endeavour to get some advice.
>
>
> On Wed, Jan 23, 2013 at 10:36 AM, Masanz, James J.
> <[email protected]> wrote:
> > One goal is to have a binary that contains all resources, which can be
> > used to install cTAKES on a system that does not have an internet
> > connection.
> > For now we can focus on a first Apache release that doesn't meet that
> > goal, while pursuing the question with legal.
> > If legal says we can't do have that kind of binary here, then in the
> > future we can consider if we will host such a binary on a different site.
> >
> > Regards,
> > James Masanz
> >
> >> -----Original Message-----
> >> From: [email protected]
> >> [mailto:[email protected]]
> >> On Behalf Of Chris Douglas
> >> Sent: Wednesday, January 23, 2013 3:45 AM
> >> To: [email protected]
> >> Cc: [email protected]
> >> Subject: Re: [VOTE] Apache cTAKES 3.0.0-incubating RC5 release
> >>
> >> On Wed, Jan 23, 2013 at 12:47 AM, Jörn Kottmann <[email protected]>
> >> wrote:
> >> > No, the OpenNLP did not have any discussion about it with legal. We
> >> > just came to the conclusion that its not worth spending time on these
> >> > issues, when we can instead produce our own training data which is
> >> > compatible with the Apache license.
> >>
> >> Understood. Are the compatible training data synthetic? Would you
> >> recommend a similar course here?
> >>
> >> James, is there a reason the models need to be distributed through
> >> Apache? Your time is your own, but going through legal could delay your
> >> release. -C
> >>
> >> ---------------------------------------------------------------------
> >> To unsubscribe, e-mail: [email protected]
> >> For additional commands, e-mail: [email protected]
