+1 to the suggestion of documenting many examples of "getting started" NLP datasets.
I have at least one we can use that was created by our lead Pathologist:
https://open.med.harvard.edu/svn/scrubber/releases/3.0/data/input/cases/train/traincase.xml

We should provide at least one sample for each domain.
The trouble is, privacy requires that these examples be made up by hand and not copy-pasted from EMR systems.

--Andy

On Jun 27, 2013, at 5:32 PM, Girivaraprasad Nambari <[email protected]> wrote:

> +1 for this observation Andy!
>
> Lowering the time to get started will motivate users to write blogs about
> features, how-tos, etc., which reduces the core team's workload on
> documentation.
>
> I have been trying to write a small "how to write a standalone client for
> ctakes" from my experience (I saw at least 4 users post similar questions
> in the last 2 months), but I am not getting enough time because ctakes
> depends on a lot of other frameworks (UimaFit, cleartk, UIMA Framework,
> etc.). Most of my spare time is spent juggling between these frameworks,
> posting and browsing those forums, and relating observations to ctakes
> code. I think we need to have some high-level documentation about these
> (with links to the corresponding forums).
>
> The above case is for developers (I think this will be the larger user
> base as ctakes progresses). For users I think the documentation is a lot
> better, though some improvements need to be made.
>
> As a developer I found the lack of sample training data tough (I am still
> struggling in this area even though I browsed all the relevant code),
> though the training classes are there. I understand that there are
> licensing issues with REAL data, but at least some hand-made example
> sentences, which may not be real, would help developers understand the
> type/structure of input the TRAINING classes expect. This way people who
> browse the code can reverse engineer and develop their own models. Sorry
> if you guys feel this is a novice issue, but I feel most developers will
> be novices when they adopt a system, and Machine Learning/NLP is an ocean.
> Some documentation in this area will save a lot of time for us.
>
> I hope there will be some activity in this area from the ctakes core team.
>
> Thank you,
> Giri
>
> On Thu, Jun 27, 2013 at 5:11 PM, Andy McMurry <[email protected]> wrote:
>
>> ctakes is at a point where we have a LOT of features but it is still hard
>> to get started.
>>
>> Judging from the mailing lists, a lot of how cTakes works is not obvious
>> and requires hand holding.
>> This is very typical in early FOSS projects.
>>
>> Lowering the time to get invested in ctakes gets more users AND better bug
>> reports, FAQ, etc.
>>
>> thoughts?
>> --Andy
>>
>> On Apr 11, 2013, at 8:55 PM, "Chen, Pei" <[email protected]> wrote:
>>
>>> Hi,
>>> I just wanted to gauge the interest of creating the next release of cTAKES (3.1) which is currently marked for May in Jira-
>>>
>>> There have already been 22/53 issues [1] marked as fixed or closed. Plenty of bug fixes and new components including:
>>> - New CEM Instance Template population
>>> - New Dependency Parser/Semantic Role Labeler
>>> - New optional Clear POSTagger
>>> - New regression testing component
>>>
>>> Should we wait for the Temporal component?
>>>
>>> [1] https://issues.apache.org/jira/issues/?jql=fixVersion%20%3D%20%223.1%22%20AND%20project%20%3D%20CTAKES
