Re: Next cTAKES release (3.1)?

Tim Miller Tue, 02 Jul 2013 15:31:47 -0700

Agreed that you could definitely help out, and that would be a great way to do so. We don't really have "examples" right now, more like just short test sentences for showing simple results and verifying that nothing has been broken by changes. I think regular-length fake but realistic notes would be very useful.

Tim

On 07/02/2013 05:19 PM, John Green wrote:

Hi all,


I've been following this mailing list for a couple of months. I'm a third-year
medical student rounding the bend toward my MD. I used to be a computer
programmer, however, and continue my own projects. I'm very interested in
eventually contributing to cTAKES development. In the meantime, given the
current talk of examples, if any domain-specific examples need to be generated,
I am knowledgeable enough in the domain that I could pound out a few free-text
notes made to order.

Let me know. You all may already have docs on hand willing to do this, but if
not...

John Green

Sent from my iPhone

On Jun 28, 2013, at 8:59, "Chen, Pei" <[email protected]> wrote:

I completely agree with making cTAKES easier to use.  I think it is exciting to
hear the different use cases here and to understand where some of the areas
that need improvement are (which we hadn't thought about earlier).
I think Tim's suggestions and the 3 concrete actionable items make a lot of
sense.  Hopefully they will attract new users, adopters, and perhaps more
committers.

i) Make the typesystem forefront in documentation -- generate javadocs and
have as a link on the ctakes frontpage/sidebar
ii) Similar to the way that we are aiming to have tests in every module, also
have clearly labeled examples in every module that set up a pipeline, run on
sample notes (could be the same sample notes from the tests), and do
something with the results.
iii) Follow Giri's recommendation to have example training data for people
who want to take the next step and train their own models

I think Java developers are accustomed to including a library as a
dependency/jar, having an API to pass input to, and getting the results back
as POJOs, so the examples could initially shield the complexity of wiring a
pipeline together, etc.
If we can improve the APIs and how they get integrated with other apps, we can
add any GUI/CLI tools on top of this afterwards.

--Pei

-----Original Message-----
From: Miller, Timothy [mailto:[email protected]]
Sent: Friday, June 28, 2013 8:00 AM
To: [email protected]
Subject: Re: Next cTAKES release (3.1)?

Very interesting discussion. I think Giri is right about giving example training
data in the format that our training code can read. While our ultimate goal
would be to build and release models that are completely domain-
independent, in the real world it is almost always better to use some
domain-specific data and we should think more about how to facilitate that.

As for making it easier to get started, it is not totally clear to me what this
means/how to do it so it might be useful to get specific about what this
means. I think our biggest hurdle is

1) Prerequisite of understanding UIMA/UIMAFit

Since UIMAFit is officially becoming part of UIMA that will be easier, and
hopefully people will just learn the easier (in my opinion) UIMAFit way than
the standard UIMA way of doing things. Is there something we can be doing
to make understanding UIMA easier? Or do we just need to say upfront that
this is a prerequisite and hope that people don't give up due to this thing that
is out of our control?

Another hurdle is:

2) cTAKES is a multi-purpose developer-aimed tool

So it's not just a matter of hiding complexity -- at some point people have to
understand their problem, understand cTAKES' capabilities, and start coding.
Pei's GUI will help for some common use cases but will not remove the
requirement that someone at the organization knows cTAKES.
I think one part of this problem is the fact that the typesystem is not well
documented. A developer needs to know what the output is (objects from
the typesystem), how to get them (which modules/pipelines), and what
information is in them. So maybe on this end my recommendation would be:
i) Make the typesystem forefront in documentation -- generate javadocs and
have as a link on the ctakes frontpage/sidebar
ii) Similar to the way that we are aiming to have tests in every module, also
have clearly labeled examples in every module that set up a pipeline, run on
sample notes (could be the same sample notes from the tests), and do
something with the results.
iii) Follow Giri's recommendation to have example training data for people
who want to take the next step and train their own models

This is quite a bit of developer overhead, so it's worth asking whether you
agree with my "diagnosis" and "treatment" or whether you think there are
different problems/solutions that should be higher priority.

Tim

On 06/27/2013 10:59 PM, Girivaraprasad Nambari wrote:

Hi Vijay and Andy,

Thanks for sharing those examples.

"Trouble is, privacy requires that these examples be made up by hand"

I agree with this statement, and this is a very valid concern.

In "getting started examples", I think we should just have a couple of
entries (5-10 small entries), not more than that (with an explicit
statement like "ONLY EXAMPLE, NOT GOOD FOR REAL USAGE"). I understand
handcrafting these may not be easy because we are not medical domain
experts, but I feel it is worth the time, because it brings in more of a
user community.

Thank you,
Giri





On Thu, Jun 27, 2013 at 10:25 PM, Andy McMurry <[email protected]> wrote:

GREAT !

The i2b2 data isn't publicly distributable, though; you still need to
request access to it since it is "semi private".


On Jun 27, 2013, at 9:52 PM, vijay garla <[email protected]> wrote:

We released code on using cTAKES to annotate clinical text and SVMs
that use the annotations to classify clinical text from the CMC 2007
and I2B2 2008 challenges:

We did the CMC 2007 with cTAKES 2.5:

https://code.google.com/p/ytex/wiki/WordSenseDisambiguation_V08#Reproducing_results_on_CMC_2007_challenge
<https://code.google.com/p/ytex/downloads/list>

And the i2b2 2008 with the version of cTAKES distributed with the
first version of ARC:
https://code.google.com/p/ytex/wiki/FeatEng_V05#i2b2_2008

These are both publicly available datasets, and represent real-world
problems (in general I believe when publishing a paper the code
should be reproducible and made publicly available, but that's a
different issue).

When we get around to upgrading YTEX to cTAKES 3.1, we would like to
upgrade these samples as well.

Best,

VJ



On Thu, Jun 27, 2013 at 8:32 PM, Andy McMurry <[email protected]> wrote:

+1 suggestion for documenting many examples of "getting started" NLP
datasets.

I have at least one we can use that was created by our lead Pathologist:

https://open.med.harvard.edu/svn/scrubber/releases/3.0/data/input/cases/train/traincase.xml

We should provide at least one sample for each domain.
Trouble is, privacy requires that these examples be made up by hand
and not copy-pasted from EMR systems.

--Andy

On Jun 27, 2013, at 5:32 PM, Girivaraprasad Nambari <[email protected]> wrote:

+1 for this observation Andy!

Lowering the time will motivate users to write blogs about features,
how-tos, etc., which reduces the core team's documentation workload.

I have been trying to write a small "how to write a standalone client
for ctakes" from my experience (I saw at least 4 users post similar
questions in the last 2 months), but I am not getting enough time because
ctakes depends on a lot of other frameworks (UimaFit, cleartk, UIMA
Framework, etc.). Most of my spare time is being spent juggling between
these frameworks, posting and browsing those forums, and relating
observations to ctakes code. I think we need to have some high-level
documentation about these (with links to the corresponding forums).

The above case is for developers (I think this will be the larger user
base as ctakes progresses); for users I think the documentation is a lot
better, though some improvements need to be made.

As a developer I found the lack of sample training data tough (I am
still struggling in this area even though I browsed all the relevant
code), though the training classes are there. I understand that there are
licensing issues with REAL data, but at least some handmade example
sentences, which may not be real, would help developers understand the
type/structure of input the TRAINING classes expect. This way people who
browse the code can reverse engineer and develop their own models. Sorry
if you guys feel this is a novice issue, but I feel most developers will
be novices when they adopt a system, and Machine Learning/NLP is an
ocean. Some documentation in this area will save a lot of time for us.

I hope there will be some activity in this area from the ctakes core team.

Thank you,
Giri



On Thu, Jun 27, 2013 at 5:11 PM, Andy McMurry <[email protected]> wrote:

ctakes is at a point where we have a LOT of features but it is still
hard to get started.

Judging from the mailing lists a lot of how cTakes works is not obvious
and requires hand holding.
This is very typical in early FOSS projects.

Lowering the time to get invested in ctakes gets more users AND better
bug reports, FAQ, etc.

thoughts?
--Andy


On Apr 11, 2013, at 8:55 PM, "Chen, Pei" <[email protected]> wrote:

Hi,
I just wanted to gauge the interest in creating the next release of
cTAKES (3.1), which is currently marked for May in Jira.

There have already been 22/53 issues [1] marked as fixed or closed.

Plenty of bug fixes and new components including:

- New CEM Instance Template population
- New Dependency Parser/Semantic Role Labeler
- New optional Clear POSTagger
- New regression testing component

Should we wait for the Temporal component?

[1] https://issues.apache.org/jira/issues/?jql=fixVersion%20%3D%20%223.1%22%20AND%20project%20%3D%20CTAKES
