[ 
https://issues.apache.org/jira/browse/CTAKES-374?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14702366#comment-14702366
 ] 

Pei Chen commented on CTAKES-374:
---------------------------------

This would be a pretty cool feature and a nice contribution. One approach
would be to distribute the load within the document and then reduce the
results back once all of the paragraphs are done.

> Scaleout of cTAKES pipeline
> ---------------------------
>
>                 Key: CTAKES-374
>                 URL: https://issues.apache.org/jira/browse/CTAKES-374
>             Project: cTAKES
>          Issue Type: New Feature
>    Affects Versions: future enhancement
>            Reporter: Selina Chu
>             Fix For: 3.2.1
>
>
> Currently, cTAKES can't be easily deployed in an asynchronous manner. UIMA 
> components aren't serializable (and thus cTAKES' components as well).  Would 
> like to come up with better ways to allow cTAKES to be easily run in a 
> distributed fashion.
> For example, for processing a long document (e.g. 10+ pages), cTAKES would 
> take a long time to process.
> I would like to see a feature where we can partition the input to cTAKES, in 
> a way that won't affect the cTAKES annotation performance, allowing us to 
> process through a cluster running in distributed mode (e.g. Spark streaming 
> cTAKES).  And then recombine the results such that the word/phrase token 
> positions will be sequentially ordered.
> We have a simple implementation of the ClinicalPipelineFactory with Spark 
> Streaming.  Currently our initial attempt in partitioning is by paragraphs. 
> For example, we are doing something like:
> RDD.map(a_single_paragraph.process_in_ctakes())
> I also wanted to see if there are any better ways of doing this.  
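
The paragraph-level partitioning and recombination described in the quoted issue can be sketched as follows. This is only a minimal illustration in plain Python, not the reporter's Spark implementation and not the cTAKES API. The names split_paragraphs, annotate_paragraph, and process_document are hypothetical, and annotate_paragraph is a stand-in word tokenizer rather than a real cTAKES pipeline. The point it shows is that each paragraph carries its starting character offset, so paragraph-local annotation positions can be shifted back to document positions and then sorted.

```python
import re
from typing import Callable, List, Tuple

# (begin, end, covered_text); begin/end are character offsets.
Annotation = Tuple[int, int, str]


def split_paragraphs(document: str) -> List[Tuple[int, str]]:
    """Split on blank lines, keeping each paragraph's start offset."""
    paragraphs = []
    start = 0
    for sep in re.finditer(r"\n\s*\n", document):
        if sep.start() > start:
            paragraphs.append((start, document[start:sep.start()]))
        start = sep.end()
    if start < len(document):
        paragraphs.append((start, document[start:]))
    return paragraphs


def annotate_paragraph(text: str) -> List[Annotation]:
    """Hypothetical stand-in for running one paragraph through cTAKES.

    Returns word tokens with offsets relative to the paragraph, not
    the whole document.
    """
    return [(m.start(), m.end(), m.group()) for m in re.finditer(r"\w+", text)]


def process_document(
    document: str,
    annotate: Callable[[str], List[Annotation]] = annotate_paragraph,
) -> List[Annotation]:
    """Annotate each paragraph independently, then recombine."""
    merged: List[Annotation] = []
    # On a cluster, this loop is the part that would be distributed
    # (for example as a map over (offset, paragraph) pairs).
    for offset, paragraph in split_paragraphs(document):
        for begin, end, text in annotate(paragraph):
            # Shift paragraph-local offsets back to document offsets.
            merged.append((begin + offset, end + offset, text))
    # The "reduce" step: restore sequential order by position.
    merged.sort(key=lambda a: (a[0], a[1]))
    return merged


if __name__ == "__main__":
    doc = "Patient has fever.\n\nNo cough noted."
    for begin, end, text in process_document(doc):
        print(begin, end, text)
```

One caveat with this kind of split is that any annotation depending on context outside its own paragraph (for example, a section heading that appears in an earlier paragraph) would not be visible to the per-paragraph step, so whether paragraph boundaries are safe depends on the annotators in the pipeline.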



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
