Hello,

We would like to contribute our TextMarker system to Apache UIMA and would like to ask whether the development team is interested in this contribution. The system is currently hosted on SourceForge (http://sourceforge.net/projects/textmarker/) and there is some documentation in the project wiki (http://tmwiki.informatik.uni-wuerzburg.de/).

As a starting point for that discussion, here is a summary of the current status of the system. TextMarker is an Eclipse-based tool implemented in pure Java that can, among other things, be used to prototype analysis engines or to develop complex handcrafted text processing applications. It consists of four major parts:

Language:
The rule language, or rather script language, can be compared to regular expressions over annotations with additional conditions and actions. There are currently 28 different conditions and 34 actions. Conditions range from a test on a feature value to a test of whether the matched annotation is contained in another annotation of a given type; actions range from creating an annotation to applying an external dictionary or analysis engine. A TextMarker script can import type systems or define new types or variables. There are also some more complex control structures for procedure calls, conditional statements and recursion. The TextMarker language (and inference) is in active use in some production applications here, but it lacks test cases. However, we are currently writing uimaFIT-based component tests to improve quality management.
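
As a rough illustration of the "regular expressions over annotations with conditions and actions" idea, the following Java sketch models one rule as a sequence of annotation types, a condition on the matched annotations, and an action that creates a new annotation. This is not TextMarker syntax and not its implementation: the classes RuleSketch, Annotation and Rule, the type names CapitalizedWord, Word and PersonName, and the adjacency condition are all assumptions made up for this example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;

class RuleSketch {

    /** A typed span of text, loosely modelled on a UIMA annotation (hypothetical class). */
    public static final class Annotation {
        public final String type;
        public final int begin;
        public final int end;

        public Annotation(String type, int begin, int end) {
            this.type = type;
            this.begin = begin;
            this.end = end;
        }
    }

    /** Hypothetical rule: a sequence of types, a condition on the match, and a create action. */
    public static final class Rule {
        private final List<String> typeSequence;
        private final Predicate<List<Annotation>> condition;
        private final String createdType;

        public Rule(List<String> typeSequence,
                    Predicate<List<Annotation>> condition,
                    String createdType) {
            this.typeSequence = typeSequence;
            this.condition = condition;
            this.createdType = createdType;
        }

        /** Slides over the annotations (assumed sorted by begin) and fires on every match. */
        public List<Annotation> apply(List<Annotation> input) {
            List<Annotation> created = new ArrayList<>();
            int n = typeSequence.size();
            for (int i = 0; i + n <= input.size(); i++) {
                List<Annotation> window = input.subList(i, i + n);
                if (typesMatch(window) && condition.test(window)) {
                    // Action: create a new annotation spanning the whole match.
                    created.add(new Annotation(createdType,
                            window.get(0).begin, window.get(n - 1).end));
                }
            }
            return created;
        }

        /** Checks that the window's annotation types equal the rule's type sequence. */
        private boolean typesMatch(List<Annotation> window) {
            for (int k = 0; k < window.size(); k++) {
                if (!typeSequence.get(k).equals(window.get(k).type)) {
                    return false;
                }
            }
            return true;
        }
    }

    /** Applies one example rule to annotations over the text "Peter Kluegl wrote rules". */
    static List<Annotation> demo() {
        List<Annotation> tokens = Arrays.asList(
                new Annotation("CapitalizedWord", 0, 5),
                new Annotation("CapitalizedWord", 6, 12),
                new Annotation("Word", 13, 18),
                new Annotation("Word", 19, 24));
        Rule personName = new Rule(
                Arrays.asList("CapitalizedWord", "CapitalizedWord"),
                // Condition: the two matched annotations are at most one character apart.
                match -> match.get(1).begin - match.get(0).end <= 1,
                "PersonName");
        return personName.apply(tokens);
    }

    public static void main(String[] args) {
        for (Annotation a : demo()) {
            System.out.println(a.type + " [" + a.begin + ", " + a.end + ")");
        }
    }
}
```

Running main prints one line, "PersonName [0, 12)", because only the first two annotations match both the type sequence and the condition. The real language is far richer (quantifiers, variables, control structures), so this only conveys the basic match, test, act cycle.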

Workbench:
The Eclipse-based tool for developing TextMarker scripts is currently based on DLTK 1.0 (http://www.eclipse.org/dltk/), and its editor supports syntax highlighting, syntax checks, context-sensitive auto-completion, formatting, mark occurrences, open declaration and other features familiar from IDEs. For each script file, a type system and an executable analysis engine are created. Therefore, it's quite simple and efficient to create an analysis engine with a few lines of TextMarker rules. The workbench supports testing on annotated xmiCAS while writing new rules and provides some minimal debugging functionality that explains why and on what text a rule was executed.

CEV:
This plugin can be used to edit or visualize xmiCAS and is also able to render HTML. It is heavily used by the testing and explanation components.

TextRuler:
This framework for rule learning is more of a playground and is mainly implemented by students. There are currently more or less working implementations of LP2, WHISK, WIEN, RAPIER and an algorithm of our own, and three other algorithms are being implemented.


Overall, the system has been running stably for a year now, but it is lacking in code quality, documentation and test cases. We are also willing to change the name of the system if someone can think of a better one.

I'm looking forward to your comments.

Best regards,

Peter


--
---------------------------------------------------------------------
Dipl.-Inf. Peter Klügl
Universität Würzburg        Tel.: +49-(0)931-31-86741
Am Hubland                  Fax.: +49-(0)931-31-86732
97074 Würzburg              mail: [email protected]
     http://www.is.informatik.uni-wuerzburg.de/en/staff/kluegl_peter/
---------------------------------------------------------------------
