Hi,

I have been planning for some time to contribute a few extensions/components
for UIMA TextMarker, which have been developed by my students. As I will
rename all projects soon, I want to bring the contributions forward.

With this mail, I want to start a discussion on whether the contributions
are reasonable and welcome, and whether the proposed procedure is OK.

textmarker-ep-textruler-kep:
A new rule learning algorithm based on the idea that humans use different
engineering patterns to create rule files. The implementation contains
simple learning algorithms for a few patterns and tries to combine the
different rules in order to take advantage of their synergy.
Essentially, the resulting rules should more closely resemble the rules a
human would write.

textmarker-ep-textruler-trabal:
A new rule learning algorithm that is able to induce
transformation-based, error-driven rules. The basic idea is similar to
the Brill tagger, but it is completely generic (no rule templates) and
can also handle arbitrary annotations instead of only token tags.
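
As a rough illustration of the transformation-based, error-driven idea
(this is not the TraBaL implementation): the sketch below repeatedly picks
the correction rule that removes the most errors against the gold labels.
Unlike TraBaL, it assumes one fixed rule template ("change label X to Y
if the previous label is Z") and plain token labels. All names are
hypothetical.

```python
def apply_rule(labels, rule):
    """Apply (from_label, to_label, prev_label) to a copy of labels."""
    frm, to, prev = rule
    out = list(labels)
    for i in range(1, len(labels)):
        # Context is read from the unmodified input sequence.
        if labels[i] == frm and labels[i - 1] == prev:
            out[i] = to
    return out


def count_errors(labels, gold):
    """Number of positions where labels disagree with the gold labels."""
    return sum(1 for a, b in zip(labels, gold) if a != b)


def learn(initial, gold, max_rules=10):
    """Greedily learn correction rules that reduce errors against gold."""
    current = list(initial)
    learned = []
    for _ in range(max_rules):
        # Propose one candidate rule per remaining error (assumed template).
        candidates = set()
        for i in range(1, len(current)):
            if current[i] != gold[i]:
                candidates.add((current[i], gold[i], current[i - 1]))
        base = count_errors(current, gold)
        best, best_gain = None, 0
        for rule in sorted(candidates):
            gain = base - count_errors(apply_rule(current, rule), gold)
            if gain > best_gain:
                best, best_gain = rule, gain
        if best is None:
            break  # no rule improves the result any further
        current = apply_rule(current, best)
        learned.append(best)
    return learned, current


initial = ["DET", "NOUN", "TO", "NOUN", "DET", "NOUN"]
gold = ["DET", "NOUN", "TO", "VERB", "DET", "NOUN"]
rules, corrected = learn(initial, gold)
print(rules)
```

Each learned rule is kept only if it lowers the error count, so the rule
list is ordered by when it was useful during learning.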

textmarker-ep-augur:
This project is essentially about evaluating information extraction
models (TextMarker rules) without labeled data. It is a new framework
similar to the testing views of the TextMarker Workbench, which are used
for back-testing and test-driven development. In contrast to the testing
views, the new framework is able to evaluate documents without a gold
dataset. Instead, the user can specify background knowledge (constraints),
which is applied to estimate the accuracy.
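
To give a feeling for what constraint-based evaluation could look like,
here is a minimal sketch. It assumes that constraints are simple yes/no
checks on the annotations of one document and that the score is the
fraction of satisfied checks. How augur actually formulates and combines
constraints is not described in this mail, so the function, the
constraints and the annotation types below are all hypothetical.

```python
def estimate_quality(annotations, constraints):
    """Return the fraction of constraints satisfied by one document.

    annotations: dict mapping an annotation type to its covered texts.
    constraints: list of functions taking that dict and returning bool.
    """
    if not constraints:
        return None  # nothing to judge the document by
    satisfied = sum(1 for check in constraints if check(annotations))
    return satisfied / len(constraints)


# Hypothetical background knowledge about a bibliographic reference.
constraints = [
    lambda a: len(a.get("Author", [])) >= 1,  # at least one author
    lambda a: len(a.get("Title", [])) == 1,   # exactly one title
    lambda a: all(1900 <= int(y) <= 2100 for y in a.get("Year", [])),
]

# Hypothetical extraction result with two titles, violating one constraint.
doc = {"Author": ["A. Author"], "Title": ["First", "Second"], "Year": ["2012"]}
score = estimate_quality(doc, constraints)
print(score)
```

No gold annotations are needed here: a low score only signals that the
extraction result contradicts what the user expects of such documents.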

Procedure of contribution:
- Create a Jira issue for each contribution
- Let the student attach the project (I do not know if there is still a
check box for the license)
- Commit the projects to sandbox/trunk
- Integrate the projects into the existing TextMarker projects

Is there a way to avoid requiring an ICLA from each student?

Before the projects can be part of a future UIMA TextMarker release,
some additional work needs to be done. I would take care of that.

Best,

Peter
