Hi,

for some time I have been planning to contribute a few extensions/components for UIMA TextMarker that were developed by my students. Since I will rename all projects soon, I want to bring the contributions forward.
With this mail, I want to start a discussion about whether the contributions are reasonable and welcome, and whether the proposed procedure is OK.

textmarker-ep-textruler-kep: A new rule learning algorithm based on the idea that humans use different engineering patterns to create rule files. The implementation contains simple learning algorithms for a few patterns and tries to combine the different rules in order to benefit from their synergy. Essentially, the resulting rules should more closely resemble the rules a human would write.

textmarker-ep-textruler-trabal: A new rule learning algorithm that is able to induce transformation-based, error-driven rules. The basic idea is similar to the Brill tagger, but it is completely generic (no rule templates) and can handle arbitrary annotations instead of only tags of tokens.

textmarker-ep-augur: This project is essentially about evaluating information extraction models (TextMarker rules) without labeled data. It is a new framework similar to the testing views of the TextMarker Workbench, which are used for back-testing and test-driven development. In contrast to the testing views, the new framework is able to evaluate documents without a gold dataset: the user can specify background knowledge (constraints), which is applied to estimate the accuracy.

Procedure of contribution:
- Create a Jira issue for each contribution
- Let the student attach the project (I do not know if there is still a check box for the license)
- Commit the projects to sandbox/trunk
- Integrate the projects into the existing TextMarker projects

Is there a way to avoid an ICLA for each student?

Before the projects can be part of a future UIMA TextMarker release, some additional work needs to be done. I would take care of that.

Best,

Peter