Author: pkluegl Date: Tue Dec 20 15:32:17 2011 New Revision: 1221318 URL: http://svn.apache.org/viewvc?rev=1221318&view=rev Log: UIMA-2285 converted to maven project added a proxy book and old (out-dated) introduction for testing the maven build process
Added: uima/sandbox/trunk/TextMarker/uima-docbook-textmarker/pom.xml uima/sandbox/trunk/TextMarker/uima-docbook-textmarker/src/docbook/images/tools/tools.textmarker/symboltaxo.png (with props) uima/sandbox/trunk/TextMarker/uima-docbook-textmarker/src/docbook/proxy-book.xml Modified: uima/sandbox/trunk/TextMarker/uima-docbook-textmarker/src/docbook/tools.textmarker.xml Added: uima/sandbox/trunk/TextMarker/uima-docbook-textmarker/pom.xml URL: http://svn.apache.org/viewvc/uima/sandbox/trunk/TextMarker/uima-docbook-textmarker/pom.xml?rev=1221318&view=auto ============================================================================== --- uima/sandbox/trunk/TextMarker/uima-docbook-textmarker/pom.xml (added) +++ uima/sandbox/trunk/TextMarker/uima-docbook-textmarker/pom.xml Tue Dec 20 15:32:17 2011 @@ -0,0 +1,23 @@ +<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd"> + <modelVersion>4.0.0</modelVersion> + <artifactId>uima-docbook-textmarker</artifactId> + <version>2.4.1-SNAPSHOT</version> + <packaging>pom</packaging> + <parent> + <groupId>org.apache.uima</groupId> + <artifactId>uimaj-parent</artifactId> + <version>2.4.1-SNAPSHOT</version> + <relativePath>../uimaj-parent/pom.xml</relativePath> + </parent> + <name>Apache UIMA SDK Documentation - TextMarker</name> + <url>${uimaWebsiteUrl}</url> + <scm> + <url>http://svn.apache.org/viewvc/uima/sandbox/trunk/TextMarker/uima-docbook-textmarker</url> + <connection>scm:svn:http://svn.apache.org/repos/asf/uima/sandbox/trunk/TextMarker/uima-docbook-textmarker</connection> + <developerConnection>scm:svn:https://svn.apache.org/repos/asf/uima/sandbox/trunk/TextMarker/uima-docbook-textmarker</developerConnection> + </scm> + <properties> + <uimaScmProject>${project.artifactId}</uimaScmProject> + <bookNameRoot>proxy-book</bookNameRoot> + </properties> +</project> \ No newline at end of 
file Added: uima/sandbox/trunk/TextMarker/uima-docbook-textmarker/src/docbook/images/tools/tools.textmarker/symboltaxo.png URL: http://svn.apache.org/viewvc/uima/sandbox/trunk/TextMarker/uima-docbook-textmarker/src/docbook/images/tools/tools.textmarker/symboltaxo.png?rev=1221318&view=auto ============================================================================== Binary file - no diff available. Propchange: uima/sandbox/trunk/TextMarker/uima-docbook-textmarker/src/docbook/images/tools/tools.textmarker/symboltaxo.png ------------------------------------------------------------------------------ svn:mime-type = application/octet-stream Added: uima/sandbox/trunk/TextMarker/uima-docbook-textmarker/src/docbook/proxy-book.xml URL: http://svn.apache.org/viewvc/uima/sandbox/trunk/TextMarker/uima-docbook-textmarker/src/docbook/proxy-book.xml?rev=1221318&view=auto ============================================================================== --- uima/sandbox/trunk/TextMarker/uima-docbook-textmarker/src/docbook/proxy-book.xml (added) +++ uima/sandbox/trunk/TextMarker/uima-docbook-textmarker/src/docbook/proxy-book.xml Tue Dec 20 15:32:17 2011 @@ -0,0 +1,27 @@ +<?xml version="1.0" encoding="UTF-8"?> +<!DOCTYPE book PUBLIC "-//OASIS//DTD DocBook XML V4.4//EN" +"http://www.oasis-open.org/docbook/xml/4.4/docbookx.dtd"> +<!-- +Licensed to the Apache Software Foundation (ASF) under one +or more contributor license agreements. See the NOTICE file +distributed with this work for additional information +regarding copyright ownership. The ASF licenses this file +to you under the Apache License, Version 2.0 (the +"License"); you may not use this file except in compliance +with the License. You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, +software distributed under the License is distributed on an +"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +KIND, either express or implied. 
See the License for the +specific language governing permissions and limitations +under the License. +--> +<book lang="en"> + <title>TextMarker Guide and Reference</title> + <xi:include xmlns:xi="http://www.w3.org/2001/XInclude" href="../../target/docbook-shared/common_book_info.xml"/> + <toc/> + <xi:include xmlns:xi="http://www.w3.org/2001/XInclude" href="tools.textmarker.xml"/> +</book> Modified: uima/sandbox/trunk/TextMarker/uima-docbook-textmarker/src/docbook/tools.textmarker.xml URL: http://svn.apache.org/viewvc/uima/sandbox/trunk/TextMarker/uima-docbook-textmarker/src/docbook/tools.textmarker.xml?rev=1221318&r1=1221317&r2=1221318&view=diff ============================================================================== --- uima/sandbox/trunk/TextMarker/uima-docbook-textmarker/src/docbook/tools.textmarker.xml (original) +++ uima/sandbox/trunk/TextMarker/uima-docbook-textmarker/src/docbook/tools.textmarker.xml Tue Dec 20 15:32:17 2011 @@ -25,4 +25,227 @@ under the License. --> <chapter id="ugr.tools.tm"> + <title>TextMarker User's Guide</title> + <titleabbrev>TextMarker User's Guide</titleabbrev> + + <section id="ugr.tools.tm.introduction"> + <title>TextMarker</title> + <para>The TextMarker system is a rule-based tool for information + extraction and text processing tasks. The comprehensible rule + language + can be easily extended and supports several scripting + functionalities. + TextMarker provides a DLTK-based IDE, an integration + and a build + process for UIMA components. + </para> + <section id="ugr.tools.tm.introduction.metaphor"> + <title>Introduction</title> + <para> + In manual information extraction humans often apply a strategy + according to a highlighter metaphor: First relevant headlines are + considered and classified according to their content by coloring + them + with different highlighters. The paragraphs of the annotated + headlines + are then considered further. 
Relevant text fragments or + single words + in the context of that headline can then be colored. In + this way, a + top-down analysis and extraction strategy is implemented. + Necessary + additional information can then be added that either refers + to other + text segments or contains valuable domain-specific + information. + Finally, the colored text can be easily analyzed + concerning the + relevant information. + + The TextMarker system (textmarker + is a common German word for a + highlighter) tries to imitate this + manual extraction method by + formalizing the appropriate actions using + matching rules: The rules + mark sequences of words, extract text + segments or modify the input + document depending on textual + features. The default input for the + TextMarker system is + semi-structured text, but it can also process + structured or free + text. Technically, HTML is often the input + format, + since most word + processing documents can be converted to HTML. + Additionally, the + TextMarker system offers the possibility to + create + a modified output + document. + </para> + </section> + <section id="ugr.tools.tm.introduction.concepts"> + <title>Core Concepts</title> + <para> + As a first step in the extraction process the TextMarker system uses + a + tokenizer (scanner) to tokenize the input document and to create a + stream of basic symbols. The types and valid annotations of the + possible tokens are predefined by a taxonomy of annotation types. + Annotations simply refer to a section of the input document and + assign a type or concept to the respective text fragment. The figure + on the right shows an excerpt of a basic annotation taxonomy: CW + describes all tokens, for example, that contain a single word + starting with a capital letter, MARKUP corresponds to HTML or XML + tags, and PM refers to all kinds of punctuation marks. Take a look + at [basic annotations|BasicAnnotationList] for a complete list of + initial annotations. 
+ + + <screenshot> + <mediaobject> + <imageobject> + <imagedata scale="100" format="PNG" fileref="&imgroot;symboltaxo.png" /> + </imageobject> + <textobject> + <phrase>Part of a taxonomy for basic annotation types.</phrase> + </textobject> + </mediaobject> + </screenshot> + + By using (and extending) the taxonomy, the knowledge engineer is + able + to choose the most adequate types and concepts when defining new + matching rules, i.e., TextMarker rules for matching a text fragment + given by a set of symbols to an annotation. If the capitalization of + a word, for example, is of no importance, then the annotation type W + that describes words of any kind can be used. The initial scanner + creates a set of basic annotations that may be used by the matching + rules of the TextMarker language. However, most information + extraction applications require domain specific concepts and + annotations. Therefore, the knowledge engineer is able to extend the + set of annotations, and to define new annotation types tuned to the + requirements of the given domain. These types can be flexibly + integrated in the taxonomy of annotation types. + + One of the goals in + developing a new information extraction language + was + to maintain an + easily readable syntax while still providing a + scalable + expressiveness of the language. Basically, the TextMarker + language + contains expressions for the definition of new annotation + types and + for defining new matching rules. The rules are defined by a + list of + rule elements. + Each rule element contains at least a basic matching + condition referring + to text fragments or already specified + annotations. Additionally a + list of conditions and actions may be + specified for a rule element. + Whereas the conditions describe + necessary attributes of the matched + text fragment, the actions point + to operations and assignments on + the + current fragments. 
These actions + will then only be executed if all + basic conditions matched on a text + fragment or the annotation and the + related conditions are fulfilled. + </para> + </section> + <section id="ugr.tools.tm.introduction.examples"> + <title>Examples</title> + <para> + The usage of the language and its readability can be demonstrated by + simple examples: + + <programlisting> + CW{INLIST('animals.txt') -> MARK(Animal)}; + Animal "and" Animal{-> MARK(Animalpair, 1, 2, 3)}; + </programlisting> + + The first rule looks at all capitalized words that are listed in an + external document animals.txt and creates a new annotation of the + type + Animal using the boundaries of the matched word. The second rule + searches for an annotation of the type Animal followed by the + literal + "and" and a second Animal annotation. Then it will create a new + annotation Animalpair covering the text segment that matched the + three + rule elements (the digit parameters refer to the positions of the + matched + rule elements). + + <programlisting> + Document{-> MARKFAST(Firstname, 'firstnames.txt')}; + Firstname CW{-> MARK(Lastname)}; + Paragraph{VOTE(Firstname, Lastname) -> LOG("Found more Firstnames than Lastnames")}; + </programlisting> + + In this example, the first rule annotates all words that occur in + the + external document firstnames.txt with the type Firstname. The + second + rule creates a Lastname annotation for all capitalized words + that + follow a Firstname annotation. The last rule finally processes + all + Paragraph annotations. If the VOTE condition counts more + Firstname + than Lastname annotations, then the rule writes a log entry + with a + predefined message. 
+ + + <programlisting> + ANY+{PARTOF(Paragraph), CONTAINS(Delete, 50, 100, true) -> MARK(Delete)}; + Firstname{-> MARK(Delete, 1, 2)} Lastname; + Delete{-> DEL}; + </programlisting> + + Here, the first rule looks for sequences of any kind of tokens + except + markup and creates one annotation of the type Delete for each + sequence, if the tokens are part of a Paragraph annotation and + together already contain more than 50% Delete annotations. The + + sign indicates this greedy processing. The second rule annotates + first + names followed by last names with the type Delete and the third + rule + simply deletes all text segments that are associated with that + Delete + annotation. + + </para> + </section> + <section id="ugr.tools.tm.introduction.features"> + <title>Special Features</title> + <para> + The TextMarker language features some special characteristics + that are + usually not found in other rule-based information extraction + systems + or that even shift it towards scripting languages. The possibility + of + creating new annotation types and integrating them into the + taxonomy + facilitates an even more modular development of information + extraction systems. + + Read more about robust extraction using + filtering, complex control + structures and heuristic extraction using + scoring rules. + </para> + </section> + </section> </chapter> \ No newline at end of file