Revision: 17243
          http://sourceforge.net/p/gate/code/17243
Author:   markagreenwood
Date:     2014-01-22 16:06:49 +0000 (Wed, 22 Jan 2014)
Log Message:
-----------
documentation for the document normalizer plugin

Modified Paths:
--------------
    userguide/trunk/misc-creole.tex
    userguide/trunk/recent-changes.tex

Modified: userguide/trunk/misc-creole.tex
===================================================================
--- userguide/trunk/misc-creole.tex     2014-01-22 15:26:45 UTC (rev 17242)
+++ userguide/trunk/misc-creole.tex     2014-01-22 16:06:49 UTC (rev 17243)
@@ -3347,3 +3347,33 @@
 % OBVIOUSLY I'm not finished documenting this.
 % -- Adam
 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
+
+%
+%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
+\sect[sec:misc-creole:doc-normalizer]{Document Normalizer}
+%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
+%
+A problem that occurs quite frequently when processing text documents created
+with modern WYSIWYG editors (Word is the main culprit) is that standard
+punctuation symbols, such as apostrophes and hyphens, are silently replaced
+by symbols that look \textit{``nicer''}. While there may be a good reason
+behind this substitution (i.e. printouts look better) it plays havoc with
+text processing. For example, a tokenizer that handles words with apostrophes
+in them will produce different output for the typographic variants, and
+gazetteers are likely to use standard ASCII hyphens and apostrophes.
+
+Whilst it may be possible to modify all processing resources to handle all
+different forms of each punctuation symbol, doing so would be both tedious
+and error-prone. A better solution is to modify the documents
+as part of the processing pipeline, replacing these characters with their
+normalized versions.
+
+This plugin normalizes the punctuation (or any other characters) by editing
+the document content to replace them. Note that, as this plugin edits the
+document content, it should be run as the first PR in the pipeline in order
+to avoid problems with changes in annotation spans.
+
+The normalizations are controlled via a simple configuration file in which
+a pair of lines describes a single normalization; the first line is a regular
+expression describing the text to replace, and the second line is the
+replacement.
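
The configuration format described above (pairs of lines, a regular
expression followed by its replacement) can be illustrated with a minimal
Python sketch. This is not the plugin's Java implementation: the names
parse_rules, normalize and EXAMPLE_CONFIG are hypothetical, the two example
rules are illustrative rather than taken from the plugin's shipped
configuration, and treating the replacement line as literal text is an
assumption.

```python
import re

# Hypothetical configuration content: each rule is a pair of lines,
# a regular expression followed by its replacement.
# Rule 1: curly single quotes (U+2018, U+2019) become an ASCII apostrophe.
# Rule 2: Unicode hyphens and dashes (U+2010 to U+2015) become an ASCII hyphen.
EXAMPLE_CONFIG = "[\u2018\u2019]\n'\n[\u2010-\u2015]\n-\n"


def parse_rules(config_text):
    """Turn pairs of lines into (compiled pattern, replacement) tuples."""
    lines = config_text.splitlines()
    if len(lines) % 2 != 0:
        raise ValueError("expected pairs of lines: pattern, then replacement")
    return [
        (re.compile(lines[i]), lines[i + 1])
        for i in range(0, len(lines), 2)
    ]


def normalize(text, rules):
    """Apply each rule in order and return the normalized text."""
    for pattern, replacement in rules:
        # Assumption: the replacement is inserted literally, so a function
        # is passed to sub() to bypass backreference and escape handling.
        text = pattern.sub(lambda match, r=replacement: r, text)
    return text


if __name__ == "__main__":
    rules = parse_rules(EXAMPLE_CONFIG)
    print(normalize("It\u2019s a well\u2013known problem", rules))
```

Because the rules are applied in the order they are listed, an earlier rule
can change what a later one matches, so the order of the pairs matters in
this sketch.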

Modified: userguide/trunk/recent-changes.tex
===================================================================
--- userguide/trunk/recent-changes.tex  2014-01-22 15:26:45 UTC (rev 17242)
+++ userguide/trunk/recent-changes.tex  2014-01-22 16:06:49 UTC (rev 17243)
@@ -21,6 +21,14 @@
 
 \rcSect[next-release]{Next Release}
 
+\rcSubsect{January 2014}
+
+A new plugin that allows for document normalization has been added. This
+plugin is predominantly aimed at normalizing punctuation symbols (e.g.
+replacing Word-style apostrophes and hyphens with their ASCII equivalents)
+to provide a common baseline for further components. See Section
+\ref{sec:misc-creole:doc-normalizer} for further details.
+
 \rcSubsect{December 2013}
 
 The Relations API (Section \ref{sec:api:relations}) has been updated as
