Revision: 17243
http://sourceforge.net/p/gate/code/17243
Author: markagreenwood
Date: 2014-01-22 16:06:49 +0000 (Wed, 22 Jan 2014)
Log Message:
-----------
documentation for the document normalizer plugin
Modified Paths:
--------------
userguide/trunk/misc-creole.tex
userguide/trunk/recent-changes.tex
Modified: userguide/trunk/misc-creole.tex
===================================================================
--- userguide/trunk/misc-creole.tex 2014-01-22 15:26:45 UTC (rev 17242)
+++ userguide/trunk/misc-creole.tex 2014-01-22 16:06:49 UTC (rev 17243)
@@ -3347,3 +3347,33 @@
% OBVIOUSLY I'm not finished documenting this.
% -- Adam
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
+
+%
+%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
+\sect[sec:misc-creole:doc-normalizer]{Document Normalizer}
+%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
+%
+A problem that occurs quite frequently when processing text documents created
+with modern WYSIWYG editors (Word is the main culprit) is that standard
+punctuation symbols, such as apostrophes and hyphens, are silently replaced
+by symbols that look \textit{``nicer''}. While there may be a good reason
+behind this substitution (e.g. printouts look better), it plays havoc with
+text processing. For example, a tokenizer that handles words with apostrophes
+in them will produce different output, and gazetteers are likely to use
+standard ASCII characters for hyphens and apostrophes.
+
+Whilst it may be possible to modify all processing resources to handle all
+the different forms of each punctuation symbol, doing so would be both tedious
+and error-prone. A better solution is to modify the documents as part of
+the processing pipeline, replacing these characters with their normalized
+versions.
+
+This plugin normalizes punctuation (or any other characters) by editing
+the document content to replace them. Note that because this plugin edits the
+document content, it should be run as the first PR in the pipeline in order
+to avoid problems with changes in annotation spans etc.
+
+The normalizations are controlled via a simple configuration file in which
+a pair of lines describes a single normalization; the first line is a regular
+expression describing the text to replace, and the second line is the
+replacement.
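+
+As an illustration only, a configuration file following this two-line scheme
+might look like the sketch below. The patterns shown are assumptions made for
+this example rather than the contents of any file shipped with the plugin, and
+the \verb|\uXXXX| escapes assume that the expressions use Java regular
+expression syntax.
+
+\begin{verbatim}
+[\u2018\u2019]
+'
+[\u2010-\u2015]
+-
+\end{verbatim}
+
+Under these assumptions, the first pair would replace left and right single
+quotation marks with the ASCII apostrophe, and the second pair would replace
+the Unicode hyphens and dashes in the range U+2010 to U+2015 with the ASCII
+hyphen-minus.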
Modified: userguide/trunk/recent-changes.tex
===================================================================
--- userguide/trunk/recent-changes.tex 2014-01-22 15:26:45 UTC (rev 17242)
+++ userguide/trunk/recent-changes.tex 2014-01-22 16:06:49 UTC (rev 17243)
@@ -21,6 +21,14 @@
\rcSect[next-release]{Next Release}
+\rcSubsect{January 2014}
+
+A new plugin that allows for document normalization has been added. This
+plugin is predominantly aimed at normalizing punctuation symbols (i.e.
+replacing Word-style apostrophes and hyphens with their ASCII equivalents)
+to provide a common baseline for further components. See Section
+\ref{sec:misc-creole:doc-normalizer} for further details.
+
\rcSubsect{December 2013}
The Relations API (Section \ref{sec:api:relations}) has been updated as