README_PATTERNS

rezan Thu, 08 Jan 2015 14:48:07 -0800

Author: rezan
Date: Thu Jan  8 22:47:50 2015
New Revision: 1650410

URL: http://svn.apache.org/r1650410
Log:
transformers


Modified:
    devicemap/branches/2.0/data/README_PATTERNS

Modified: devicemap/branches/2.0/data/README_PATTERNS
URL: http://svn.apache.org/viewvc/devicemap/branches/2.0/data/README_PATTERNS?rev=1650410&r1=1650409&r2=1650410&view=diff
==============================================================================
--- devicemap/branches/2.0/data/README_PATTERNS (original)
+++ devicemap/branches/2.0/data/README_PATTERNS Thu Jan  8 22:47:50 2015
@@ -14,12 +14,13 @@ INPUT PARSING INTO PATTERN TOKENS
 Each pattern file has a header. It defines the following attributes
 which instruct the client how to parse the input:
 
+-Transformers: a set of regular expressions, TODO: define better
 -Token separators: a list of strings
 -N-gram size: an int
--Transformers: a set of regular expressions, TODO: define this better
 
-The input gets tokenized using the separators. It then gets n-gram'ed. The
-default n-gram size is 1. Each ngram is then passed thru optional transformers.
+The input gets transformed thru the transformers (optional). Then it gets
+tokenized using the separators. No blank tokens. It then gets n-gram'ed.
+The default n-gram size is 1.
 
 The output of this process is a stream of pattern tokens which are passed into
 the pattern matcher as they are processed. Patterns must be streamed in order.
@@ -27,13 +28,16 @@ If n-gram > 1 is configured, the largest
 moving onto the smaller ones.
 
 So for example, a domain can set its separator as a space, n-gram size of 2,
-and a lowercasing transformer expression. The following string:
+and a lowercase transformer and a number transformer: [0-9]+ => _NUM.
+The following string:
 
-A 12 xyZ
+Original: A 12 xyZ
 
-Produces the following pattern token stream:
+Post transform: a _NUM xyz
 
-a12, a, 12xyz, 12, xyz
+Tokens: a, _NUM, xyz
+
+Pattern token stream: a_NUM, a, _NUMxyz, _NUM, xyz
 
 ###
 PATTERN TOKEN MATCHING
@@ -104,6 +108,6 @@ a key value map along with the PatternId
 
 Also, at this point, we can have an optional post processing step. The attribute
 map can contain regex parsing rules which can be applied to the original string to
-extract detailed information. TODO: this needs to be defined better
+extract detailed information into new attributes. TODO: define better
 
 TODO: Null pattern needs to be defined
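
The revised parsing order in the diff above (transform, then tokenize, then
n-gram) can be sketched in Python as follows. This is only an illustration
under stated assumptions, not DeviceMap client code: the function names are
hypothetical, transformers are modelled as plain callables applied in order,
and the emission order (at each token position, the largest n-gram first, then
smaller ones) is inferred from the README example rather than from a formal
definition.

```python
import re


def apply_transformers(text, transformers):
    """Run the whole input through each transformer, in order."""
    for transformer in transformers:
        text = transformer(text)
    return text


def tokenize(text, separators):
    """Split on any separator string and drop blank tokens."""
    if not separators:
        return [text] if text else []
    pattern = "|".join(re.escape(sep) for sep in separators)
    return [token for token in re.split(pattern, text) if token]


def ngram_stream(tokens, ngram_size=1):
    """Yield pattern tokens: at each position, largest n-gram first."""
    for start in range(len(tokens)):
        largest = min(ngram_size, len(tokens) - start)
        for size in range(largest, 0, -1):
            # N-grams are joined with no separator, as in "a_NUM".
            yield "".join(tokens[start:start + size])


def pattern_token_stream(text, transformers, separators, ngram_size=1):
    """Hypothetical end-to-end helper: transform, tokenize, n-gram."""
    transformed = apply_transformers(text, transformers)
    return ngram_stream(tokenize(transformed, separators), ngram_size)


# Assumed stand-ins for the README's two transformers. Lowercasing runs
# first; otherwise the "_NUM" replacement would itself be lowercased.
lowercase = str.lower
number = lambda s: re.sub(r"[0-9]+", "_NUM", s)

stream = list(pattern_token_stream("A 12 xyZ", [lowercase, number], [" "], 2))
print(stream)
```

With the README's sample input, a space separator, and an n-gram size of 2,
the sketch yields the same stream as the example in the diff: a_NUM, a,
_NUMxyz, _NUM, xyz.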

svn commit: r1650410 - /devicemap/branches/2.0/data/README_PATTERNS
