Author: rezan
Date: Thu Jan 8 22:47:50 2015
New Revision: 1650410
URL: http://svn.apache.org/r1650410
Log:
transformers
Modified:
devicemap/branches/2.0/data/README_PATTERNS
Modified: devicemap/branches/2.0/data/README_PATTERNS
URL: http://svn.apache.org/viewvc/devicemap/branches/2.0/data/README_PATTERNS?rev=1650410&r1=1650409&r2=1650410&view=diff
==============================================================================
--- devicemap/branches/2.0/data/README_PATTERNS (original)
+++ devicemap/branches/2.0/data/README_PATTERNS Thu Jan 8 22:47:50 2015
@@ -14,12 +14,13 @@ INPUT PARSING INTO PATTERN TOKENS
Each pattern file has a header. It defines the following attributes
which instruct the client how to parse the input:
+-Transformers: a set of regular expressions, TODO: define better
-Token separators: a list of strings
-N-gram size: an int
--Transformers: a set of regular expressions, TODO: define this better
-The input gets tokenized using the separators. It then gets n-gram'ed. The
-default n-gram size is 1. Each ngram is then passed thru optional transformers.
+The input gets transformed thru the transformers (optional). Then it gets
+tokenized using the separators. No blank tokens. It then gets n-gram'ed.
+The default n-gram size is 1.
The output of this process is a stream of pattern tokens which are passed into
the pattern matcher as they are processed. Patterns must be streamed in order.
@@ -27,13 +28,16 @@ If n-gram > 1 is configured, the largest
moving onto the smaller ones.
So for example, a domain can set its separator as a space, n-gram size of 2,
-and a lowercasing transformer expression. The following string:
+and a lowercase transformer and a number transformer: [0-9]+ => _NUM.
+The following string:
-A 12 xyZ
+Original: A 12 xyZ
-Produces the following pattern token stream:
+Post transform: a _NUM xyz
-a12, a, 12xyz, 12, xyz
+Tokens: a, _NUM, xyz
+
+Pattern token stream: a_NUM, a, _NUMxyz, _NUM, xyz
###
PATTERN TOKEN MATCHING
@@ -104,6 +108,6 @@ a key value map along with the PatternId
Also, at this point, we can have an optional post processing step. The attribute
map can contain regex parsing rules which can be applied to the original string to
-extract detailed information. TODO: this needs to be defined better
+extract detailed information into new attributes. TODO: define better
TODO: Null pattern needs to be defined
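
The parsing order described in the revised README text (transform, then tokenize with no blank tokens, then n-gram with the largest n-gram first) can be sketched as below. This is only an illustrative sketch, not the DeviceMap client implementation: the function names (transform, tokenize, ngram_stream, pattern_tokens) are hypothetical, transformers are modeled as plain Python callables rather than whatever form the pattern file header actually uses, and applying the lowercase transformer before the number transformer is an assumption made so that _NUM keeps its case.

```python
import re


def transform(text, transformers):
    """Apply each transformer to the whole input, in order (assumed order)."""
    for fn in transformers:
        text = fn(text)
    return text


def tokenize(text, separators):
    """Split on any separator string and drop blank tokens."""
    if not separators:
        return [text] if text else []
    # Longer separators first so they win over their own prefixes.
    ordered = sorted(separators, key=len, reverse=True)
    pattern = "|".join(re.escape(s) for s in ordered)
    return [tok for tok in re.split(pattern, text) if tok]


def ngram_stream(tokens, n=1):
    """At each position, emit the largest n-gram first, then smaller ones."""
    stream = []
    for i in range(len(tokens)):
        for size in range(n, 0, -1):
            if i + size <= len(tokens):
                # n-grams are concatenated with no joiner, as in the README example.
                stream.append("".join(tokens[i:i + size]))
    return stream


def pattern_tokens(text, transformers, separators, n=1):
    """Full pipeline: transform -> tokenize -> n-gram."""
    return ngram_stream(tokenize(transform(text, transformers), separators), n)


# Hypothetical transformers matching the README example.
lowercase = str.lower
number = lambda s: re.sub(r"[0-9]+", "_NUM", s)

if __name__ == "__main__":
    print(pattern_tokens("A 12 xyZ", [lowercase, number], [" "], n=2))
```

With the README's example input "A 12 xyZ", a space separator, and an n-gram size of 2, this sketch yields the same stream the README lists: a_NUM, a, _NUMxyz, _NUM, xyz.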