README_PATTERNS

rezan Fri, 09 Jan 2015 14:16:04 -0800

Author: rezan
Date: Fri Jan  9 22:15:31 2015
New Revision: 1650683

URL: http://svn.apache.org/r1650683
Log:
updated


Modified:
    devicemap/branches/2.0/data/README_PATTERNS

Modified: devicemap/branches/2.0/data/README_PATTERNS
URL: http://svn.apache.org/viewvc/devicemap/branches/2.0/data/README_PATTERNS?rev=1650683&r1=1650682&r2=1650683&view=diff
==============================================================================
--- devicemap/branches/2.0/data/README_PATTERNS (original)
+++ devicemap/branches/2.0/data/README_PATTERNS Fri Jan  9 22:15:31 2015
@@ -5,8 +5,6 @@ Draft 1, 2014-01-09
 
 This is the DeviceMap data specification for patterns and attributes.
 
-All encodings in this document are UTF8.
-
 === Overview ===
 
 This document goes over how the DeviceMap data domains are defined and how the
@@ -56,16 +54,16 @@ This step parses the input string and cr
 
 Each pattern file defines the domain input parsing rules:
 
- inputTransformers::
+ InputTransformers::
  :: Type: list of transformation steps
  :: Optional. Default: none
  :: TODO: define what exactly these can be.
 
- tokenSeparators::
+ TokenSeparators::
  :: Type: list of token separator strings
  :: Optional. Default: none
 
- ngramConcatSize::
+ NgramConcatSize::
  :: Type: greater than zero integer
  :: Optional. Default: 1
 
@@ -82,18 +80,18 @@ When a token is created and added to the
 pattern matching step before moving on to the next token. This algorithm is pipeline
 and thread safe.
 
-If the ngramConcatSize is greater than 1, the largest ngram must be
+If the Ngram``Concat``Size is greater than 1, the largest ngram must be
 made first before creating the smaller ngrams.
 
 
 === Example ===
 
 {{{
-inputTransformers: lowercase, [0-9]+ => _NUM
-tokenSeparators:   [space]
-ngramConcatSize:   2
+InputTransformers: lowercase, s/[0-9]+/_NUM/g, s/-//g
+TokenSeparators:   [space]
+NgramConcatSize:   2
 
-Input string:  A 12 xyZ
+Input string:  A 12 x-yZ
 
 Transform:     a _NUM xyz
 
@@ -120,7 +118,7 @@ and the highest ranking pattern is retur
 
 All the pattern types are prefixed with 'Simple'. This means that each pattern token is matched
 using a plain UTF8 string comparison. No regex or other syntax is allowed in Simple patterns.
-This allows the algorithm to use simple string hashing for matching. This gives maximum performance and scaling complexity equal to a hashtable implementation. A Simple``HashCount attribute can be optionally defined which hints the classifier as to how many unique hashes it would need to generate to support the pattern set.
+This allows the algorithm to use simple string hashing for matching. This gives maximum performance and scaling complexity equal to a hashtable implementation. A Simple``Hash``Count attribute can be optionally defined which hints the classifier as to how many unique hashes it would need to generate to support the pattern set.
 
 Pattern attributes:
 
@@ -147,7 +145,7 @@ Pattern attributes:
  Default::
  :: Type: boolean
  :: Optional. Default: false.
- :: Only 1 pattern can have a true value of false.
+ :: Only 1 pattern can have a true value.
 
 
 == PatternType ==
@@ -253,3 +251,12 @@ The attribute map must be immutable.
 If a null pattern is returned from the previous step, this must be safely returned.
 TODO: how?
 
+
+
+= Patch Files =
+
+The pattern and attribute files can be patched with a user created pattern and
+attribute file. In this case, parsing configurations override, pattern
+definitions get appended (you can override using pattern ranking), and attributes
+override using the Pattern``Id.
+
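
The patch rules added in the last hunk (parsing configurations override, pattern definitions get appended, attributes override by Pattern``Id) could be sketched roughly as below. This is only an illustration and not code from DeviceMap: the dictionary layout, the key names "config", "patterns" and "attributes", and the function name apply_patch are all hypothetical. It also assumes that a patched attribute entry replaces the whole attribute map for that PatternId, which the README does not state.

```python
def apply_patch(base, patch):
    """Merge a user patch into a base pattern/attribute set.

    Hypothetical layout (not defined by the README):
      "config":     dict of parsing settings, e.g. NgramConcatSize
      "patterns":   list of pattern definitions
      "attributes": dict keyed by PatternId
    """
    return {
        # Parsing configurations: patch values override base values.
        "config": {**base.get("config", {}), **patch.get("config", {})},
        # Pattern definitions: patch patterns are appended after the base ones.
        "patterns": list(base.get("patterns", [])) + list(patch.get("patterns", [])),
        # Attributes: the patch entry wins for any PatternId present in both
        # (assumption: the whole entry is replaced, not merged key by key).
        "attributes": {**base.get("attributes", {}), **patch.get("attributes", {})},
    }
```

The function returns a new structure and leaves both inputs untouched. Overriding an existing pattern through ranking, as the README mentions, would happen later at match time and is not modelled here.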

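The revised Example hunk (InputTransformers, TokenSeparators, NgramConcatSize: 2) can be followed step by step with a small sketch. The function names are made up for illustration, and two details are assumptions because the diff context does not show them: tokens in an ngram are concatenated with no separator, and for each starting token the largest ngram is emitted before the smaller ones.

```python
import re


def transform(text):
    # The three transformers from the example, applied in order:
    # lowercase, s/[0-9]+/_NUM/g, s/-//g
    text = text.lower()
    text = re.sub(r"[0-9]+", "_NUM", text)
    return text.replace("-", "")


def tokenize(text, separators):
    # Split on any of the separator strings; drop empty tokens.
    if not separators:
        return [text] if text else []
    pattern = "|".join(re.escape(s) for s in separators)
    return [tok for tok in re.split(pattern, text) if tok]


def ngrams(tokens, size):
    # Assumption: for each start position, emit the largest ngram first,
    # then the smaller ones, joining tokens without a separator.
    out = []
    for i in range(len(tokens)):
        for n in range(min(size, len(tokens) - i), 0, -1):
            out.append("".join(tokens[i:i + n]))
    return out
```

With the example input, transform("A 12 x-yZ") gives "a _NUM xyz", which matches the Transform line in the hunk. Splitting on a space and calling ngrams with size 2 then yields a_NUM, a, _NUMxyz, _NUM, xyz under the assumptions above.
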
svn commit: r1650683 - /devicemap/branches/2.0/data/README_PATTERNS
