Author: rezan
Date: Fri Jan 9 22:15:31 2015
New Revision: 1650683
URL: http://svn.apache.org/r1650683
Log:
updated
Modified:
devicemap/branches/2.0/data/README_PATTERNS
Modified: devicemap/branches/2.0/data/README_PATTERNS
URL: http://svn.apache.org/viewvc/devicemap/branches/2.0/data/README_PATTERNS?rev=1650683&r1=1650682&r2=1650683&view=diff
==============================================================================
--- devicemap/branches/2.0/data/README_PATTERNS (original)
+++ devicemap/branches/2.0/data/README_PATTERNS Fri Jan 9 22:15:31 2015
@@ -5,8 +5,6 @@ Draft 1, 2014-01-09
This is the DeviceMap data specification for patterns and attributes.
-All encodings in this document are UTF8.
-
=== Overview ===
This document goes over how the DeviceMap data domains are defined and how the
@@ -56,16 +54,16 @@ This step parses the input string and cr
Each pattern file defines the domain input parsing rules:
- inputTransformers::
+ InputTransformers::
:: Type: list of transformation steps
:: Optional. Default: none
:: TODO: define what exactly these can be.
- tokenSeparators::
+ TokenSeparators::
:: Type: list of token separator strings
:: Optional. Default: none
- ngramConcatSize::
+ NgramConcatSize::
:: Type: greater than zero integer
:: Optional. Default: 1
@@ -82,18 +80,18 @@ When a token is created and added to the
pattern matching step before moving on to the next token. This algorithm is pipeline
and thread safe.
-If the ngramConcatSize is greater than 1, the largest ngram must be
+If the Ngram``Concat``Size is greater than 1, the largest ngram must be
made first before creating the smaller ngrams.
=== Example ===
{{{
-inputTransformers: lowercase, [0-9]+ => _NUM
-tokenSeparators: [space]
-ngramConcatSize: 2
+InputTransformers: lowercase, s/[0-9]+/_NUM/g, s/-//g
+TokenSeparators: [space]
+NgramConcatSize: 2
-Input string: A 12 xyZ
+Input string: A 12 x-yZ
Transform: a _NUM xyz
@@ -120,7 +118,7 @@ and the highest ranking pattern is retur
All the pattern types are prefixed with 'Simple'. This means that each pattern token is matched
using a plain UTF8 string comparison. No regex or other syntax is allowed in Simple patterns.
-This allows the algorithm to use simple string hashing for matching. This gives maximum performance and scaling complexity equal to a hashtable implementation. A Simple``HashCount attribute can be optionally defined which hints the classifier as to how many unique hashes it would need to generate to support the pattern set.
+This allows the algorithm to use simple string hashing for matching. This gives maximum performance and scaling complexity equal to a hashtable implementation. A Simple``Hash``Count attribute can be optionally defined which hints the classifier as to how many unique hashes it would need to generate to support the pattern set.
Pattern attributes:
@@ -147,7 +145,7 @@ Pattern attributes:
Default::
:: Type: boolean
:: Optional. Default: false.
- :: Only 1 pattern can have a true value of false.
+ :: Only 1 pattern can have a true value.
== PatternType ==
@@ -253,3 +251,12 @@ The attribute map must be immutable.
If a null pattern is returned from the previous step, this must be safely
returned.
TODO: how?
+
+
+= Patch Files =
+
+The pattern and attribute files can be patched with a user created pattern and
+attribute file. In this case, parsing configurations override, pattern
+definitions get appended (you can override using pattern ranking), and attributes
+override using the Pattern``Id.
+
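
Review note on the new "Patch Files" section: the three merge rules (parsing configuration overrides, pattern definitions are appended, attributes override by Pattern``Id) can be sketched as below. This is only an illustration under assumptions, not the DeviceMap implementation: the in-memory layout (a dict with "config", "patterns" and "attributes" keys), the "Rank" field, the sample values, and the function name apply_patch are all hypothetical, and the sketch assumes an attribute entry is replaced as a whole rather than merged key by key.

```python
def apply_patch(base, patch):
    """Return a new data set with `patch` layered on top of `base`.

    Assumed (hypothetical) layout of both arguments:
      "config":     dict of parsing settings, e.g. NgramConcatSize
      "patterns":   list of pattern definitions
      "attributes": dict mapping a PatternId to its attribute map
    """
    return {
        # Parsing configuration: a value in the patch overrides the base value.
        "config": {**base.get("config", {}), **patch.get("config", {})},
        # Pattern definitions: patch patterns are appended after the base ones.
        # A patch pattern can still win at match time via a higher rank.
        "patterns": list(base.get("patterns", [])) + list(patch.get("patterns", [])),
        # Attributes: the patch entry replaces the base entry with the same
        # PatternId (whole-entry replacement is an assumption of this sketch).
        "attributes": {**base.get("attributes", {}), **patch.get("attributes", {})},
    }


base = {
    "config": {"TokenSeparators": [" "], "NgramConcatSize": 2},
    "patterns": [{"PatternId": "p1", "Rank": 10}],
    "attributes": {"p1": {"vendor": "acme"}},
}
patch = {
    "config": {"NgramConcatSize": 3},
    "patterns": [{"PatternId": "p2", "Rank": 20}],
    "attributes": {"p1": {"vendor": "example"}},
}

merged = apply_patch(base, patch)
```

The function builds a new dict and leaves both inputs unmodified, which fits the requirement elsewhere in the document that the attribute map be immutable. Whether an attribute override should instead merge individual keys is not stated in the section and may be worth spelling out.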