README_PATTERNS

rezan Fri, 09 Jan 2015 12:42:25 -0800

Author: rezan
Date: Fri Jan  9 20:35:24 2015
New Revision: 1650652

URL: http://svn.apache.org/r1650652
Log:
patterns


Modified:
    devicemap/branches/2.0/data/README_PATTERNS

Modified: devicemap/branches/2.0/data/README_PATTERNS
URL: 
http://svn.apache.org/viewvc/devicemap/branches/2.0/data/README_PATTERNS?rev=1650652&r1=1650651&r2=1650652&view=diff
==============================================================================
--- devicemap/branches/2.0/data/README_PATTERNS (original)
+++ devicemap/branches/2.0/data/README_PATTERNS Fri Jan  9 20:35:24 2015
@@ -1,113 +1,267 @@
-Patterns 2.0 Draft 1
+= Pattern Specification 2.0 =
+Draft 1, 2014-01-09
 
-Everything in here is assumed to be UTF8.
+This is the DeviceMap data specification for patterns and attributes.
 
-It is completely valid for a client to return an initialization error
-if it cannot support the pattern file as configured.
+All encodings in this document are UTF8.
 
-The pattern file has a file format version number.
+==== Overview ====
 
-###
-INPUT PARSING INTO PATTERN TOKENS
-###
+This document goes over how the DeviceMap data domains are defined and how the
+classifiers will process user input against the domains.
 
-Each pattern file has a header. It defines the following attributes
-which instruct the client how to parse the input:
+The classification process is broken down into three phases:
 
--Transformers: a set of regular expressions, TODO: define better
--Token separators: a list of strings
--N-gram size: an int
+ * Input Parsing
+ * Pattern Matching
+ * Attribute Retrieval
 
-The input gets transformed thru the transformers (optional). Then it gets
-tokenized using the separators. No blank tokens. It then gets n-gram'ed.
-The default n-gram size is 1.
+The following definitions are used:
 
-The output of this process is a stream of pattern tokens which are passed into
-the pattern matcher as they are processed. Patterns must be streamed in order.
-If n-gram > 1 is configured, the largest n-gram needs to be process before
-moving onto the smaller ones.
+input string::
+::this is the string to be classified
 
-So for example, a domain can set its separator as a space, n-gram size of 2,
-and a lowercase transformer and a number transformer: [0-9]+ => _NUM.
-The following string:
+token stream::
+::this is the list of tokens that result from the Input Parsing phase
 
-Original: A 12 xyZ
+pattern::
+::this is a complete pattern definition with an id, type, rank, and pattern 
tokens
+
+pattern tokens::
+::these are the individual pattern strings which comprise a pattern
+
+pattern type::
+::this defines how the pattern tokens must appear in the input string for the 
pattern to be valid
+
+matched tokens::
+::these are pattern tokens which are successfully matched in the token stream
+
+The pattern and attribute files are JSON objects. These objects will contain:
+
+ * Format version
+ * Name
+ * Domain version
+ * Description
+ * Publish date
+
+
+
+== Input Parsing ==
+
+This step parses the input string and creates the token stream.
+
+Each pattern file defines the domain input parsing rules:
+
+input transformers::
+::Type: list of transformation steps
+::Optional. Default: none
+::TODO: define what exactly these can be.
+
+token separators::
+::Type: list of token seperator strings
+::Optional. Default: none
+
+ngram concatenation size::
+::Type: greater than zero integer
+::Optional. Default: 1
+
+The input string first gets processed thru the transformers.
+Then it gets tokenized using the configured seperators. Then ngram
+concatenation happens. The final result of these 3 steps is the token stream.
+
+
+==== Notes ====
+
+Empty tokens are removed from the tokenization step.
+
+When a token is created and added to the token stream, it can be processed by 
the
+pattern matching step before moving on to the next token. This algorithm is 
pipeline
+and thread safe.
+
+If the ngram concatenation size is greater than 1, the largest ngram must be
+made first before creating the smaller ngrams.
+
+
+==== Example ====
+
+{{{
+Transformer: lowercase, [0-9]+ => _NUM
+Token separators: [space]
+ngram concatenation size: 2
+
+input string: A 12 xyZ
 
 Post transform: a _NUM xyz
 
-Tokens: a, _NUM, xyz
+Post tokenization: a, _NUM, xyz
+
+Post ngram (token stream): a_NUM, a, _NUMxyz, _NUM, xyz
+}}}
+
 
-Pattern token stream: a_NUM, a, _NUMxyz, _NUM, xyz
 
-###
-PATTERN TOKEN MATCHING
-###
-
-Each pattern file has a set of patterns. Each pattern defines its matching
-attributes. The highest ranking pattern is returned. All patterns are matched
-using simple UTF8 string matching. This allows for simple hashtable matching
-between a pattern token and the resulting pattern. So its very fast and has
-the scaling properties of the underlying hashtable implementation.
+== Pattern Matching ==
+
+This step processes the token stream and picks the highest ranking pattern 
which
+matches on the stream..
+
+The pattern file defines a set of patterns. Each pattern has 2 main attributes,
+its pattern type and its pattern rank. The pattern
+type defines how the pattern is supposed to be matched against the token 
stream.
+The pattern rank defines how the pattern ranks against other patterns.
+
+If the pattern type is successfully matched against the stream, it is now a 
candidate
+for being returned. Candidates are ranked against each other using the pattern 
ranking
+and the highest ranking pattern is returned.
+
+All the pattern types are prefixed with 'Simple'. This means that each pattern 
token is matched
+using a plain UTF8 string comparison. No regex or other syntax is allowed in 
Simple patterns.
+This allows the algorithm us use string hashing for matching. This gives 
maximum performance
+and scaling complexity equal to a hashtable implementation. A SimpleHashCount 
attribute can
+be defined which hints the classifier as to how many unique hashes it would 
need to generate to
+support the pattern set.
 
 Pattern attributes:
 
--PatternId: string, must be unique in the pattern set
--RankType: string
--Rank: int
--PatternType: string
--Pattern: object, its attributes are defined by the PatternType
-
-Also, 1 pattern can be configured to be the default pattern returned
-when no patterns match.
-
-RankType is either Strong, Weak, or None. Strong patterns ignore the Rank
-attribute and are ranked by their position in the pattern token stream.
-Therefor, when a strong pattern is matched, it can be immediately returned
-and no more processing is required. In the absence of a Strong pattern, the
-highest ranking Weak pattern is returned. In the absence of a Strong and Weak
-pattern, the highest ranking None pattern is returned. So a None pattern,
-regardless of its rank, can rank higher than a weak pattern. The same logic
-goes for a Weak and Strong patterns (Strong patterns have no rank). In the case
-of Weak and None, the whole Pattern stream must be processed. If no match is
-found, the default pattren is returned. If no default is defined, a null 
pattern
-is returned. In all cases, just the patternId is returned.
-
-Rank is an integer used to rank patterns amoung their RankType. In the case of
-a tie, the pattern with the longest concatenate length of pattern tokens is
-returned. If that causes another tie, the first pattern found is returned.
+PatternId::
+::Type: String
+::Required.
+
+RankType::
+::Type: string
+::Required.
+
+RankValue::
+::Type: integer
+::Optional. Default: 0.
+::Use defined by RankType.
+
+PatternType::
+::Type: string
+::Required.
+
+PatternTokens::
+::Type: list of pattern token strings
+::Required.
+
+Default::
+::Type: boolean
+::Optional. Default: false.
+::Only 1 pattern can have a true value of false.
+
+
+==== PatternType ====
 
 The following pattern types are defined:
 
-SimpleOrderedAnd - This pattern is an array of strings. Each string in the 
array
-must appear in the pattern token stream in array index order. Its valid for
-other pattern tokens to appear inbetween the matched patterns and there is no 
minimum
-or maximum proximity that the matched patterns must appear in.
-
-SimpleAnd - This pattern is an array of strings. Each string in the array must
-appear in the pattern token stream in any order. Its ok for other pattern 
tokens
-to appear inbetween these patterns. No min or max proximity is definied.
-
-SimpleOr - This pattern is an array of strings. Only one of the strings must
-appear in the pattern stream for this pattern to be valid.
-
-Simple - This pattern is a single string. It must appear in the pattern stream
-to be valid.
-
-Note, while the word stream is used to define the set of pattern tokens,
-this is a single threaded serial process. Think of a for loop which iterates
-over the tokens, generates the pattern tokens, find pattern candidates, stores
-them, and then returns winning pattern.
-
-###
-PATTERN ATTRIBUTE LOOKUP AND OPTIONAL PARSING
-###
-
-When a PatternId is returned from the Matching phase, its used to lookup the
-matching attributes in the attributes file. The attributes are returned as
-a key value map along with the PatternId.
-
-Also, at this point, we can have an optional post processing step. The 
attribute
-map can contain regex parsing rules which can be applied to the original 
string to
-extract detailed information into new attributes. TODO: define better
+SimpleOrderedAnd::
+::Each pattern token must appear in the token stream in index orderi, as 
defined
+in the PatternTokens list. Its okay for non matched tokens to appear inbetween
+matched tokens as long as the matched tokens are still in order.
+
+SimpleAnd::
+::Each pattern token must appear in the token stream. Order does not matter.
+
+Simple::
+::Only one pattern must appear in the token stream.
+
+
+==== RankType ====
+
+The following rank types are defined:
+
+Strong::
+::Strong patterns are ranked higher than Weak and None. The RankValue
+is ignored and they are ranked by their position
+in the pattern stream. The lower the position, the higher the rank.
+When a Strong pattern is found, the pattern matching step can stop and
+this pattern can be returned without analyzing the rest of the stream.
+This is because its impossible for another pattern to rank higher.
+
+Weak::
+::Weak patterns are ranked below Strong but above None. A Weak pattern can only
+be returned in the absence of a Strong pattern. Weak patterns always rank 
higher
+than None patterns, regardless of the RankValue. The RankValue is used to rank
+between successfully matched Weak patterns.
+
+None::
+::None patterns are ranked below Strong and Weak. A None pattern can only be
+returned in the absence of successful Strong and Weak patterns. The RankValue
+is used to rank between successfully matched None patterns.
+
+In the case where 2 or more Weak or None patterns have the same RankValue 
resulting in a tie,
+the pattern with the longest concatenated matched pattern length is used. If 
that results in
+another tie, the pattern found first is returned.
+
+If no pattern is successfully matched, the default pattern is returned. If no
+default pattern is defined, a null pattern is returned.
+
+==== Notes ====
+
+If 2 or more patterns share the same PatternId, then only 1 of their 
PatternTypes
+need to match. There is an implied OR between multiple PatternTypes with equal 
PatternId.
+
+If more than 1 default is defined, the 1st one found in the Pattern file is 
used.
+
+2 or more patterns cannot have identical RankType, RankValue, and matched 
tokens. Since they will be
+found at the same time, the pattern the classifier chooses is undefined.
+
+
+==== Examples ====
+
+{{{
+Pattern:
+  PatternId: p1
+  RankType: Strong
+  PatternType: Simple
+  PatternTokens: bingo, jackpot
+
+Pattern:
+  PatternId: p2
+  RankType: Weak
+  RankValue: 100
+  PatternType: SimpleOrderedAnd
+  PatternTokens: two, four, six
+
+Pattern:
+  PatternId: p3
+  RankType: None
+  RankValue: 1000
+  PatternType: Simple
+  PatternTokens: two, four, six
+
+Token stream: one, two, three, four, five, six, seven
+Pattern: p2
+
+Token stream: one, two, three, six, five, four, seven
+Pattern: p3
+
+Token stream: one, two, three, four, five, six, bingo, seven
+Pattern: p1
+}}}
+
+
+
+== Attribute Retrieval ==
+
+This step processes the result of the Pattern Matching step. The PatternId is 
used
+to look up the corresponding attribute map. The patternId and the attribute map
+are returned.
+
+
+==== Attribute Parsing ====
+
+An attribute map can contain attributes which are parsed out of the input 
string.
+
+TODO: define this more
+
+
+==== Notes ====
+
+If no attribute map is found, an empty map is used.
+
+The attribute map must be immutable.
+
+If a null pattern is returned from the previous step, this must be safely 
signaled back.
+TODO: how?
 
-TODO: Null pattern needs to be defined

svn commit: r1650652 - /devicemap/branches/2.0/data/README_PATTERNS

Reply via email to