Author: rezan
Date: Fri Jan 9 20:35:24 2015
New Revision: 1650652
URL: http://svn.apache.org/r1650652
Log:
patterns
Modified:
devicemap/branches/2.0/data/README_PATTERNS
Modified: devicemap/branches/2.0/data/README_PATTERNS
URL:
http://svn.apache.org/viewvc/devicemap/branches/2.0/data/README_PATTERNS?rev=1650652&r1=1650651&r2=1650652&view=diff
==============================================================================
--- devicemap/branches/2.0/data/README_PATTERNS (original)
+++ devicemap/branches/2.0/data/README_PATTERNS Fri Jan 9 20:35:24 2015
@@ -1,113 +1,267 @@
-Patterns 2.0 Draft 1
+= Pattern Specification 2.0 =
+Draft 1, 2014-01-09
-Everything in here is assumed to be UTF8.
+This is the DeviceMap data specification for patterns and attributes.
-It is completely valid for a client to return an initialization error
-if it cannot support the pattern file as configured.
+All encodings in this document are UTF8.
-The pattern file has a file format version number.
+==== Overview ====
-###
-INPUT PARSING INTO PATTERN TOKENS
-###
+This document goes over how the DeviceMap data domains are defined and how the
+classifiers will process user input against the domains.
-Each pattern file has a header. It defines the following attributes
-which instruct the client how to parse the input:
+The classification process is broken down into three phases:
--Transformers: a set of regular expressions, TODO: define better
--Token separators: a list of strings
--N-gram size: an int
+ * Input Parsing
+ * Pattern Matching
+ * Attribute Retrieval
-The input gets transformed thru the transformers (optional). Then it gets
-tokenized using the separators. No blank tokens. It then gets n-gram'ed.
-The default n-gram size is 1.
+The following definitions are used:
-The output of this process is a stream of pattern tokens which are passed into
-the pattern matcher as they are processed. Patterns must be streamed in order.
-If n-gram > 1 is configured, the largest n-gram needs to be process before
-moving onto the smaller ones.
+input string::
+::this is the string to be classified
-So for example, a domain can set its separator as a space, n-gram size of 2,
-and a lowercase transformer and a number transformer: [0-9]+ => _NUM.
-The following string:
+token stream::
+::this is the list of tokens that result from the Input Parsing phase
-Original: A 12 xyZ
+pattern::
+::this is a complete pattern definition with an id, type, rank, and pattern
tokens
+
+pattern tokens::
+::these are the individual pattern strings which comprise a pattern
+
+pattern type::
+::this defines how the pattern tokens must appear in the input string for the
pattern to be valid
+
+matched tokens::
+::these are pattern tokens which are successfully matched in the token stream
+
+The pattern and attribute files are JSON objects. These objects will contain:
+
+ * Format version
+ * Name
+ * Domain version
+ * Description
+ * Publish date
+
+
+
+== Input Parsing ==
+
+This step parses the input string and creates the token stream.
+
+Each pattern file defines the domain input parsing rules:
+
+input transformers::
+::Type: list of transformation steps
+::Optional. Default: none
+::TODO: define what exactly these can be.
+
+token separators::
+::Type: list of token seperator strings
+::Optional. Default: none
+
+ngram concatenation size::
+::Type: greater than zero integer
+::Optional. Default: 1
+
+The input string first gets processed thru the transformers.
+Then it gets tokenized using the configured seperators. Then ngram
+concatenation happens. The final result of these 3 steps is the token stream.
+
+
+==== Notes ====
+
+Empty tokens are removed from the tokenization step.
+
+When a token is created and added to the token stream, it can be processed by
the
+pattern matching step before moving on to the next token. This algorithm is
pipeline
+and thread safe.
+
+If the ngram concatenation size is greater than 1, the largest ngram must be
+made first before creating the smaller ngrams.
+
+
+==== Example ====
+
+{{{
+Transformer: lowercase, [0-9]+ => _NUM
+Token separators: [space]
+ngram concatenation size: 2
+
+input string: A 12 xyZ
Post transform: a _NUM xyz
-Tokens: a, _NUM, xyz
+Post tokenization: a, _NUM, xyz
+
+Post ngram (token stream): a_NUM, a, _NUMxyz, _NUM, xyz
+}}}
+
-Pattern token stream: a_NUM, a, _NUMxyz, _NUM, xyz
-###
-PATTERN TOKEN MATCHING
-###
-
-Each pattern file has a set of patterns. Each pattern defines its matching
-attributes. The highest ranking pattern is returned. All patterns are matched
-using simple UTF8 string matching. This allows for simple hashtable matching
-between a pattern token and the resulting pattern. So its very fast and has
-the scaling properties of the underlying hashtable implementation.
+== Pattern Matching ==
+
+This step processes the token stream and picks the highest ranking pattern
which
+matches on the stream..
+
+The pattern file defines a set of patterns. Each pattern has 2 main attributes,
+its pattern type and its pattern rank. The pattern
+type defines how the pattern is supposed to be matched against the token
stream.
+The pattern rank defines how the pattern ranks against other patterns.
+
+If the pattern type is successfully matched against the stream, it is now a
candidate
+for being returned. Candidates are ranked against each other using the pattern
ranking
+and the highest ranking pattern is returned.
+
+All the pattern types are prefixed with 'Simple'. This means that each pattern
token is matched
+using a plain UTF8 string comparison. No regex or other syntax is allowed in
Simple patterns.
+This allows the algorithm us use string hashing for matching. This gives
maximum performance
+and scaling complexity equal to a hashtable implementation. A SimpleHashCount
attribute can
+be defined which hints the classifier as to how many unique hashes it would
need to generate to
+support the pattern set.
Pattern attributes:
--PatternId: string, must be unique in the pattern set
--RankType: string
--Rank: int
--PatternType: string
--Pattern: object, its attributes are defined by the PatternType
-
-Also, 1 pattern can be configured to be the default pattern returned
-when no patterns match.
-
-RankType is either Strong, Weak, or None. Strong patterns ignore the Rank
-attribute and are ranked by their position in the pattern token stream.
-Therefor, when a strong pattern is matched, it can be immediately returned
-and no more processing is required. In the absence of a Strong pattern, the
-highest ranking Weak pattern is returned. In the absence of a Strong and Weak
-pattern, the highest ranking None pattern is returned. So a None pattern,
-regardless of its rank, can rank higher than a weak pattern. The same logic
-goes for a Weak and Strong patterns (Strong patterns have no rank). In the case
-of Weak and None, the whole Pattern stream must be processed. If no match is
-found, the default pattren is returned. If no default is defined, a null
pattern
-is returned. In all cases, just the patternId is returned.
-
-Rank is an integer used to rank patterns amoung their RankType. In the case of
-a tie, the pattern with the longest concatenate length of pattern tokens is
-returned. If that causes another tie, the first pattern found is returned.
+PatternId::
+::Type: String
+::Required.
+
+RankType::
+::Type: string
+::Required.
+
+RankValue::
+::Type: integer
+::Optional. Default: 0.
+::Use defined by RankType.
+
+PatternType::
+::Type: string
+::Required.
+
+PatternTokens::
+::Type: list of pattern token strings
+::Required.
+
+Default::
+::Type: boolean
+::Optional. Default: false.
+::Only 1 pattern can have a true value of false.
+
+
+==== PatternType ====
The following pattern types are defined:
-SimpleOrderedAnd - This pattern is an array of strings. Each string in the
array
-must appear in the pattern token stream in array index order. Its valid for
-other pattern tokens to appear inbetween the matched patterns and there is no
minimum
-or maximum proximity that the matched patterns must appear in.
-
-SimpleAnd - This pattern is an array of strings. Each string in the array must
-appear in the pattern token stream in any order. Its ok for other pattern
tokens
-to appear inbetween these patterns. No min or max proximity is definied.
-
-SimpleOr - This pattern is an array of strings. Only one of the strings must
-appear in the pattern stream for this pattern to be valid.
-
-Simple - This pattern is a single string. It must appear in the pattern stream
-to be valid.
-
-Note, while the word stream is used to define the set of pattern tokens,
-this is a single threaded serial process. Think of a for loop which iterates
-over the tokens, generates the pattern tokens, find pattern candidates, stores
-them, and then returns winning pattern.
-
-###
-PATTERN ATTRIBUTE LOOKUP AND OPTIONAL PARSING
-###
-
-When a PatternId is returned from the Matching phase, its used to lookup the
-matching attributes in the attributes file. The attributes are returned as
-a key value map along with the PatternId.
-
-Also, at this point, we can have an optional post processing step. The
attribute
-map can contain regex parsing rules which can be applied to the original
string to
-extract detailed information into new attributes. TODO: define better
+SimpleOrderedAnd::
+::Each pattern token must appear in the token stream in index orderi, as
defined
+in the PatternTokens list. Its okay for non matched tokens to appear inbetween
+matched tokens as long as the matched tokens are still in order.
+
+SimpleAnd::
+::Each pattern token must appear in the token stream. Order does not matter.
+
+Simple::
+::Only one pattern must appear in the token stream.
+
+
+==== RankType ====
+
+The following rank types are defined:
+
+Strong::
+::Strong patterns are ranked higher than Weak and None. The RankValue
+is ignored and they are ranked by their position
+in the pattern stream. The lower the position, the higher the rank.
+When a Strong pattern is found, the pattern matching step can stop and
+this pattern can be returned without analyzing the rest of the stream.
+This is because its impossible for another pattern to rank higher.
+
+Weak::
+::Weak patterns are ranked below Strong but above None. A Weak pattern can only
+be returned in the absence of a Strong pattern. Weak patterns always rank
higher
+than None patterns, regardless of the RankValue. The RankValue is used to rank
+between successfully matched Weak patterns.
+
+None::
+::None patterns are ranked below Strong and Weak. A None pattern can only be
+returned in the absence of successful Strong and Weak patterns. The RankValue
+is used to rank between successfully matched None patterns.
+
+In the case where 2 or more Weak or None patterns have the same RankValue
resulting in a tie,
+the pattern with the longest concatenated matched pattern length is used. If
that results in
+another tie, the pattern found first is returned.
+
+If no pattern is successfully matched, the default pattern is returned. If no
+default pattern is defined, a null pattern is returned.
+
+==== Notes ====
+
+If 2 or more patterns share the same PatternId, then only 1 of their
PatternTypes
+need to match. There is an implied OR between multiple PatternTypes with equal
PatternId.
+
+If more than 1 default is defined, the 1st one found in the Pattern file is
used.
+
+2 or more patterns cannot have identical RankType, RankValue, and matched
tokens. Since they will be
+found at the same time, the pattern the classifier chooses is undefined.
+
+
+==== Examples ====
+
+{{{
+Pattern:
+ PatternId: p1
+ RankType: Strong
+ PatternType: Simple
+ PatternTokens: bingo, jackpot
+
+Pattern:
+ PatternId: p2
+ RankType: Weak
+ RankValue: 100
+ PatternType: SimpleOrderedAnd
+ PatternTokens: two, four, six
+
+Pattern:
+ PatternId: p3
+ RankType: None
+ RankValue: 1000
+ PatternType: Simple
+ PatternTokens: two, four, six
+
+Token stream: one, two, three, four, five, six, seven
+Pattern: p2
+
+Token stream: one, two, three, six, five, four, seven
+Pattern: p3
+
+Token stream: one, two, three, four, five, six, bingo, seven
+Pattern: p1
+}}}
+
+
+
+== Attribute Retrieval ==
+
+This step processes the result of the Pattern Matching step. The PatternId is
used
+to look up the corresponding attribute map. The patternId and the attribute map
+are returned.
+
+
+==== Attribute Parsing ====
+
+An attribute map can contain attributes which are parsed out of the input
string.
+
+TODO: define this more
+
+
+==== Notes ====
+
+If no attribute map is found, an empty map is used.
+
+The attribute map must be immutable.
+
+If a null pattern is returned from the previous step, this must be safely
signaled back.
+TODO: how?
-TODO: Null pattern needs to be defined