[Devicemap Wiki] Update of "Patterns2" by rezan

Apache Wiki Fri, 09 Jan 2015 13:01:33 -0800

Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Devicemap Wiki" for 
change notification.


The "Patterns2" page has been changed by rezan:
https://wiki.apache.org/devicemap/Patterns2

Comment:
pattern spec

New page:
<<TableOfContents(2)>>

= Pattern Specification 2.0 =
Draft 1, 2014-01-09

This is the DeviceMap data specification for patterns and attributes.

All encodings in this document are UTF8.

=== Overview ===

This document goes over how the DeviceMap data domains are defined and how the
classifiers will process user input against the domains.

The classification process is broken down into three phases:

 * Input Parsing
 * Pattern Matching
 * Attribute Retrieval

The following definitions are used:

 input string::
 :: this is the string to be classified

 token stream::
 :: this is the list of tokens that result from the Input Parsing phase

 pattern::
 :: this is a complete pattern definition with an id, type, rank, and pattern 
tokens

 pattern tokens::
 :: these are the individual pattern strings which comprise a pattern

 pattern type::
 :: this defines how the pattern tokens must appear in the input string for the 
pattern to be valid

 matched tokens::
 :: these are pattern tokens which are successfully matched in the token stream

The pattern and attribute files are JSON objects. These objects will contain:

 * Format version
 * Name
 * Domain version
 * Description
 * Publish date



= Input Parsing =

This step parses the input string and creates the token stream.

Each pattern file defines the domain input parsing rules:

 input transformers::
 :: Type: list of transformation steps
 :: Optional. Default: none
 :: TODO: define what exactly these can be.

 token separators::
 :: Type: list of token seperator strings
 :: Optional. Default: none

 ngram concatenation size::
 :: Type: greater than zero integer
 :: Optional. Default: 1

The input string first gets processed thru the transformers.
Then it gets tokenized using the configured seperators. Then ngram
concatenation happens. The final result of these 3 steps is the token stream.


=== Notes ===

Empty tokens are removed from the tokenization step.

When a token is created and added to the token stream, it can be processed by 
the
pattern matching step before moving on to the next token. This algorithm is 
pipeline
and thread safe.

If the ngram concatenation size is greater than 1, the largest ngram must be
made first before creating the smaller ngrams.


=== Example ===

{{{
Transformer: lowercase, [0-9]+ => _NUM
Token separators: [space]
ngram concatenation size: 2

input string: A 12 xyZ

Post transform: a _NUM xyz

Post tokenization: a, _NUM, xyz

Post ngram (token stream): a_NUM, a, _NUMxyz, _NUM, xyz
}}}



= Pattern Matching =

This step processes the token stream and picks the highest ranking pattern which
matches on the stream.

The pattern file defines a set of patterns. Each pattern has 2 main attributes,
its pattern type and its pattern rank. The pattern
type defines how the pattern is supposed to be matched against the token stream.
The pattern rank defines how the pattern ranks against other patterns.

If the pattern type is successfully matched against the stream, it is now a 
candidate
for being returned. Candidates are ranked against each other using the pattern 
ranking
and the highest ranking pattern is returned.

All the pattern types are prefixed with 'Simple'. This means that each pattern 
token is matched
using a plain UTF8 string comparison. No regex or other syntax is allowed in 
Simple patterns.
This allows the algorithm us use string hashing for matching. This gives 
maximum performance
and scaling complexity equal to a hashtable implementation. A SimpleHashCount 
attribute can
be defined which hints the classifier as to how many unique hashes it would 
need to generate to
support the pattern set.

Pattern attributes:

 PatternId::
 :: Type: String
 :: Required.

 RankType::
 :: Type: string
 :: Required.

 RankValue::
 :: Type: integer
 :: Optional. Default: 0.
 :: Use defined by RankType.

 PatternType::
 :: Type: string
 :: Required.

 PatternTokens::
 :: Type: list of pattern token strings
 :: Required.

 Default::
 :: Type: boolean
 :: Optional. Default: false.
 :: Only 1 pattern can have a true value of false.


== PatternType ==

The following pattern types are defined:

 SimpleOrderedAnd::
 :: Each pattern token must appear in the token stream in index order, as 
defined in the PatternTokens list. Its okay for non matched tokens to appear 
inbetween matched tokens as long as the matched tokens are still in order.

 SimpleAnd::
 :: Each pattern token must appear in the token stream. Order does not matter.

 Simple::
 :: Only one pattern must appear in the token stream.


== RankType ==

The following rank types are defined:

 Strong::
 :: Strong patterns are ranked higher than Weak and None. The RankValue is 
ignored and they are ranked by their position in the pattern stream. The lower 
the position, the higher the rank. When a Strong pattern is found, the pattern 
matching step can stop and this pattern can be returned without analyzing the 
rest of the stream. This is because its impossible for another pattern to rank 
higher.

 Weak::
 :: Weak patterns are ranked below Strong but above None. A Weak pattern can 
only be returned in the absence of a Strong pattern. Weak patterns always rank 
higher than None patterns, regardless of the RankValue. The RankValue is used 
to rank between successfully matched Weak patterns.

 None::
 :: None patterns are ranked below Strong and Weak. A None pattern can only be 
returned in the absence of successful Strong and Weak patterns. The RankValue 
is used to rank between successfully matched None patterns.

In the case where 2 or more Weak or None patterns have the same RankValue 
resulting in a tie,
the pattern with the longest concatenated matched pattern length is used. If 
that results in
another tie, the pattern found first is returned.

If no pattern is successfully matched, the default pattern is returned. If no
default pattern is defined, a null pattern is returned.

=== Notes ===

If 2 or more patterns share the same PatternId, then only 1 of their 
PatternTypes
need to match. There is an implied OR between multiple PatternTypes with equal 
PatternId.

If more than 1 default is defined, the 1st one found in the Pattern file is 
used.

2 or more patterns cannot have identical RankType, RankValue, and matched 
tokens. Since they will be
found at the same time, the pattern the classifier chooses is undefined.


=== Examples ===

{{{
Pattern:
  PatternId: p1
  RankType: Strong
  PatternType: Simple
  PatternTokens: bingo, jackpot

Pattern:
  PatternId: p2
  RankType: Weak
  RankValue: 100
  PatternType: SimpleOrderedAnd
  PatternTokens: two, four, six

Pattern:
  PatternId: p3
  RankType: None
  RankValue: 1000
  PatternType: Simple
  PatternTokens: two, four, six

Token stream: one, two, three, four, five, six, seven
Pattern: p2

Token stream: one, two, three, six, five, four, seven
Pattern: p3

Token stream: one, two, three, four, five, six, bingo, seven
Pattern: p1
}}}



= Attribute Retrieval =

This step processes the result of the Pattern Matching step. The PatternId is 
used
to look up the corresponding attribute map. The patternId and the attribute map
are returned.


=== Attribute Parsing ===

An attribute map can contain attributes which are parsed out of the input 
string.

TODO: define this more


=== Notes ===

If no attribute map is found, an empty map is used.

The attribute map must be immutable.

If a null pattern is returned from the previous step, this must be safely 
signaled back.
TODO: how?

[Devicemap Wiki] Update of "Patterns2" by rezan

Reply via email to