Repository: asterixdb
Updated Branches:
  refs/heads/master cb3ca25f3 -> 8bbf08131


[ASTERIXDB-2455][DOC] Deprecate AQL documentations

- user model changes: no
- storage format changes: no
- interface changes: no

details:
- Create [Deprecated] section and move AQL docs to there.
- Move some docs from /aql directory to /sqlpp directory.

Change-Id: I677dd7a8d114197eaa2ae93e0405184526b31a03
Reviewed-on: https://asterix-gerrit.ics.uci.edu/2977
Sonar-Qube: Jenkins <[email protected]>
Reviewed-by: Ian Maxon <[email protected]>
Tested-by: Jenkins <[email protected]>
Contrib: Jenkins <[email protected]>
Integration-Tests: Jenkins <[email protected]>


Project: http://git-wip-us.apache.org/repos/asf/asterixdb/repo
Commit: http://git-wip-us.apache.org/repos/asf/asterixdb/commit/8bbf0813
Tree: http://git-wip-us.apache.org/repos/asf/asterixdb/tree/8bbf0813
Diff: http://git-wip-us.apache.org/repos/asf/asterixdb/diff/8bbf0813

Branch: refs/heads/master
Commit: 8bbf08131bd679baab5aea11cd860c420bbc9216
Parents: cb3ca25
Author: Taewoo Kim <[email protected]>
Authored: Mon Sep 24 17:34:50 2018 -0700
Committer: Taewoo Kim <[email protected]>
Committed: Mon Sep 24 19:13:45 2018 -0700

----------------------------------------------------------------------
 .../src/site/markdown/aql/filters.md            | 147 ------------
 .../src/site/markdown/aql/fulltext.md           | 114 ----------
 .../src/site/markdown/aql/similarity.md         | 227 -------------------
 .../src/site/markdown/sqlpp/filters.md          | 147 ++++++++++++
 .../src/site/markdown/sqlpp/fulltext.md         | 114 ++++++++++
 .../src/site/markdown/sqlpp/similarity.md       | 227 +++++++++++++++++++
 asterixdb/asterix-doc/src/site/site.xml         |  22 +-
 7 files changed, 499 insertions(+), 499 deletions(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/asterixdb/blob/8bbf0813/asterixdb/asterix-doc/src/site/markdown/aql/filters.md
----------------------------------------------------------------------
diff --git a/asterixdb/asterix-doc/src/site/markdown/aql/filters.md b/asterixdb/asterix-doc/src/site/markdown/aql/filters.md
deleted file mode 100644
index 6b8e00f..0000000
--- a/asterixdb/asterix-doc/src/site/markdown/aql/filters.md
+++ /dev/null
@@ -1,147 +0,0 @@
-<!--
- ! Licensed to the Apache Software Foundation (ASF) under one
- ! or more contributor license agreements.  See the NOTICE file
- ! distributed with this work for additional information
- ! regarding copyright ownership.  The ASF licenses this file
- ! to you under the Apache License, Version 2.0 (the
- ! "License"); you may not use this file except in compliance
- ! with the License.  You may obtain a copy of the License at
- !
- !   http://www.apache.org/licenses/LICENSE-2.0
- !
- ! Unless required by applicable law or agreed to in writing,
- ! software distributed under the License is distributed on an
- ! "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
- ! KIND, either express or implied.  See the License for the
- ! specific language governing permissions and limitations
- ! under the License.
- !-->
-
-# Filter-Based LSM Index Acceleration
-
-## <a id="toc">Table of Contents</a>
-
-* [Motivation](#Motivation)
-* [Filters in AsterixDB](#FiltersInAsterixDB)
-* [Filters and Merge Policies](#FiltersAndMergePolicies)
-
-## <a id="Motivation">Motivation</a> <font size="4"><a href="#toc">[Back to TOC]</a></font>
-
-Traditional relational databases usually employ conventional index
-structures such as B+ trees due to their low read latency.  However,
-such traditional index structures use in-place writes to perform
-updates, resulting in costly random writes to disk. Today's emerging
-applications often involve insert-intensive workloads for which the
-cost of random writes prohibits efficient ingestion of
-data. Consequently, popular NoSQL systems such as Cassandra, HBase,
-LevelDB, BigTable, etc. have adopted Log-Structured Merge (LSM) Trees
-as their storage structure. LSM-trees avoids the cost of random writes
-by batching updates into a component of the index that resides in main
-memory -- an *in-memory component*. When the space occupancy of
-the in-memory component exceeds a specified threshold, its entries are
-*flushed* to disk forming a new component -- a *disk component*. As
-disk components accumulate on disk, they are periodically merged
-together subject to a *merge policy* that decides when and what to
-merge. The benefit of the LSM-trees comes at the cost of possibly
-sacrificing read efficiency, but, it has been shown in previous
-studies that these inefficiencies can be mostly mitigated.
-
-AsterixDB has also embraced LSM-trees, not just by using them as
-primary indexes, but also by using the same LSM-ification technique
-for all of its secondary index structures. In particular, AsterixDB
-adopted a generic framework for converting a class of indexes (that
-includes conventional B+ trees, R trees, and inverted indexes) into
-LSM-based secondary indexes, allowing higher data ingestion rates. In
-fact, for certain index structures, our results have shown that using
-an LSM-based version of an index can be made to significantly
-outperform its conventional counterpart for *both* ingestion
-and query speed (an example of such an index being the R-tree for
-spatial data).
-
-Since an LSM-based index naturally partitions data into multiple disk
-components, it is possible, when answering certain queries, to exploit
-partitioning to only access some components and safely filter out the
-remaining components, thus reducing query times. For instance,
-referring to our
-[TinySocial](primer.html#ADM:_Modeling_Semistructed_Data_in_AsterixDB)
-example, suppose a user always retrieves tweets from the
-`TweetMessages` dataset based on the `send-time` field (e.g., tweets
-posted in the last 24 hours). Since there is not a secondary index on
-the `send-time` field, the only available option for AsterixDB would
-be to scan the whole `TweetMessages` dataset and then apply the
-predicate as a post-processing step. However, if disk components of
-the primary index were tagged with the minimum and maximum timestamp
-values of the objects they contain, we could utilize the tagged
-information to directly access the primary index and prune components
-that do not match the query predicate. Thus, we could save substantial
-cost by avoiding scanning the whole dataset and only access the
-relevant components. We simply call such tagging information that are
-associated with components, filters. (Note that even if there were a
-secondary index on `send-time` field, using filters could save
-substantial cost by avoiding accessing the secondary index, followed
-by probing the primary index for every fetched entry.) Moreover, the
-same filtering technique can also be used with any secondary LSM index
-(e.g., an LSM R-tree), in case the query contains multiple predicates
-(e.g., spatial and temporal predicates), to obtain similar pruning
-power.
-
-## <a id="FiltersInAsterixDB">Filters in AsterixDB</a> <font size="4"><a href="#toc">[Back to TOC]</a></font>
-
-We have added support for LSM-based filters to all of AsterixDB's
-index types. To enable the use of filters, the user must specify the
-filter's key when creating a dataset, as shown below:
-
-#### Creating a Dataset with a Filter  ####
-
-        create dataset Tweets(TweetType) primary key tweetid with filter on send-time;
-
-Filters can be created on any totally ordered datatype (i.e., any
-field that can be indexed using a B+ -tree), such as integers,
-doubles, floats, UUIDs, datetimes, etc.
-
-When a dataset with a filter is created, the name of the filter's key
-field is persisted in the `Metadata.Dataset` dataset (which is the metadata
-dataset that stores the details of each dataset in an AsterixDB
-instance) so that DML operations against the dataset can recognize the
-existence of filters and can update them or utilize them
-accordingly. Creating a dataset with a filter in AsterixDB implies
-that the primary and all secondary indexes of that dataset will
-maintain filters on their disk components. Once a filtered dataset is
-created, the user can use the dataset normally (just like any other
-dataset). AsterixDB will automatically maintain the filters and will
-leverage them to efficiently answer queries whenever possible (i.e.,
-when a query has predicates on the filter's key).
-
-## <a id="FiltersAndMergePolicies">Filters and Merge Policies</a> <font size="4"><a href="#toc">[Back to TOC]</a></font>
-
-The AsterixDB default merge policy, the prefix merge policy, relies on
-component sizes and the number of components to decide which
-components to merge. This merge policy has proven to provide excellent
-performance for both ingestion and queries. However, when evaluating
-our filtering solution with the prefix policy, we observed a behavior
-that can reduce filter effectiveness. In particular, we noticed that
-under the prefix merge policy, the disk components of a secondary
-index tend to be constantly merged into a single component. This is
-because the prefix policy relies on a single size parameter for all of
-the indexes of a dataset. This parameter is typically chosen based on
-the sizes of the disk components of the primary index, which tend to
-be much larger than the sizes of the secondary indexes' disk
-components. This difference caused the prefix merge policy to behave
-similarly to the constant merge policy (i.e., relatively poorly) when
-applied to secondary indexes in the sense that the secondary indexes
-are constantly merged into a single disk component. Consequently, the
-effectiveness of filters on secondary indexes was greatly reduced
-under the prefix-merge policy, but they were still effective when
-probing the primary index.  Based on this behavior, we developed a new
-merge policy, an improved version of the prefix policy, called the
-correlated-prefix policy. The basic idea of this policy is that it
-delegates the decision of merging the disk components of all the
-indexes in a dataset to the primary index. When the policy decides
-that the primary index needs to be merged (using the same decision
-criteria as for the prefix policy), then it will issue successive
-merge requests to the I/O scheduler on behalf of all other indexes
-associated with the same dataset. The end result is that secondary
-indexes will always have the same number of disk components as their
-primary index under the correlated-prefix merge policy. This has
-improved query performance, since disk components of secondary indexes
-now have a much better chance of being pruned.

http://git-wip-us.apache.org/repos/asf/asterixdb/blob/8bbf0813/asterixdb/asterix-doc/src/site/markdown/aql/fulltext.md
----------------------------------------------------------------------
diff --git a/asterixdb/asterix-doc/src/site/markdown/aql/fulltext.md b/asterixdb/asterix-doc/src/site/markdown/aql/fulltext.md
deleted file mode 100644
index 1328ed9..0000000
--- a/asterixdb/asterix-doc/src/site/markdown/aql/fulltext.md
+++ /dev/null
@@ -1,114 +0,0 @@
-<!--
- ! Licensed to the Apache Software Foundation (ASF) under one
- ! or more contributor license agreements.  See the NOTICE file
- ! distributed with this work for additional information
- ! regarding copyright ownership.  The ASF licenses this file
- ! to you under the Apache License, Version 2.0 (the
- ! "License"); you may not use this file except in compliance
- ! with the License.  You may obtain a copy of the License at
- !
- !   http://www.apache.org/licenses/LICENSE-2.0
- !
- ! Unless required by applicable law or agreed to in writing,
- ! software distributed under the License is distributed on an
- ! "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
- ! KIND, either express or implied.  See the License for the
- ! specific language governing permissions and limitations
- ! under the License.
- !-->
-
-# AsterixDB  Support of Full-text search queries #
-
-## <a id="toc">Table of Contents</a> ##
-
-* [Motivation](#Motivation)
-* [Syntax](#Syntax)
-* [Creating and utilizing a Full-text index](#FulltextIndex)
-
-## <a id="Motivation">Motivation</a> <font size="4"><a href="#toc">[Back to TOC]</a></font> ##
-
-Full-Text Search (FTS) queries are widely used in applications where users need to find records that satisfy
-an FTS predicate, i.e., where simple string-based matching is not sufficient. These queries are important when
-finding documents that contain a certain keyword is crucial. FTS queries are different from substring matching
-queries in that FTS queries find their query predicates as exact keywords in the given string, rather than
-treating a query predicate as a sequence of characters. For example, an FTS query that finds “rain” correctly
-returns a document when it contains “rain” as a word. However, a substring-matching query returns a document
-whenever it contains “rain” as a substring, for instance, a document with “brain” or “training” would be
-returned as well.
-
-## <a id="Syntax">Syntax</a> <font size="4"><a href="#toc">[Back to TOC]</a></font> ##
-
-The syntax of AsterixDB FTS follows a portion of the XQuery FullText Search syntax.
-Two basic forms are as follows:
-
-        ftcontains(Expression1, Expression2, {FullTextOption})
-        ftcontains(Expression1, Expression2)
-
-For example, we can execute the following query to find Chirp messages where the `messageText` field includes
-“voice” as a word. Please note that an FTS search is case-insensitive.
-Thus, "Voice" or "voice" will be evaluated as the same word.
-
-        use TinySocial;
-
-        select element {"chirpId": msg.chirpId}
-        from ChirpMessages msg
-        where ftcontains(msg.messageText, "voice", {"mode":"any"});
-
-The DDL and DML of TinySocial can be found in [ADM: Modeling Semistructed Data in AsterixDB](../sqlpp/primer-sqlpp.html#ADM:_Modeling_Semistructed_Data_in_AsterixDB).
-
-The `Expression1` is an expression that should be evaluable as a string at runtime as in the above example
-where `msg.messageText` is a string field. The `Expression2` can be a string, an (un)ordered list
-of string value(s), or an expression. In the last case, the given expression should be evaluable
-into one of the first two types, i.e., into a string value or an (un)ordered list of string value(s).
-
-The following examples are all valid expressions.
-
-       ... where ftcontains(msg.messageText, "sound")
-       ... where ftcontains(msg.messageText, "sound", {"mode":"any"})
-       ... where ftcontains(msg.messageText, ["sound", "system"], {"mode":"any"})
-       ... where ftcontains(msg.messageText, {{"speed", "stand", "customization"}}, {"mode":"all"})
-
-The last `FullTextOption` parameter clarifies the given FTS request. If you omit the `FullTextOption` parameter,
-then the default value will be set for each possible option. Currently, we only have one option named `mode`.
-And as we extend the FTS feature, more options will be added. Please note that the format of `FullTextOption`
-is a record, thus you need to put the option(s) in a record `{}`.
-The `mode` option indicates whether the given FTS query is a conjunctive (AND) or disjunctive (OR) search request.
-This option can be either `“all”` (AND) or `“any”` (OR). The default value for `mode` is `“all”`. If one specifies `“any”`,
-a disjunctive search will be conducted. For example, the following query will find documents whose `messageText`
-field contains “sound” or “system”, so a document will be returned if it contains either “sound”, “system”,
-or both of the keywords.
-
-       ... where ftcontains(msg.messageText, ["sound", "system"], {"mode":"any"})
-
-The other option parameter,`“all”`, specifies a conjunctive search. The following examples will find the documents whose
-`messageText` field contains both “sound” and “system”. If a document contains only “sound” or “system” but
-not both, it will not be returned.
-
-       ... where ftcontains(msg.messageText, ["sound", "system"], {"mode":"all"})
-       ... where ftcontains(msg.messageText, ["sound", "system"])
-
-Currently AsterixDB doesn’t (yet) support phrase searches, so the following query will not work.
-
-       ... where ftcontains(msg.messageText, "sound system", {"mode":"any"})
-
-As a workaround solution, the following query can be used to achieve a roughly similar goal. The difference is that
-the following queries will find documents where `msg.messageText` contains both “sound” and “system”, but the order
-and adjacency of “sound” and “system” are not checked, unlike in a phrase search. As a result, the query below would
-also return documents with “sound system can be installed.”, “system sound is perfect.”,
-or “sound is not clear. You may need to install a new system.”
-
-       ... where ftcontains(msg.messageText, ["sound", "system"], {"mode":"all"})
-       ... where ftcontains(msg.messageText, ["sound", "system"])
-
-
-## <a id="FulltextIndex">Creating and utilizing a Full-text index</a> <font size="4"><a href="#toc">[Back to TOC]</a></font> ##
-
-When there is a full-text index on the field that is being searched, rather than scanning all records,
-AsterixDB can utilize that index to expedite the execution of a FTS query. To create a full-text index,
-you need to specify the index type as `fulltext` in your DDL statement. For instance, the following DDL
-statement create a full-text index on the `GleambookMessages.message` attribute. Note that a full-text index
-cannot be built on a dataset with the variable-length primary key (e.g., string).
-
-    use TinySocial;
-
-    create index messageFTSIdx on GleambookMessages(message) type fulltext;

http://git-wip-us.apache.org/repos/asf/asterixdb/blob/8bbf0813/asterixdb/asterix-doc/src/site/markdown/aql/similarity.md
----------------------------------------------------------------------
diff --git a/asterixdb/asterix-doc/src/site/markdown/aql/similarity.md b/asterixdb/asterix-doc/src/site/markdown/aql/similarity.md
deleted file mode 100644
index 8118126..0000000
--- a/asterixdb/asterix-doc/src/site/markdown/aql/similarity.md
+++ /dev/null
@@ -1,227 +0,0 @@
-<!--
- ! Licensed to the Apache Software Foundation (ASF) under one
- ! or more contributor license agreements.  See the NOTICE file
- ! distributed with this work for additional information
- ! regarding copyright ownership.  The ASF licenses this file
- ! to you under the Apache License, Version 2.0 (the
- ! "License"); you may not use this file except in compliance
- ! with the License.  You may obtain a copy of the License at
- !
- !   http://www.apache.org/licenses/LICENSE-2.0
- !
- ! Unless required by applicable law or agreed to in writing,
- ! software distributed under the License is distributed on an
- ! "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
- ! KIND, either express or implied.  See the License for the
- ! specific language governing permissions and limitations
- ! under the License.
- !-->
-
-# AsterixDB  Support of Similarity Queries #
-
-## <a id="toc">Table of Contents</a> ##
-
-* [Motivation](#Motivation)
-* [Data Types and Similarity Functions](#DataTypesAndSimilarityFunctions)
-* [Similarity Selection Queries](#SimilaritySelectionQueries)
-* [Similarity Join Queries](#SimilarityJoinQueries)
-* [Using Indexes to Support Similarity Queries](#UsingIndexesToSupportSimilarityQueries)
-
-## <a id="Motivation">Motivation</a> <font size="4"><a href="#toc">[Back to TOC]</a></font> ##
-
-Similarity queries are widely used in applications where users need to
-find objects that satisfy a similarity predicate, while exact matching
-is not sufficient. These queries are especially important for social
-and Web applications, where errors, abbreviations, and inconsistencies
-are common.  As an example, we may want to find all the movies
-starring Schwarzenegger, while we don't know the exact spelling of his
-last name (despite his popularity in both the movie industry and
-politics :-)). As another example, we want to find all the Facebook
-users who have similar friends. To meet this type of needs, AsterixDB
-supports similarity queries using efficient indexes and algorithms.
-
-## <a id="DataTypesAndSimilarityFunctions">Data Types and Similarity Functions</a> <font size="4"><a href="#toc">[Back to TOC]</a></font> ##
-
-AsterixDB supports [edit distance](http://en.wikipedia.org/wiki/Levenshtein_distance) (on strings) and
-[Jaccard](http://en.wikipedia.org/wiki/Jaccard_index) (on sets).  For
-instance, in our
-[TinySocial](../sqlpp/primer-sqlpp.html#ADM:_Modeling_Semistructed_Data_in_AsterixDB)
-example, the `friendIds` of a Gleambook user forms a set
-of friends, and we can define a similarity between the sets of
-friends of two users. We can also convert a string to a set of grams of a length "n"
-(called "n-grams") and define the Jaccard similarity between the two
-gram sets of the two strings. Formally, the "n-grams" of a string are
-its substrings of length "n". For instance, the 3-grams of the string
-`schwarzenegger` are `sch`, `chw`, `hwa`, ..., `ger`.
-
-AsterixDB provides
-[tokenization functions](../sqlpp/builtins.html#Tokenizing_Functions)
-to convert strings to sets, and the
-[similarity functions](../sqlpp/builtins.html#Similarity_Functions).
-
-## <a id="SimilaritySelectionQueries">Similarity Selection Queries</a> <font size="4"><a href="#toc">[Back to TOC]</a></font> ##
-
-The following query
-asks for all the Gleambook users whose name is similar to
-`Suzanna Tilson`, i.e., their edit distance is at most 2.
-
-        use TinySocial;
-
-        select u
-        from GleambookUsers u
-        where edit_distance(u.name, "Suzanna Tilson") <= 2;
-
-The following query
-asks for all the Gleambook users whose set of friend ids is
-similar to `[1,5,9,10]`, i.e., their Jaccard similarity is at least 0.6.
-
-        use TinySocial;
-
-        select u
-        from GleambookUsers u
-        where similarity_jaccard(u.friendIds, [1,5,9,10]) >= 0.6f;
-
-AsterixDB allows a user to use a similarity operator `~=` to express a
-condition by defining the similarity function and threshold
-using "set" statements earlier. For instance, the above query can be
-equivalently written as:
-
-        use TinySocial;
-
-        set simfunction "jaccard";
-        set simthreshold "0.6f";
-
-        select u
-        from GleambookUsers u
-        where u.friendIds ~= [1,5,9,10];
-
-In this query, we first declare Jaccard as the similarity function
-using `simfunction` and then specify the threshold `0.6f` using
-`simthreshold`.
-
-## <a id="SimilarityJoinQueries">Similarity Join Queries</a> <font size="4"><a href="#toc">[Back to TOC]</a></font> ##
-
-AsterixDB supports fuzzy joins between two sets. The following
-[query](../sqlpp/primer-sqlpp.html#Query_5_-_Fuzzy_Join)
-finds, for each Gleambook user, all Chirp users with names
-similar to their name based on the edit distance.
-
-        use TinySocial;
-
-        set simfunction "edit-distance";
-        set simthreshold "3";
-
-        select gbu.id, gbu.name, (select cu.screenName, cu.name
-                                  from ChirpUsers cu
-                                  where cu.name ~= gbu.name) as similar_users
-        from GleambookUsers gbu;
-
-## <a id="UsingIndexesToSupportSimilarityQueries">Using Indexes to Support Similarity Queries</a> <font size="4"><a href="#toc">[Back to TOC]</a></font> ##
-
-AsterixDB uses two types of indexes to support similarity queries, namely
-"ngram index" and "keyword index".
-
-### NGram Index ###
-
-An "ngram index" is constructed on a set of strings.  We generate n-grams for each string, and build an inverted
-list for each n-gram that includes the ids of the strings with this
-gram.  A similarity query can be answered efficiently by accessing the
-inverted lists of the grams in the query and counting the number of
-occurrences of the string ids on these inverted lists.  The similar
-idea can be used to answer queries with Jaccard similarity.  A
-detailed description of these techniques is available at this
-[paper](http://www.ics.uci.edu/~chenli/pub/icde2009-memreducer.pdf).
-
-For instance, the following DDL statements create an ngram index on the
-`GleambookUsers.name` attribute using an inverted index of 3-grams.
-
-        use TinySocial;
-
-        create index gbUserIdx on GleambookUsers(name) type ngram(3);
-
-The number "3" in "ngram(3)" is the length "n" in the grams. This
-index can be used to optimize similarity queries on this attribute
-using
-[edit_distance](../sqlpp/builtins.html#edit_distance),
-[edit_distance_check](../sqlpp/builtins.html#edit_distance_check),
-[similarity_jaccard](../sqlpp/builtins.html#similarity_jaccard),
-or [similarity_jaccard_check](../sqlpp/builtins.html#similarity_jaccard_check)
-queries on this attribute where the
-similarity is defined on sets of 3-grams.  This index can also be used
-to optimize queries with the "[contains()]((../sqlpp/builtins.html#contains))" predicate (i.e., substring
-matching) since it can be also be solved by counting on the inverted
-lists of the grams in the query string.
-
-#### NGram Index usage case - [edit_distance](../sqlpp/builtins.html#edit-distance) ####
-
-        use TinySocial;
-
-        select u
-        from GleambookUsers u
-        where edit_distance(u.name, "Suzanna Tilson") <= 2;
-
-#### NGram Index usage case - [edit_distance_check](../sqlpp/builtins.html#edit_distance_check) ####
-
-        use TinySocial;
-
-        select u
-        from GleambookUsers u
-        where edit_distance_check(u.name, "Suzanna Tilson", 2)[0];
-
-#### NGram Index usage case - [contains()]((../sqlpp/builtins.html#contains)) ####
-
-        use TinySocial;
-
-        select m
-        from GleambookMessages m
-        where contains(m.message, "phone");
-
-
-### Keyword Index ###
-
-A "keyword index" is constructed on a set of strings or sets (e.g., array, multiset). Instead of
-generating grams as in an ngram index, we generate tokens (e.g., words) and for each token, construct an inverted list that includes the ids of the
-objects with this token.  The following two examples show how to create keyword index on two different types:
-
-
-#### Keyword Index on String Type ####
-
-        use TinySocial;
-
-        drop index GleambookMessages.gbMessageIdx if exists;
-        create index gbMessageIdx on GleambookMessages(message) type keyword;
-
-        select m
-        from GleambookMessages m
-        where similarity_jaccard_check(word_tokens(m.message), word_tokens("love like ccast"), 0.2f)[0];
-
-#### Keyword Index on Multiset Type ####
-
-        use TinySocial;
-
-        create index gbUserIdxFIds on GleambookUsers(friendIds) type keyword;
-
-        select u
-        from GleambookUsers u
-        where similarity_jaccard_check(u.friendIds, {{3,10}}, 0.5f)[0];
-
-As shown above, keyword index can be used to optimize queries with token-based similarity predicates, including
-[similarity_jaccard](../sqlpp/builtins.html#similarity_jaccard) and
-[similarity_jaccard_check](../sqlpp/builtins.html#similarity_jaccard_check).
-
-#### Keyword Index usage case - [similarity_jaccard](../sqlpp/builtins.html#similarity_jaccard) ####
-
-        use TinySocial;
-
-        select u
-        from GleambookUsers u
-        where similarity_jaccard(u.friendIds, [1,5,9,10]) >= 0.6f;
-
-#### Keyword Index usage case - [similarity_jaccard_check](../sqlpp/builtins.html#similarity_jaccard_check) ####
-
-        use TinySocial;
-
-        select u
-        from GleambookUsers u
-        where similarity_jaccard_check(u.friendIds, [1,5,9,10], 0.6f)[0];
-

http://git-wip-us.apache.org/repos/asf/asterixdb/blob/8bbf0813/asterixdb/asterix-doc/src/site/markdown/sqlpp/filters.md
----------------------------------------------------------------------
diff --git a/asterixdb/asterix-doc/src/site/markdown/sqlpp/filters.md b/asterixdb/asterix-doc/src/site/markdown/sqlpp/filters.md
new file mode 100644
index 0000000..6b8e00f
--- /dev/null
+++ b/asterixdb/asterix-doc/src/site/markdown/sqlpp/filters.md
@@ -0,0 +1,147 @@
+<!--
+ ! Licensed to the Apache Software Foundation (ASF) under one
+ ! or more contributor license agreements.  See the NOTICE file
+ ! distributed with this work for additional information
+ ! regarding copyright ownership.  The ASF licenses this file
+ ! to you under the Apache License, Version 2.0 (the
+ ! "License"); you may not use this file except in compliance
+ ! with the License.  You may obtain a copy of the License at
+ !
+ !   http://www.apache.org/licenses/LICENSE-2.0
+ !
+ ! Unless required by applicable law or agreed to in writing,
+ ! software distributed under the License is distributed on an
+ ! "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ ! KIND, either express or implied.  See the License for the
+ ! specific language governing permissions and limitations
+ ! under the License.
+ !-->
+
+# Filter-Based LSM Index Acceleration
+
+## <a id="toc">Table of Contents</a>
+
+* [Motivation](#Motivation)
+* [Filters in AsterixDB](#FiltersInAsterixDB)
+* [Filters and Merge Policies](#FiltersAndMergePolicies)
+
+## <a id="Motivation">Motivation</a> <font size="4"><a href="#toc">[Back to TOC]</a></font>
+
+Traditional relational databases usually employ conventional index
+structures such as B+ trees due to their low read latency.  However,
+such traditional index structures use in-place writes to perform
+updates, resulting in costly random writes to disk. Today's emerging
+applications often involve insert-intensive workloads for which the
+cost of random writes prohibits efficient ingestion of
+data. Consequently, popular NoSQL systems such as Cassandra, HBase,
+LevelDB, BigTable, etc. have adopted Log-Structured Merge (LSM) Trees
+as their storage structure. LSM-trees avoids the cost of random writes
+by batching updates into a component of the index that resides in main
+memory -- an *in-memory component*. When the space occupancy of
+the in-memory component exceeds a specified threshold, its entries are
+*flushed* to disk forming a new component -- a *disk component*. As
+disk components accumulate on disk, they are periodically merged
+together subject to a *merge policy* that decides when and what to
+merge. The benefit of the LSM-trees comes at the cost of possibly
+sacrificing read efficiency, but, it has been shown in previous
+studies that these inefficiencies can be mostly mitigated.
+
+AsterixDB has also embraced LSM-trees, not just by using them as
+primary indexes, but also by using the same LSM-ification technique
+for all of its secondary index structures. In particular, AsterixDB
+adopted a generic framework for converting a class of indexes (that
+includes conventional B+ trees, R trees, and inverted indexes) into
+LSM-based secondary indexes, allowing higher data ingestion rates. In
+fact, for certain index structures, our results have shown that using
+an LSM-based version of an index can be made to significantly
+outperform its conventional counterpart for *both* ingestion
+and query speed (an example of such an index being the R-tree for
+spatial data).
+
+Since an LSM-based index naturally partitions data into multiple disk
+components, it is possible, when answering certain queries, to exploit
+partitioning to only access some components and safely filter out the
+remaining components, thus reducing query times. For instance,
+referring to our
+[TinySocial](primer.html#ADM:_Modeling_Semistructed_Data_in_AsterixDB)
+example, suppose a user always retrieves tweets from the
+`TweetMessages` dataset based on the `send-time` field (e.g., tweets
+posted in the last 24 hours). If there is no secondary index on
+the `send-time` field, the only available option for AsterixDB would
+be to scan the whole `TweetMessages` dataset and then apply the
+predicate as a post-processing step. However, if disk components of
+the primary index were tagged with the minimum and maximum timestamp
+values of the objects they contain, we could utilize the tagged
+information to directly access the primary index and prune components
+that do not match the query predicate. Thus, we could save substantial
+cost by accessing only the relevant components instead of scanning the
+whole dataset. We simply call such tagging information, which is
+associated with components, filters. (Note that even if there were a
+secondary index on the `send-time` field, using filters could save
+substantial cost by avoiding an access to the secondary index followed
+by a probe of the primary index for every fetched entry.) Moreover, the
+same filtering technique can also be used with any secondary LSM index
+(e.g., an LSM R-tree), in case the query contains multiple predicates
+(e.g., spatial and temporal predicates), to obtain similar pruning
+power.
+
+## <a id="FiltersInAsterixDB">Filters in AsterixDB</a> <font size="4"><a href="#toc">[Back to TOC]</a></font>
+
+We have added support for LSM-based filters to all of AsterixDB's
+index types. To enable the use of filters, the user must specify the
+filter's key when creating a dataset, as shown below:
+
+#### Creating a Dataset with a Filter  ####
+
+        create dataset Tweets(TweetType) primary key tweetid with filter on send-time;
+
+Filters can be created on any totally ordered datatype (i.e., any
+field that can be indexed using a B+ tree), such as integers,
+doubles, floats, UUIDs, datetimes, etc.
+
+When a dataset with a filter is created, the name of the filter's key
+field is persisted in the `Metadata.Dataset` dataset (which is the metadata
+dataset that stores the details of each dataset in an AsterixDB
+instance) so that DML operations against the dataset can recognize the
+existence of filters and can update them or utilize them
+accordingly. Creating a dataset with a filter in AsterixDB implies
+that the primary and all secondary indexes of that dataset will
+maintain filters on their disk components. Once a filtered dataset is
+created, the user can use the dataset normally (just like any other
+dataset). AsterixDB will automatically maintain the filters and will
+leverage them to efficiently answer queries whenever possible (i.e.,
+when a query has predicates on the filter's key).
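+
+For instance, a query of the following shape has a predicate on the
+filter's key and is therefore a candidate for component pruning. This
+is only a sketch: it assumes the `Tweets` dataset created above, and it
+assumes that the hyphenated field name has to be delimited with
+backquotes in SQL++; the datetime constant is an arbitrary
+illustrative value.
+
+        select value t
+        from Tweets t
+        where t.`send-time` >= datetime("2018-09-23T00:00:00.000Z");
+
+Only disk components whose minimum-to-maximum `send-time` range
+overlaps the predicate need to be accessed.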
+
+## <a id="FiltersAndMergePolicies">Filters and Merge Policies</a> <font size="4"><a href="#toc">[Back to TOC]</a></font>
+
+The AsterixDB default merge policy, the prefix merge policy, relies on
+component sizes and the number of components to decide which
+components to merge. This merge policy has proven to provide excellent
+performance for both ingestion and queries. However, when evaluating
+our filtering solution with the prefix policy, we observed a behavior
+that can reduce filter effectiveness. In particular, we noticed that
+under the prefix merge policy, the disk components of a secondary
+index tend to be constantly merged into a single component. This is
+because the prefix policy relies on a single size parameter for all of
+the indexes of a dataset. This parameter is typically chosen based on
+the sizes of the disk components of the primary index, which tend to
+be much larger than the sizes of the secondary indexes' disk
+components. This difference caused the prefix merge policy to behave
+similarly to the constant merge policy (i.e., relatively poorly) when
+applied to secondary indexes in the sense that the secondary indexes
+are constantly merged into a single disk component. Consequently, the
+effectiveness of filters on secondary indexes was greatly reduced
+under the prefix merge policy, but they were still effective when
+probing the primary index.  Based on this behavior, we developed a new
+merge policy, an improved version of the prefix policy, called the
+correlated-prefix policy. The basic idea of this policy is that it
+delegates the decision of merging the disk components of all the
+indexes in a dataset to the primary index. When the policy decides
+that the primary index needs to be merged (using the same decision
+criteria as for the prefix policy), then it will issue successive
+merge requests to the I/O scheduler on behalf of all other indexes
+associated with the same dataset. The end result is that secondary
+indexes will always have the same number of disk components as their
+primary index under the correlated-prefix merge policy. This has
+improved query performance, since disk components of secondary indexes
+now have a much better chance of being pruned.

http://git-wip-us.apache.org/repos/asf/asterixdb/blob/8bbf0813/asterixdb/asterix-doc/src/site/markdown/sqlpp/fulltext.md
----------------------------------------------------------------------
diff --git a/asterixdb/asterix-doc/src/site/markdown/sqlpp/fulltext.md b/asterixdb/asterix-doc/src/site/markdown/sqlpp/fulltext.md
new file mode 100644
index 0000000..1328ed9
--- /dev/null
+++ b/asterixdb/asterix-doc/src/site/markdown/sqlpp/fulltext.md
@@ -0,0 +1,114 @@
+<!--
+ ! Licensed to the Apache Software Foundation (ASF) under one
+ ! or more contributor license agreements.  See the NOTICE file
+ ! distributed with this work for additional information
+ ! regarding copyright ownership.  The ASF licenses this file
+ ! to you under the Apache License, Version 2.0 (the
+ ! "License"); you may not use this file except in compliance
+ ! with the License.  You may obtain a copy of the License at
+ !
+ !   http://www.apache.org/licenses/LICENSE-2.0
+ !
+ ! Unless required by applicable law or agreed to in writing,
+ ! software distributed under the License is distributed on an
+ ! "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ ! KIND, either express or implied.  See the License for the
+ ! specific language governing permissions and limitations
+ ! under the License.
+ !-->
+
+# AsterixDB  Support of Full-text search queries #
+
+## <a id="toc">Table of Contents</a> ##
+
+* [Motivation](#Motivation)
+* [Syntax](#Syntax)
+* [Creating and utilizing a Full-text index](#FulltextIndex)
+
+## <a id="Motivation">Motivation</a> <font size="4"><a href="#toc">[Back to TOC]</a></font> ##
+
+Full-Text Search (FTS) queries are widely used in applications where users need to find records that satisfy
+an FTS predicate, i.e., where simple string-based matching is not sufficient. These queries are important when
+finding documents that contain a certain keyword is crucial. FTS queries differ from substring-matching
+queries in that an FTS query matches its query predicate as exact keywords in the given string, rather than
+treating the query predicate as a sequence of characters. For example, an FTS query that searches for “rain”
+returns a document only when it contains “rain” as a word. A substring-matching query, however, returns a document
+whenever it contains “rain” as a substring, so a document with “brain” or “training” would be
+returned as well.
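+
+The difference can be sketched with the following pair of queries, which
+assume the `ChirpMessages` dataset of TinySocial and use the `ftcontains`
+function described under [Syntax](#Syntax) together with the `contains`
+substring-matching function. The first query matches only messages that
+contain “rain” as a word; the second would also match messages that
+contain “brain” or “training”.
+
+        use TinySocial;
+
+        select element {"chirpId": msg.chirpId}
+        from ChirpMessages msg
+        where ftcontains(msg.messageText, "rain");
+
+        select element {"chirpId": msg.chirpId}
+        from ChirpMessages msg
+        where contains(msg.messageText, "rain");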
+
+## <a id="Syntax">Syntax</a> <font size="4"><a href="#toc">[Back to TOC]</a></font> ##
+
+The syntax of AsterixDB FTS follows a portion of the XQuery FullText Search syntax.
+Two basic forms are as follows:
+
+        ftcontains(Expression1, Expression2, {FullTextOption})
+        ftcontains(Expression1, Expression2)
+
+For example, we can execute the following query to find Chirp messages where the `messageText` field includes
+“voice” as a word. Please note that an FTS search is case-insensitive.
+Thus, "Voice" and "voice" will be evaluated as the same word.
+
+        use TinySocial;
+
+        select element {"chirpId": msg.chirpId}
+        from ChirpMessages msg
+        where ftcontains(msg.messageText, "voice", {"mode":"any"});
+
+The DDL and DML of TinySocial can be found in [ADM: Modeling Semistructed Data in AsterixDB](../sqlpp/primer-sqlpp.html#ADM:_Modeling_Semistructed_Data_in_AsterixDB).
+
+`Expression1` is an expression that must evaluate to a string at runtime, as in the above example
+where `msg.messageText` is a string field. `Expression2` can be a string, an (un)ordered list
+of string value(s), or an expression. In the last case, the given expression must evaluate
+to one of the first two types, i.e., to a string value or an (un)ordered list of string value(s).
+
+The following examples are all valid expressions.
+
+       ... where ftcontains(msg.messageText, "sound")
+       ... where ftcontains(msg.messageText, "sound", {"mode":"any"})
+       ... where ftcontains(msg.messageText, ["sound", "system"], {"mode":"any"})
+       ... where ftcontains(msg.messageText, {{"speed", "stand", "customization"}}, {"mode":"all"})
+
+The last `FullTextOption` parameter clarifies the given FTS request. If you omit the `FullTextOption` parameter,
+then the default value will be set for each possible option. Currently, we only have one option named `mode`;
+more options will be added as we extend the FTS feature. Please note that the format of `FullTextOption`
+is a record, thus you need to put the option(s) in a record `{}`.
+The `mode` option indicates whether the given FTS query is a conjunctive (AND) or disjunctive (OR) search request.
+This option can be either `"all"` (AND) or `"any"` (OR). The default value for `mode` is `"all"`. If one specifies `"any"`,
+a disjunctive search will be conducted. For example, the following query will find documents whose `messageText`
+field contains “sound” or “system”, so a document will be returned if it contains either “sound”, “system”,
+or both of the keywords.
+
+       ... where ftcontains(msg.messageText, ["sound", "system"], {"mode":"any"})
+
+The other option value, `"all"`, specifies a conjunctive search. The following examples will find the documents whose
+`messageText` field contains both “sound” and “system”. If a document contains only “sound” or “system” but
+not both, it will not be returned.
+
+       ... where ftcontains(msg.messageText, ["sound", "system"], {"mode":"all"})
+       ... where ftcontains(msg.messageText, ["sound", "system"])
+
+Currently AsterixDB doesn’t (yet) support phrase searches, so the following query will not work.
+
+       ... where ftcontains(msg.messageText, "sound system", {"mode":"any"})
+
+As a workaround, the following queries can be used to achieve a roughly similar goal. The difference is that
+these queries will find documents where `msg.messageText` contains both “sound” and “system”, but the order
+and adjacency of “sound” and “system” are not checked, unlike in a phrase search. As a result, the queries below would
+also return documents with “sound system can be installed.”, “system sound is perfect.”,
+or “sound is not clear. You may need to install a new system.”
+
+       ... where ftcontains(msg.messageText, ["sound", "system"], {"mode":"all"})
+       ... where ftcontains(msg.messageText, ["sound", "system"])
+
+
+## <a id="FulltextIndex">Creating and utilizing a Full-text index</a> <font size="4"><a href="#toc">[Back to TOC]</a></font> ##
+
+When there is a full-text index on the field that is being searched, rather than scanning all records,
+AsterixDB can utilize that index to expedite the execution of an FTS query. To create a full-text index,
+you need to specify the index type as `fulltext` in your DDL statement. For instance, the following DDL
+statement creates a full-text index on the `GleambookMessages.message` attribute. Note that a full-text index
+cannot be built on a dataset with a variable-length primary key (e.g., string).
+
+    use TinySocial;
+
+    create index messageFTSIdx on GleambookMessages(message) type fulltext;
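+
+Once such an index exists, a query of the following form is a candidate for using it instead of
+scanning the dataset. This is a sketch: the keyword “phone” is only illustrative, and whether the
+index is actually chosen is decided by the optimizer.
+
+    use TinySocial;
+
+    select m
+    from GleambookMessages m
+    where ftcontains(m.message, "phone");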

http://git-wip-us.apache.org/repos/asf/asterixdb/blob/8bbf0813/asterixdb/asterix-doc/src/site/markdown/sqlpp/similarity.md
----------------------------------------------------------------------
diff --git a/asterixdb/asterix-doc/src/site/markdown/sqlpp/similarity.md b/asterixdb/asterix-doc/src/site/markdown/sqlpp/similarity.md
new file mode 100644
index 0000000..8118126
--- /dev/null
+++ b/asterixdb/asterix-doc/src/site/markdown/sqlpp/similarity.md
@@ -0,0 +1,227 @@
+<!--
+ ! Licensed to the Apache Software Foundation (ASF) under one
+ ! or more contributor license agreements.  See the NOTICE file
+ ! distributed with this work for additional information
+ ! regarding copyright ownership.  The ASF licenses this file
+ ! to you under the Apache License, Version 2.0 (the
+ ! "License"); you may not use this file except in compliance
+ ! with the License.  You may obtain a copy of the License at
+ !
+ !   http://www.apache.org/licenses/LICENSE-2.0
+ !
+ ! Unless required by applicable law or agreed to in writing,
+ ! software distributed under the License is distributed on an
+ ! "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ ! KIND, either express or implied.  See the License for the
+ ! specific language governing permissions and limitations
+ ! under the License.
+ !-->
+
+# AsterixDB  Support of Similarity Queries #
+
+## <a id="toc">Table of Contents</a> ##
+
+* [Motivation](#Motivation)
+* [Data Types and Similarity Functions](#DataTypesAndSimilarityFunctions)
+* [Similarity Selection Queries](#SimilaritySelectionQueries)
+* [Similarity Join Queries](#SimilarityJoinQueries)
+* [Using Indexes to Support Similarity Queries](#UsingIndexesToSupportSimilarityQueries)
+
+## <a id="Motivation">Motivation</a> <font size="4"><a href="#toc">[Back to TOC]</a></font> ##
+
+Similarity queries are widely used in applications where users need to
+find objects that satisfy a similarity predicate, while exact matching
+is not sufficient. These queries are especially important for social
+and Web applications, where errors, abbreviations, and inconsistencies
+are common.  As an example, we may want to find all the movies
+starring Schwarzenegger, while we don't know the exact spelling of his
+last name (despite his popularity in both the movie industry and
+politics :-)). As another example, we may want to find all the Facebook
+users who have similar friends. To meet these kinds of needs, AsterixDB
+supports similarity queries using efficient indexes and algorithms.
+
+## <a id="DataTypesAndSimilarityFunctions">Data Types and Similarity Functions</a> <font size="4"><a href="#toc">[Back to TOC]</a></font> ##
+
+AsterixDB supports [edit distance](http://en.wikipedia.org/wiki/Levenshtein_distance) (on strings) and
+[Jaccard](http://en.wikipedia.org/wiki/Jaccard_index) (on sets).  For
+instance, in our
+[TinySocial](../sqlpp/primer-sqlpp.html#ADM:_Modeling_Semistructed_Data_in_AsterixDB)
+example, the `friendIds` of a Gleambook user forms a set
+of friends, and we can define a similarity between the sets of
+friends of two users. We can also convert a string to a set of grams of a length "n"
+(called "n-grams") and define the Jaccard similarity between the two
+gram sets of the two strings. Formally, the "n-grams" of a string are
+its substrings of length "n". For instance, the 3-grams of the string
+`schwarzenegger` are `sch`, `chw`, `hwa`, ..., `ger`.
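+
+As a sketch of this idea, the following query compares the 3-gram
+sets of two strings with the Jaccard function. It assumes a
+`gram_tokens(string, n, flag)` tokenization function, where the last
+argument controls whether the string is padded before the grams are
+extracted; this signature and the threshold `0.5f` are assumptions made
+for illustration and should be checked against the builtin function
+documentation.
+
+        use TinySocial;
+
+        select u
+        from GleambookUsers u
+        where similarity_jaccard(gram_tokens(u.name, 3, false),
+                                 gram_tokens("Suzanna Tilson", 3, false)) >= 0.5f;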
+
+AsterixDB provides
+[tokenization functions](../sqlpp/builtins.html#Tokenizing_Functions)
+to convert strings to sets, as well as
+[similarity functions](../sqlpp/builtins.html#Similarity_Functions).
+
+## <a id="SimilaritySelectionQueries">Similarity Selection Queries</a> <font size="4"><a href="#toc">[Back to TOC]</a></font> ##
+
+The following query
+asks for all the Gleambook users whose name is similar to
+`Suzanna Tilson`, i.e., their edit distance is at most 2.
+
+        use TinySocial;
+
+        select u
+        from GleambookUsers u
+        where edit_distance(u.name, "Suzanna Tilson") <= 2;
+
+The following query
+asks for all the Gleambook users whose set of friend ids is
+similar to `[1,5,9,10]`, i.e., their Jaccard similarity is at least 0.6.
+
+        use TinySocial;
+
+        select u
+        from GleambookUsers u
+        where similarity_jaccard(u.friendIds, [1,5,9,10]) >= 0.6f;
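+
+As a worked example of this predicate, consider a hypothetical user
+whose `friendIds` are `[1,5,9]`. The two sets share three elements and
+their union has four, so the Jaccard similarity is
+
+        |{1,5,9} ∩ {1,5,9,10}| / |{1,5,9} ∪ {1,5,9,10}| = 3 / 4 = 0.75
+
+which is at least 0.6, so this user would be returned. A user whose
+`friendIds` are `[1,5]` would not be returned, since 2 / 4 = 0.5.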
+
+AsterixDB allows a user to use a similarity operator `~=` to express a
+condition by defining the similarity function and threshold
+using "set" statements earlier. For instance, the above query can be
+equivalently written as:
+
+        use TinySocial;
+
+        set simfunction "jaccard";
+        set simthreshold "0.6f";
+
+        select u
+        from GleambookUsers u
+        where u.friendIds ~= [1,5,9,10];
+
+In this query, we first declare Jaccard as the similarity function
+using `simfunction` and then specify the threshold `0.6f` using
+`simthreshold`.
+
+## <a id="SimilarityJoinQueries">Similarity Join Queries</a> <font size="4"><a href="#toc">[Back to TOC]</a></font> ##
+
+AsterixDB supports fuzzy joins between two sets. The following
+[query](../sqlpp/primer-sqlpp.html#Query_5_-_Fuzzy_Join)
+finds, for each Gleambook user, all Chirp users with names
+similar to their name based on the edit distance.
+
+        use TinySocial;
+
+        set simfunction "edit-distance";
+        set simthreshold "3";
+
+        select gbu.id, gbu.name, (select cu.screenName, cu.name
+                                  from ChirpUsers cu
+                                  where cu.name ~= gbu.name) as similar_users
+        from GleambookUsers gbu;
+
+## <a id="UsingIndexesToSupportSimilarityQueries">Using Indexes to Support Similarity Queries</a> <font size="4"><a href="#toc">[Back to TOC]</a></font> ##
+
+AsterixDB uses two types of indexes to support similarity queries, namely
+"ngram index" and "keyword index".
+
+### NGram Index ###
+
+An "ngram index" is constructed on a set of strings.  We generate n-grams for each string, and build an inverted
+list for each n-gram that includes the ids of the strings with this
+gram.  A similarity query can be answered efficiently by accessing the
+inverted lists of the grams in the query and counting the number of
+occurrences of the string ids on these inverted lists.  A similar
+idea can be used to answer queries with Jaccard similarity.  A
+detailed description of these techniques is available in this
+[paper](http://www.ics.uci.edu/~chenli/pub/icde2009-memreducer.pdf).
+
+For instance, the following DDL statements create an ngram index on the
+`GleambookUsers.name` attribute using an inverted index of 3-grams.
+
+        use TinySocial;
+
+        create index gbUserIdx on GleambookUsers(name) type ngram(3);
+
+The number "3" in "ngram(3)" is the length "n" in the grams. This
+index can be used to optimize similarity queries on this attribute
+using
+[edit_distance](../sqlpp/builtins.html#edit_distance),
+[edit_distance_check](../sqlpp/builtins.html#edit_distance_check),
+[similarity_jaccard](../sqlpp/builtins.html#similarity_jaccard),
+or [similarity_jaccard_check](../sqlpp/builtins.html#similarity_jaccard_check),
+where the
+similarity is defined on sets of 3-grams.  This index can also be used
+to optimize queries with the "[contains()](../sqlpp/builtins.html#contains)" predicate (i.e., substring
+matching) since it can also be solved by counting on the inverted
+lists of the grams in the query string.
+
+#### NGram Index usage case - [edit_distance](../sqlpp/builtins.html#edit_distance) ####
+
+        use TinySocial;
+
+        select u
+        from GleambookUsers u
+        where edit_distance(u.name, "Suzanna Tilson") <= 2;
+
+#### NGram Index usage case - [edit_distance_check](../sqlpp/builtins.html#edit_distance_check) ####
+
+        use TinySocial;
+
+        select u
+        from GleambookUsers u
+        where edit_distance_check(u.name, "Suzanna Tilson", 2)[0];
+
+#### NGram Index usage case - [contains()](../sqlpp/builtins.html#contains) ####
+
+        use TinySocial;
+
+        select m
+        from GleambookMessages m
+        where contains(m.message, "phone");
+
+
+### Keyword Index ###
+
+A "keyword index" is constructed on a set of strings or sets (e.g., array, multiset). Instead of
+generating grams as in an ngram index, we generate tokens (e.g., words) and for each token, construct an inverted list that includes the ids of the
+objects with this token.  The following two examples show how to create a keyword index on two different types:
+
+
+#### Keyword Index on String Type ####
+
+        use TinySocial;
+
+        drop index GleambookMessages.gbMessageIdx if exists;
+        create index gbMessageIdx on GleambookMessages(message) type keyword;
+
+        select m
+        from GleambookMessages m
+        where similarity_jaccard_check(word_tokens(m.message), word_tokens("love like ccast"), 0.2f)[0];
+
+#### Keyword Index on Multiset Type ####
+
+        use TinySocial;
+
+        create index gbUserIdxFIds on GleambookUsers(friendIds) type keyword;
+
+        select u
+        from GleambookUsers u
+        where similarity_jaccard_check(u.friendIds, {{3,10}}, 0.5f)[0];
+
+As shown above, a keyword index can be used to optimize queries with token-based similarity predicates, including
+[similarity_jaccard](../sqlpp/builtins.html#similarity_jaccard) and
+[similarity_jaccard_check](../sqlpp/builtins.html#similarity_jaccard_check).
+
+#### Keyword Index usage case - [similarity_jaccard](../sqlpp/builtins.html#similarity_jaccard) ####
+
+        use TinySocial;
+
+        select u
+        from GleambookUsers u
+        where similarity_jaccard(u.friendIds, [1,5,9,10]) >= 0.6f;
+
+#### Keyword Index usage case - [similarity_jaccard_check](../sqlpp/builtins.html#similarity_jaccard_check) ####
+
+        use TinySocial;
+
+        select u
+        from GleambookUsers u
+        where similarity_jaccard_check(u.friendIds, [1,5,9,10], 0.6f)[0];
+

http://git-wip-us.apache.org/repos/asf/asterixdb/blob/8bbf0813/asterixdb/asterix-doc/src/site/site.xml
----------------------------------------------------------------------
diff --git a/asterixdb/asterix-doc/src/site/site.xml b/asterixdb/asterix-doc/src/site/site.xml
index 99947ad..1167c37 100644
--- a/asterixdb/asterix-doc/src/site/site.xml
+++ b/asterixdb/asterix-doc/src/site/site.xml
@@ -71,36 +71,36 @@
     </menu>
 
     <menu name = "AsterixDB Primer">
-      <item name="Option 1: using SQL++" href="sqlpp/primer-sqlpp.html"/>
-      <item name="Option 2: using AQL" href="aql/primer.html"/>
+      <item name="Using SQL++" href="sqlpp/primer-sqlpp.html"/>
     </menu>
 
     <menu name="Data Model">
       <item name="The Asterix Data Model" href="datamodel.html"/>
     </menu>
 
-    <menu name="Queries - SQL++">
+    <menu name="Queries">
       <item name="The SQL++ Query Language" href="sqlpp/manual.html"/>
       <item name="Builtin Functions" href="sqlpp/builtins.html"/>
     </menu>
 
-    <menu name="Queries - AQL">
-      <item name="The Asterix Query Language (AQL)" href="aql/manual.html"/>
-      <item name="Builtin Functions" href="aql/builtins.html"/>
-    </menu>
-
     <menu name="API/SDK">
       <item name="HTTP API" href="api.html"/>
       <item name="CSV Output" href="csv.html"/>
     </menu>
 
     <menu name="Advanced Features">
-      <item name="Support of Full-text Queries" href="aql/fulltext.html"/>
       <item name="Accessing External Data" href="aql/externaldata.html"/>
       <item name="Support for Data Ingestion" href="feeds/tutorial.html"/>
       <item name="User Defined Functions" href="udf.html"/>
-      <item name="Filter-Based LSM Index Acceleration" href="aql/filters.html"/>
-      <item name="Support of Similarity Queries" href="aql/similarity.html"/>
+      <item name="Filter-Based LSM Index Acceleration" href="sqlpp/filters.html"/>
+      <item name="Support of Full-text Queries" href="sqlpp/fulltext.html"/>
+      <item name="Support of Similarity Queries" href="sqlpp/similarity.html"/>
+    </menu>
+
+    <menu name="Deprecated">
+      <item name="AsterixDB Primer: Using AQL" href="aql/primer.html"/>
+      <item name="Queries: The Asterix Query Language (AQL)" href="aql/manual.html"/>
+      <item name="Queries: Builtin Functions (AQL)" href="aql/builtins.html"/>
     </menu>
 
     <menu ref="reports"/>
