Repository: asterixdb
Updated Branches:
  refs/heads/master 557193477 -> c1f39fe12


[ASTERIXDB-2349][SITE] Revise fulltext and similarity documentation

- user model changes: no
- storage format changes: no
- interface changes: no

Details: Update all examples in the fulltext and similarity documentation
using SQLPP.

Change-Id: Icd9c5bc6249feb03b4297bdc84b5f3aa0efcdc47
Reviewed-on: https://asterix-gerrit.ics.uci.edu/2567
Sonar-Qube: Jenkins <jenk...@fulliautomatix.ics.uci.edu>
Tested-by: Jenkins <jenk...@fulliautomatix.ics.uci.edu>
Contrib: Jenkins <jenk...@fulliautomatix.ics.uci.edu>
Integration-Tests: Jenkins <jenk...@fulliautomatix.ics.uci.edu>
Reviewed-by: Ian Maxon <ima...@apache.org>


Project: http://git-wip-us.apache.org/repos/asf/asterixdb/repo
Commit: http://git-wip-us.apache.org/repos/asf/asterixdb/commit/c1f39fe1
Tree: http://git-wip-us.apache.org/repos/asf/asterixdb/tree/c1f39fe1
Diff: http://git-wip-us.apache.org/repos/asf/asterixdb/diff/c1f39fe1

Branch: refs/heads/master
Commit: c1f39fe12de476a877e127c06993706feb1030f9
Parents: 5571934
Author: Taewoo Kim <wangs...@yahoo.com>
Authored: Wed Apr 4 10:58:04 2018 -0700
Committer: Taewoo Kim <wangs...@gmail.com>
Committed: Wed Apr 11 13:01:18 2018 -0700

----------------------------------------------------------------------
 .../src/site/markdown/aql/fulltext.md           |  61 +++----
 .../src/site/markdown/aql/similarity.md         | 175 +++++++++----------
 2 files changed, 104 insertions(+), 132 deletions(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/asterixdb/blob/c1f39fe1/asterixdb/asterix-doc/src/site/markdown/aql/fulltext.md
----------------------------------------------------------------------
diff --git a/asterixdb/asterix-doc/src/site/markdown/aql/fulltext.md b/asterixdb/asterix-doc/src/site/markdown/aql/fulltext.md
index bc0b398..1328ed9 100644
--- a/asterixdb/asterix-doc/src/site/markdown/aql/fulltext.md
+++ b/asterixdb/asterix-doc/src/site/markdown/aql/fulltext.md
@@ -44,73 +44,61 @@ Two basic forms are as follows:
         ftcontains(Expression1, Expression2, {FullTextOption})
         ftcontains(Expression1, Expression2)
 
-For example, we can execute the following query to find tweet messages where the `message-text` field includes
+For example, we can execute the following query to find Chirp messages where the `messageText` field includes
 “voice” as a word. Please note that an FTS search is case-insensitive.
 Thus, "Voice" or "voice" will be evaluated as the same word.
 
-        use dataverse TinySocial;
-
-        for $msg in dataset TweetMessages
-        where ftcontains($msg.message-text, "voice", {"mode":"any"})
-        return {"id": $msg.id}
-
-The DDL and DML of TinySocial can be found in [ADM: Modeling Semistructed Data in AsterixDB](primer.html#ADM:_Modeling_Semistructed_Data_in_AsterixDB).
-
-The same query can be also expressed in the SQL++.
-
         use TinySocial;
 
-        select element {"id":msg.id}
-        from TweetMessages as msg
-        where TinySocial.ftcontains(msg.`message-text`, "voice", {"mode":"any"})
+        select element {"chirpId": msg.chirpId}
+        from ChirpMessages msg
+        where ftcontains(msg.messageText, "voice", {"mode":"any"});
+
+The DDL and DML of TinySocial can be found in [ADM: Modeling Semistructed Data in AsterixDB](../sqlpp/primer-sqlpp.html#ADM:_Modeling_Semistructed_Data_in_AsterixDB).
 
 The `Expression1` is an expression that should be evaluable as a string at runtime as in the above example
-where `$msg.message-text` is a string field. The `Expression2` can be a string, an (un)ordered list
+where `msg.messageText` is a string field. The `Expression2` can be a string, an (un)ordered list
 of string value(s), or an expression. In the last case, the given expression should be evaluable
 into one of the first two types, i.e., into a string value or an (un)ordered list of string value(s).
 
 The following examples are all valid expressions.
 
-       ... where ftcontains($msg.message-text, "sound")
-       ... where ftcontains($msg.message-text, "sound", {"mode":"any"})
-       ... where ftcontains($msg.message-text, ["sound", "system"], {"mode":"any"})
-       ... where ftcontains($msg.message-text, {{"speed", "stand", "customization"}}, {"mode":"all"})
-       ... where ftcontains($msg.message-text, let $keyword_list := ["voice", "system"] return $keyword_list, {"mode":"all"})
-       ... where ftcontains($msg.message-text, $keyword_list, {"mode":"any"})
-
-In the last example above, `$keyword_list` should evaluate to a string or an (un)ordered list of string value(s).
+       ... where ftcontains(msg.messageText, "sound")
+       ... where ftcontains(msg.messageText, "sound", {"mode":"any"})
+       ... where ftcontains(msg.messageText, ["sound", "system"], {"mode":"any"})
+       ... where ftcontains(msg.messageText, {{"speed", "stand", "customization"}}, {"mode":"all"})
 
 The last `FullTextOption` parameter clarifies the given FTS request. If you omit the `FullTextOption` parameter,
 then the default value will be set for each possible option. Currently, we only have one option named `mode`.
 And as we extend the FTS feature, more options will be added. Please note that the format of `FullTextOption`
 is a record, thus you need to put the option(s) in a record `{}`.
 The `mode` option indicates whether the given FTS query is a conjunctive (AND) or disjunctive (OR) search request.
-This option can be either `“any”` or `“all”`. The default value for `mode` is `“all”`. If one specifies `“any”`,
-a disjunctive search will be conducted. For example, the following query will find documents whose `message-text`
+This option can be either `“all”` (AND) or `“any”` (OR). The default value for `mode` is `“all”`. If one specifies `“any”`,
+a disjunctive search will be conducted. For example, the following query will find documents whose `messageText`
 field contains “sound” or “system”, so a document will be returned if it contains either “sound”, “system”,
 or both of the keywords.
 
-       ... where ftcontains($msg.message-text, ["sound", "system"], {"mode":"any"})
+       ... where ftcontains(msg.messageText, ["sound", "system"], {"mode":"any"})
 
 The other option parameter,`“all”`, specifies a conjunctive search. The following examples will find the documents whose
-`message-text` field contains both “sound” and “system”. If a document contains only “sound” or “system” but
+`messageText` field contains both “sound” and “system”. If a document contains only “sound” or “system” but
 not both, it will not be returned.
 
-       ... where ftcontains($msg.message-text, ["sound", "system"], {"mode":"all"})
-       ... where ftcontains($msg.message-text, ["sound", "system"])
+       ... where ftcontains(msg.messageText, ["sound", "system"], {"mode":"all"})
+       ... where ftcontains(msg.messageText, ["sound", "system"])
 
 Currently AsterixDB doesn’t (yet) support phrase searches, so the following query will not work.
 
-       ... where ftcontains($msg.message-text, "sound system", {"mode":"any"})
+       ... where ftcontains(msg.messageText, "sound system", {"mode":"any"})
 
 As a workaround solution, the following query can be used to achieve a roughly similar goal. The difference is that
-the following queries will find documents where `$msg.message-text` contains both “sound” and “system”, but the order
+the following queries will find documents where `msg.messageText` contains both “sound” and “system”, but the order
 and adjacency of “sound” and “system” are not checked, unlike in a phrase search. As a result, the query below would
 also return documents with “sound system can be installed.”, “system sound is perfect.”,
 or “sound is not clear. You may need to install a new system.”
 
-       ... where ftcontains($msg.message-text, ["sound", "system"], {"mode":"all"})
-       ... where ftcontains($msg.message-text, ["sound", "system"])
+       ... where ftcontains(msg.messageText, ["sound", "system"], {"mode":"all"})
+       ... where ftcontains(msg.messageText, ["sound", "system"])
 
 
 ## <a id="FulltextIndex">Creating and utilizing a Full-text index</a> <font size="4"><a href="#toc">[Back to TOC]</a></font> ##
@@ -118,6 +106,9 @@ or “sound is not clear. You may need to install a new system.”
 When there is a full-text index on the field that is being searched, rather than scanning all records,
 AsterixDB can utilize that index to expedite the execution of a FTS query. To create a full-text index,
 you need to specify the index type as `fulltext` in your DDL statement. For instance, the following DDL
-statement create a full-text index on the TweetMessages.message-text attribute.
+statement create a full-text index on the `GleambookMessages.message` attribute. Note that a full-text index
+cannot be built on a dataset with the variable-length primary key (e.g., string).
+
+    use TinySocial;
 
-    create index messageFTSIdx on TweetMessages(message-text) type fulltext;
+    create index messageFTSIdx on GleambookMessages(message) type fulltext;

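Note on the `mode` semantics described in the fulltext.md changes above: the difference between `"any"` (OR) and `"all"` (AND) can be sketched outside the database. The Python function below is a hypothetical stand-in, not AsterixDB's implementation; the name `ftcontains_sketch` and the simple lowercase word tokenizer are assumptions made only for illustration, and the real tokenizer may split text differently.

```python
import re


def ftcontains_sketch(text, keywords, mode="all"):
    """Hypothetical model of the any/all matching described above.

    Assumption: words are found with a simple regex and compared in
    lowercase, mirroring the statement that the search is case-insensitive.
    """
    # A single string is treated like a one-element keyword list.
    if isinstance(keywords, str):
        keywords = [keywords]
    tokens = set(re.findall(r"\w+", text.lower()))
    wanted = [k.lower() for k in keywords]
    if mode == "any":   # disjunctive (OR) search
        return any(k in tokens for k in wanted)
    if mode == "all":   # conjunctive (AND) search, the documented default
        return all(k in tokens for k in wanted)
    raise ValueError("mode must be 'any' or 'all'")


# Order and adjacency of the keywords are not checked in "all" mode.
print(ftcontains_sketch("system sound is perfect.", ["sound", "system"]))
print(ftcontains_sketch("sound only", ["sound", "system"], mode="any"))
```

Under these assumptions, a text containing only one of the two keywords matches in `"any"` mode but not in the default `"all"` mode.
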
http://git-wip-us.apache.org/repos/asf/asterixdb/blob/c1f39fe1/asterixdb/asterix-doc/src/site/markdown/aql/similarity.md
----------------------------------------------------------------------
diff --git a/asterixdb/asterix-doc/src/site/markdown/aql/similarity.md b/asterixdb/asterix-doc/src/site/markdown/aql/similarity.md
index 0d949db..8118126 100644
--- a/asterixdb/asterix-doc/src/site/markdown/aql/similarity.md
+++ b/asterixdb/asterix-doc/src/site/markdown/aql/similarity.md
@@ -45,8 +45,8 @@ supports similarity queries using efficient indexes and algorithms.
 AsterixDB supports [edit distance](http://en.wikipedia.org/wiki/Levenshtein_distance) (on strings) and
 [Jaccard](http://en.wikipedia.org/wiki/Jaccard_index) (on sets).  For
 instance, in our
-[TinySocial](primer.html#ADM:_Modeling_Semistructed_Data_in_AsterixDB)
-example, the `friend-ids` of a Facebook user forms a set
+[TinySocial](../sqlpp/primer-sqlpp.html#ADM:_Modeling_Semistructed_Data_in_AsterixDB)
+example, the `friendIds` of a Gleambook user forms a set
 of friends, and we can define a similarity between the sets of
 friends of two users. We can also convert a string to a set of grams of a length "n"
 (called "n-grams") and define the Jaccard similarity between the two
@@ -55,50 +55,45 @@ its substrings of length "n". For instance, the 3-grams of the string
 `schwarzenegger` are `sch`, `chw`, `hwa`, ..., `ger`.
 
 AsterixDB provides
-[tokenization functions](functions.html#Tokenizing_Functions)
+[tokenization functions](../sqlpp/builtins.html#Tokenizing_Functions)
 to convert strings to sets, and the
-[similarity functions](functions.html#Similarity_Functions).
+[similarity functions](../sqlpp/builtins.html#Similarity_Functions).
 
 ## <a id="SimilaritySelectionQueries">Similarity Selection Queries</a> <font size="4"><a href="#toc">[Back to TOC]</a></font> ##
 
 The following query
-asks for all the Facebook users whose name is similar to
+asks for all the Gleambook users whose name is similar to
 `Suzanna Tilson`, i.e., their edit distance is at most 2.
 
-        use dataverse TinySocial;
-
-        for $user in dataset('FacebookUsers')
-        let $ed := edit-distance($user.name, "Suzanna Tilson")
-        where $ed <= 2
-        return $user
+        use TinySocial;
 
+        select u
+        from GleambookUsers u
+        where edit_distance(u.name, "Suzanna Tilson") <= 2;
 
 The following query
-asks for all the Facebook users whose set of friend ids is
+asks for all the Gleambook users whose set of friend ids is
 similar to `[1,5,9,10]`, i.e., their Jaccard similarity is at least 0.6.
 
-        use dataverse TinySocial;
-
-        for $user in dataset('FacebookUsers')
-        let $sim := similarity-jaccard($user.friend-ids, [1,5,9,10])
-        where $sim >= 0.6f
-        return $user
+        use TinySocial;
 
+        select u
+        from GleambookUsers u
+        where similarity_jaccard(u.friendIds, [1,5,9,10]) >= 0.6f;
 
 AsterixDB allows a user to use a similarity operator `~=` to express a
 condition by defining the similarity function and threshold
 using "set" statements earlier. For instance, the above query can be
 equivalently written as:
 
-        use dataverse TinySocial;
+        use TinySocial;
 
         set simfunction "jaccard";
         set simthreshold "0.6f";
 
-        for $user in dataset('FacebookUsers')
-        where $user.friend-ids ~= [1,5,9,10]
-        return $user
-
+        select u
+        from GleambookUsers u
+        where u.friendIds ~= [1,5,9,10];
 
 In this query, we first declare Jaccard as the similarity function
 using `simfunction` and then specify the threshold `0.6f` using
@@ -107,27 +102,19 @@ using `simfunction` and then specify the threshold `0.6f` using
 ## <a id="SimilarityJoinQueries">Similarity Join Queries</a> <font size="4"><a href="#toc">[Back to TOC]</a></font> ##
 
 AsterixDB supports fuzzy joins between two sets. The following
-[query](primer.html#Query_5_-_Fuzzy_Join)
-finds, for each Facebook user, all Twitter users with names
+[query](../sqlpp/primer-sqlpp.html#Query_5_-_Fuzzy_Join)
+finds, for each Gleambook user, all Chirp users with names
 similar to their name based on the edit distance.
 
-        use dataverse TinySocial;
+        use TinySocial;
 
         set simfunction "edit-distance";
         set simthreshold "3";
 
-        for $fbu in dataset FacebookUsers
-        return {
-            "id": $fbu.id,
-            "name": $fbu.name,
-            "similar-users": for $t in dataset TweetMessages
-                                let $tu := $t.user
-                                where $tu.name ~= $fbu.name
-                                return {
-                                "twitter-screenname": $tu.screen-name,
-                                "twitter-name": $tu.name
-                                }
-        };
+        select gbu.id, gbu.name, (select cu.screenName, cu.name
+                                  from ChirpUsers cu
+                                  where cu.name ~= gbu.name) as similar_users
+        from GleambookUsers gbu;
 
 ## <a id="UsingIndexesToSupportSimilarityQueries">Using Indexes to Support Similarity Queries</a> <font size="4"><a href="#toc">[Back to TOC]</a></font> ##
 
@@ -146,101 +133,95 @@ detailed description of these techniques is available at this
 [paper](http://www.ics.uci.edu/~chenli/pub/icde2009-memreducer.pdf).
 
 For instance, the following DDL statements create an ngram index on the
-`FacebookUsers.name` attribute using an inverted index of 3-grams.
+`GleambookUsers.name` attribute using an inverted index of 3-grams.
 
-        use dataverse TinySocial;
+        use TinySocial;
 
-        create index fbUserIdx on FacebookUsers(name) type ngram(3);
+        create index gbUserIdx on GleambookUsers(name) type ngram(3);
 
 The number "3" in "ngram(3)" is the length "n" in the grams. This
 index can be used to optimize similarity queries on this attribute
 using
-[edit-distance](functions.html#edit-distance),
-[edit-distance-check](functions.html#edit-distance-check),
-[similarity-jaccard](functions.html#similarity-jaccard),
-or [similarity-jaccard-check](functions.html#similarity-jaccard-check)
+[edit_distance](../sqlpp/builtins.html#edit_distance),
+[edit_distance_check](../sqlpp/builtins.html#edit_distance_check),
+[similarity_jaccard](../sqlpp/builtins.html#similarity_jaccard),
+or [similarity_jaccard_check](../sqlpp/builtins.html#similarity_jaccard_check)
 queries on this attribute where the
 similarity is defined on sets of 3-grams.  This index can also be used
-to optimize queries with the "[contains()]((functions.html#contains))" predicate (i.e., substring
+to optimize queries with the "[contains()]((../sqlpp/builtins.html#contains))" predicate (i.e., substring
 matching) since it can be also be solved by counting on the inverted
 lists of the grams in the query string.
 
-#### NGram Index usage case - [edit-distance](functions.html#edit-distance) ####
+#### NGram Index usage case - [edit_distance](../sqlpp/builtins.html#edit-distance) ####
 
-        use dataverse TinySocial;
+        use TinySocial;
 
-        for $user in dataset('FacebookUsers')
-        let $ed := edit-distance($user.name, "Suzanna Tilson")
-        where $ed <= 2
-        return $user
+        select u
+        from GleambookUsers u
+        where edit_distance(u.name, "Suzanna Tilson") <= 2;
 
-#### NGram Index usage case - [edit-distance-check](functions.html#edit-distance-check) ####
+#### NGram Index usage case - [edit_distance_check](../sqlpp/builtins.html#edit_distance_check) ####
 
-        use dataverse TinySocial;
+        use TinySocial;
 
-        for $user in dataset('FacebookUsers')
-        let $ed := edit-distance-check($user.name, "Suzanna Tilson", 2)
-        where $ed[0]
-        return $ed[1]
+        select u
+        from GleambookUsers u
+        where edit_distance_check(u.name, "Suzanna Tilson", 2)[0];
 
-#### NGram Index usage case - [similarity-jaccard](functions.html#similarity-jaccard) ####
+#### NGram Index usage case - [contains()]((../sqlpp/builtins.html#contains)) ####
 
-        use dataverse TinySocial;
+        use TinySocial;
 
-        for $user in dataset('FacebookUsers')
-        let $sim := similarity-jaccard($user.friend-ids, [1,5,9,10])
-        where $sim >= 0.6f
-        return $user
+        select m
+        from GleambookMessages m
+        where contains(m.message, "phone");
 
-#### NGram Index usage case - [similarity-jaccard-check](functions.html#similarity-jaccard-check) ####
 
-        use dataverse TinySocial;
+### Keyword Index ###
 
-        for $user in dataset('FacebookUsers')
-        let $sim := similarity-jaccard-check($user.friend-ids, [1,5,9,10], 0.6f)
-        where $sim[0]
-        return $user
+A "keyword index" is constructed on a set of strings or sets (e.g., array, multiset). Instead of
+generating grams as in an ngram index, we generate tokens (e.g., words) and for each token, construct an inverted list that includes the ids of the
+objects with this token.  The following two examples show how to create keyword index on two different types:
 
-#### NGram Index usage case - [contains()]((functions.html#contains)) ####
 
-        use dataverse TinySocial;
+#### Keyword Index on String Type ####
 
-        for $i in dataset('FacebookMessages')
-        where contains($i.message, "phone")
-        return {"mid": $i.message-id, "message": $i.message}
+        use TinySocial;
 
+        drop index GleambookMessages.gbMessageIdx if exists;
+        create index gbMessageIdx on GleambookMessages(message) type keyword;
 
-### Keyword Index ###
+        select m
+        from GleambookMessages m
+        where similarity_jaccard_check(word_tokens(m.message), word_tokens("love like ccast"), 0.2f)[0];
 
-A "keyword index" is constructed on a set of strings or sets (e.g., OrderedList, UnorderedList). Instead of
-generating grams as in an ngram index, we generate tokens (e.g., words) and for each token, construct an inverted list that includes the ids of the
-objects with this token.  The following two examples show how to create keyword index on two different types:
+#### Keyword Index on Multiset Type ####
 
+        use TinySocial;
 
-#### Keyword Index on String Type ####
+        create index gbUserIdxFIds on GleambookUsers(friendIds) type keyword;
 
-        use dataverse TinySocial;
+        select u
+        from GleambookUsers u
+        where similarity_jaccard_check(u.friendIds, {{3,10}}, 0.5f)[0];
 
-        drop index FacebookMessages.fbMessageIdx if exists;
-        create index fbMessageIdx on FacebookMessages(message) type keyword;
+As shown above, keyword index can be used to optimize queries with token-based similarity predicates, including
+[similarity_jaccard](../sqlpp/builtins.html#similarity_jaccard) and
+[similarity_jaccard_check](../sqlpp/builtins.html#similarity_jaccard_check).
 
-        for $o in dataset('FacebookMessages')
-        let $jacc := similarity-jaccard-check(word-tokens($o.message), word-tokens("love like ccast"), 0.2f)
-        where $jacc[0]
-        return $o
+#### Keyword Index usage case - [similarity_jaccard](../sqlpp/builtins.html#similarity_jaccard) ####
 
-#### Keyword Index on UnorderedList Type ####
+        use TinySocial;
 
-        use dataverse TinySocial;
+        select u
+        from GleambookUsers u
+        where similarity_jaccard(u.friendIds, [1,5,9,10]) >= 0.6f;
 
-        create index fbUserIdx_fids on FacebookUsers(friend-ids) type keyword;
+#### Keyword Index usage case - [similarity_jaccard_check](../sqlpp/builtins.html#similarity_jaccard_check) ####
 
-        for $c in dataset('FacebookUsers')
-        let $jacc := similarity-jaccard-check($c.friend-ids, {{3,10}}, 0.5f)
-        where $jacc[0]
-        return $c
+        use TinySocial;
 
-As shown above, keyword index can be used to optimize queries with token-based similarity predicates, including
-[similarity-jaccard](functions.html#similarity-jaccard) and
-[similarity-jaccard-check](functions.html#similarity-jaccard-check).
+        select u
+        from GleambookUsers u
+        where similarity_jaccard_check(u.friendIds, [1,5,9,10], 0.6f)[0];
 

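Note on the similarity measures used throughout the similarity.md changes above: the following Python sketch illustrates what n-grams, Jaccard similarity, and edit distance compute. It is only a reference model under stated assumptions, not AsterixDB code; the helper names `ngrams`, `jaccard`, and `edit_distance` are hypothetical, and AsterixDB's own tokenization functions may differ in details such as padding or case handling.

```python
def ngrams(s, n):
    """Return the substrings of length n of s, in order (no padding assumed)."""
    return [s[i:i + n] for i in range(len(s) - n + 1)]


def jaccard(a, b):
    """Jaccard similarity |A ∩ B| / |A ∪ B| of two collections, treated as sets."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0  # arbitrary choice for this sketch when both sets are empty
    return len(a & b) / len(a | b)


def edit_distance(s, t):
    """Levenshtein distance between s and t via row-by-row dynamic programming."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        cur = [i]
        for j, ct in enumerate(t, 1):
            cur.append(min(prev[j] + 1,                # delete cs
                           cur[j - 1] + 1,             # insert ct
                           prev[j - 1] + (cs != ct)))  # substitute or match
        prev = cur
    return prev[-1]


grams = ngrams("schwarzenegger", 3)
print(grams[:3], grams[-1])                               # sch, chw, hwa ... ger
print(jaccard([1, 5, 9, 10], [1, 5, 9]))                  # 3 shared of 4 distinct ids
print(edit_distance("Suzanna Tilson", "Suzanna Tillson"))  # one inserted letter
```

In this model, a user whose friend ids are `[1, 5, 9]` has Jaccard similarity 0.75 with `[1,5,9,10]` and would pass the `>= 0.6f` threshold in the queries above, while a name that differs from `Suzanna Tilson` by one inserted letter is within the edit distance bound of 2.
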