Change in asterixdb[master]: [ASTERIXDB-2349][SITE] Revise fulltext and similarity docume...

Taewoo Kim (Code Review) Wed, 04 Apr 2018 10:55:59 -0700

Taewoo Kim has uploaded a new change for review.

  https://asterix-gerrit.ics.uci.edu/2567


Change subject: [ASTERIXDB-2349][SITE] Revise fulltext and similarity 
documentation
......................................................................

[ASTERIXDB-2349][SITE] Revise fulltext and similarity documentation

- user model changes: no
- storage format changes: no
- interface changes: no

Details: Update all examples in the fulltext and similarity documentation
using SQLPP.

Change-Id: Icd9c5bc6249feb03b4297bdc84b5f3aa0efcdc47
---
M asterixdb/asterix-doc/src/site/markdown/aql/fulltext.md
M asterixdb/asterix-doc/src/site/markdown/aql/similarity.md
2 files changed, 104 insertions(+), 132 deletions(-)


  git pull ssh://asterix-gerrit.ics.uci.edu:29418/asterixdb 
refs/changes/67/2567/1

diff --git a/asterixdb/asterix-doc/src/site/markdown/aql/fulltext.md 
b/asterixdb/asterix-doc/src/site/markdown/aql/fulltext.md
index bc0b398..1328ed9 100644
--- a/asterixdb/asterix-doc/src/site/markdown/aql/fulltext.md
+++ b/asterixdb/asterix-doc/src/site/markdown/aql/fulltext.md
@@ -44,73 +44,61 @@
         ftcontains(Expression1, Expression2, {FullTextOption})
         ftcontains(Expression1, Expression2)
 
-For example, we can execute the following query to find tweet messages where 
the `message-text` field includes
+For example, we can execute the following query to find Chirp messages where 
the `messageText` field includes
 “voice” as a word. Please note that an FTS search is case-insensitive.
 Thus, "Voice" or "voice" will be evaluated as the same word.
 
-        use dataverse TinySocial;
-
-        for $msg in dataset TweetMessages
-        where ftcontains($msg.message-text, "voice", {"mode":"any"})
-        return {"id": $msg.id}
-
-The DDL and DML of TinySocial can be found in [ADM: Modeling Semistructed Data 
in AsterixDB](primer.html#ADM:_Modeling_Semistructed_Data_in_AsterixDB).
-
-The same query can be also expressed in the SQL++.
-
         use TinySocial;
 
-        select element {"id":msg.id}
-        from TweetMessages as msg
-        where TinySocial.ftcontains(msg.`message-text`, "voice", 
{"mode":"any"})
+        select element {"chirpId": msg.chirpId}
+        from ChirpMessages msg
+        where ftcontains(msg.messageText, "voice", {"mode":"any"});
+
+The DDL and DML of TinySocial can be found in [ADM: Modeling Semistructed Data 
in 
AsterixDB](../sqlpp/primer-sqlpp.html#ADM:_Modeling_Semistructed_Data_in_AsterixDB).
 
 The `Expression1` is an expression that should be evaluable as a string at 
runtime as in the above example
-where `$msg.message-text` is a string field. The `Expression2` can be a 
string, an (un)ordered list
+where `msg.messageText` is a string field. The `Expression2` can be a string, 
an (un)ordered list
 of string value(s), or an expression. In the last case, the given expression 
should be evaluable
 into one of the first two types, i.e., into a string value or an (un)ordered 
list of string value(s).
 
 The following examples are all valid expressions.
 
-       ... where ftcontains($msg.message-text, "sound")
-       ... where ftcontains($msg.message-text, "sound", {"mode":"any"})
-       ... where ftcontains($msg.message-text, ["sound", "system"], 
{"mode":"any"})
-       ... where ftcontains($msg.message-text, {{"speed", "stand", 
"customization"}}, {"mode":"all"})
-       ... where ftcontains($msg.message-text, let $keyword_list := ["voice", 
"system"] return $keyword_list, {"mode":"all"})
-       ... where ftcontains($msg.message-text, $keyword_list, {"mode":"any"})
-
-In the last example above, `$keyword_list` should evaluate to a string or an 
(un)ordered list of string value(s).
+       ... where ftcontains(msg.messageText, "sound")
+       ... where ftcontains(msg.messageText, "sound", {"mode":"any"})
+       ... where ftcontains(msg.messageText, ["sound", "system"], 
{"mode":"any"})
+       ... where ftcontains(msg.messageText, {{"speed", "stand", 
"customization"}}, {"mode":"all"})
 
 The last `FullTextOption` parameter clarifies the given FTS request. If you 
omit the `FullTextOption` parameter,
 then the default value will be set for each possible option. Currently, we 
only have one option named `mode`.
 And as we extend the FTS feature, more options will be added. Please note that 
the format of `FullTextOption`
 is a record, thus you need to put the option(s) in a record `{}`.
 The `mode` option indicates whether the given FTS query is a conjunctive (AND) 
or disjunctive (OR) search request.
-This option can be either `“any”` or `“all”`. The default value for `mode` is 
`“all”`. If one specifies `“any”`,
-a disjunctive search will be conducted. For example, the following query will 
find documents whose `message-text`
+This option can be either `“all”` (AND) or `“any”` (OR). The default value for 
`mode` is `“all”`. If one specifies `“any”`,
+a disjunctive search will be conducted. For example, the following query will 
find documents whose `messageText`
 field contains “sound” or “system”, so a document will be returned if it 
contains either “sound”, “system”,
 or both of the keywords.
 
-       ... where ftcontains($msg.message-text, ["sound", "system"], 
{"mode":"any"})
+       ... where ftcontains(msg.messageText, ["sound", "system"], 
{"mode":"any"})
 
 The other option parameter,`“all”`, specifies a conjunctive search. The 
following examples will find the documents whose
-`message-text` field contains both “sound” and “system”. If a document 
contains only “sound” or “system” but
+`messageText` field contains both “sound” and “system”. If a document contains 
only “sound” or “system” but
 not both, it will not be returned.
 
-       ... where ftcontains($msg.message-text, ["sound", "system"], 
{"mode":"all"})
-       ... where ftcontains($msg.message-text, ["sound", "system"])
+       ... where ftcontains(msg.messageText, ["sound", "system"], 
{"mode":"all"})
+       ... where ftcontains(msg.messageText, ["sound", "system"])
 
 Currently AsterixDB doesn’t (yet) support phrase searches, so the following 
query will not work.
 
-       ... where ftcontains($msg.message-text, "sound system", {"mode":"any"})
+       ... where ftcontains(msg.messageText, "sound system", {"mode":"any"})
 
 As a workaround solution, the following query can be used to achieve a roughly 
similar goal. The difference is that
-the following queries will find documents where `$msg.message-text` contains 
both “sound” and “system”, but the order
+the following queries will find documents where `msg.messageText` contains 
both “sound” and “system”, but the order
 and adjacency of “sound” and “system” are not checked, unlike in a phrase 
search. As a result, the query below would
 also return documents with “sound system can be installed.”, “system sound is 
perfect.”,
 or “sound is not clear. You may need to install a new system.”
 
-       ... where ftcontains($msg.message-text, ["sound", "system"], 
{"mode":"all"})
-       ... where ftcontains($msg.message-text, ["sound", "system"])
+       ... where ftcontains(msg.messageText, ["sound", "system"], 
{"mode":"all"})
+       ... where ftcontains(msg.messageText, ["sound", "system"])
 
 
 ## <a id="FulltextIndex">Creating and utilizing a Full-text index</a> <font 
size="4"><a href="#toc">[Back to TOC]</a></font> ##
@@ -118,6 +106,9 @@
 When there is a full-text index on the field that is being searched, rather 
than scanning all records,
 AsterixDB can utilize that index to expedite the execution of a FTS query. To 
create a full-text index,
 you need to specify the index type as `fulltext` in your DDL statement. For 
instance, the following DDL
-statement create a full-text index on the TweetMessages.message-text attribute.
+statement create a full-text index on the `GleambookMessages.message` 
attribute. Note that a full-text index
+cannot be built on a dataset with the variable-length primary key (e.g., 
string).
 
-    create index messageFTSIdx on TweetMessages(message-text) type fulltext;
+    use TinySocial;
+
+    create index messageFTSIdx on GleambookMessages(message) type fulltext;
diff --git a/asterixdb/asterix-doc/src/site/markdown/aql/similarity.md 
b/asterixdb/asterix-doc/src/site/markdown/aql/similarity.md
index 0d949db..5d4fc9d 100644
--- a/asterixdb/asterix-doc/src/site/markdown/aql/similarity.md
+++ b/asterixdb/asterix-doc/src/site/markdown/aql/similarity.md
@@ -45,8 +45,8 @@
 AsterixDB supports [edit 
distance](http://en.wikipedia.org/wiki/Levenshtein_distance) (on strings) and
 [Jaccard](http://en.wikipedia.org/wiki/Jaccard_index) (on sets).  For
 instance, in our
-[TinySocial](primer.html#ADM:_Modeling_Semistructed_Data_in_AsterixDB)
-example, the `friend-ids` of a Facebook user forms a set
+[TinySocial](../sqlpp/primer-sqlpp.html#ADM:_Modeling_Semistructed_Data_in_AsterixDB)
+example, the `friendIds` of a Gleambook user forms a set
 of friends, and we can define a similarity between the sets of
 friends of two users. We can also convert a string to a set of grams of a 
length "n"
 (called "n-grams") and define the Jaccard similarity between the two
@@ -55,50 +55,45 @@
 `schwarzenegger` are `sch`, `chw`, `hwa`, ..., `ger`.
 
 AsterixDB provides
-[tokenization functions](functions.html#Tokenizing_Functions)
+[tokenization functions](../sqlpp/builtins.html#Tokenizing_Functions)
 to convert strings to sets, and the
-[similarity functions](functions.html#Similarity_Functions).
+[similarity functions](../sqlpp/builtins.html#Similarity_Functions).
 
 ## <a id="SimilaritySelectionQueries">Similarity Selection Queries</a> <font 
size="4"><a href="#toc">[Back to TOC]</a></font> ##
 
 The following query
-asks for all the Facebook users whose name is similar to
+asks for all the Gleambook users whose name is similar to
 `Suzanna Tilson`, i.e., their edit distance is at most 2.
 
-        use dataverse TinySocial;
+        use TinySocial;
 
-        for $user in dataset('FacebookUsers')
-        let $ed := edit-distance($user.name, "Suzanna Tilson")
-        where $ed <= 2
-        return $user
-
+        select u
+        from GleambookUsers u
+        where edit_distance(u.name, "Suzanna Tilson") <= 2;
 
 The following query
-asks for all the Facebook users whose set of friend ids is
+asks for all the Gleambook users whose set of friend ids is
 similar to `[1,5,9,10]`, i.e., their Jaccard similarity is at least 0.6.
 
-        use dataverse TinySocial;
+        use TinySocial;
 
-        for $user in dataset('FacebookUsers')
-        let $sim := similarity-jaccard($user.friend-ids, [1,5,9,10])
-        where $sim >= 0.6f
-        return $user
-
+        select u
+        from GleambookUsers u
+        where similarity_jaccard(u.friendIds, [1,5,9,10]) >= 0.6f;
 
 AsterixDB allows a user to use a similarity operator `~=` to express a
 condition by defining the similarity function and threshold
 using "set" statements earlier. For instance, the above query can be
 equivalently written as:
 
-        use dataverse TinySocial;
+        use TinySocial;
 
         set simfunction "jaccard";
         set simthreshold "0.6f";
 
-        for $user in dataset('FacebookUsers')
-        where $user.friend-ids ~= [1,5,9,10]
-        return $user
-
+        select u
+        from GleambookUsers u
+        where u.friendIds ~= [1,5,9,10];
 
 In this query, we first declare Jaccard as the similarity function
 using `simfunction` and then specify the threshold `0.6f` using
@@ -107,27 +102,19 @@
 ## <a id="SimilarityJoinQueries">Similarity Join Queries</a> <font size="4"><a 
href="#toc">[Back to TOC]</a></font> ##
 
 AsterixDB supports fuzzy joins between two sets. The following
-[query](primer.html#Query_5_-_Fuzzy_Join)
-finds, for each Facebook user, all Twitter users with names
+[query](../sqlpp/primer-sqlpp.html#Query_5_-_Fuzzy_Join)
+finds, for each Gleambook user, all Chirp users with names
 similar to their name based on the edit distance.
 
-        use dataverse TinySocial;
+        use TinySocial;
 
         set simfunction "edit-distance";
         set simthreshold "3";
 
-        for $fbu in dataset FacebookUsers
-        return {
-            "id": $fbu.id,
-            "name": $fbu.name,
-            "similar-users": for $t in dataset TweetMessages
-                                let $tu := $t.user
-                                where $tu.name ~= $fbu.name
-                                return {
-                                "twitter-screenname": $tu.screen-name,
-                                "twitter-name": $tu.name
-                                }
-        };
+        select gbu.id, gbu.name, (select cu.screenName, cu.name
+                                  from ChirpUsers cu
+                                  where cu.name ~= gbu.name) as similar_users
+        from GleambookUsers gbu;
 
 ## <a id="UsingIndexesToSupportSimilarityQueries">Using Indexes to Support 
Similarity Queries</a> <font size="4"><a href="#toc">[Back to TOC]</a></font> ##
 
@@ -146,101 +133,95 @@
 [paper](http://www.ics.uci.edu/~chenli/pub/icde2009-memreducer.pdf).
 
 For instance, the following DDL statements create an ngram index on the
-`FacebookUsers.name` attribute using an inverted index of 3-grams.
+`GleambookUsers.name` attribute using an inverted index of 3-grams.
 
-        use dataverse TinySocial;
+        use TinySocial;
 
-        create index fbUserIdx on FacebookUsers(name) type ngram(3);
+        create index gbUserIdx on GleambookUsers(name) type ngram(3);
 
 The number "3" in "ngram(3)" is the length "n" in the grams. This
 index can be used to optimize similarity queries on this attribute
 using
-[edit-distance](functions.html#edit-distance),
-[edit-distance-check](functions.html#edit-distance-check),
-[similarity-jaccard](functions.html#similarity-jaccard),
-or [similarity-jaccard-check](functions.html#similarity-jaccard-check)
+[edit_distance](../sqlpp/builtins.html#edit_distance),
+[edit_distance_check](../sqlpp/builtins.html#edit_distance_check),
+[similarity_jaccard](../sqlpp/builtins.html#similarity_jaccard),
+or [similarity_jaccard_check](../sqlpp/builtins.html#similarity_jaccard_check)
 queries on this attribute where the
 similarity is defined on sets of 3-grams.  This index can also be used
-to optimize queries with the "[contains()]((functions.html#contains))" 
predicate (i.e., substring
+to optimize queries with the "[contains()]((../sqlpp/builtins.html#contains))" 
predicate (i.e., substring
 matching) since it can be also be solved by counting on the inverted
 lists of the grams in the query string.
 
-#### NGram Index usage case - [edit-distance](functions.html#edit-distance) 
####
+#### NGram Index usage case - 
[edit_distance](../sqlpp/builtins.html#edit-distance) ####
 
-        use dataverse TinySocial;
+        use TinySocial;
 
-        for $user in dataset('FacebookUsers')
-        let $ed := edit-distance($user.name, "Suzanna Tilson")
-        where $ed <= 2
-        return $user
+        select u
+        from GleambookUsers u
+        where edit_distance(u.name, "Suzanna Tilson") <= 2;
 
-#### NGram Index usage case - 
[edit-distance-check](functions.html#edit-distance-check) ####
+#### NGram Index usage case - 
[edit_distance_check](../sqlpp/builtins.html#edit_distance_check) ####
 
-        use dataverse TinySocial;
+        use TinySocial;
 
-        for $user in dataset('FacebookUsers')
-        let $ed := edit-distance-check($user.name, "Suzanna Tilson", 2)
-        where $ed[0]
-        return $ed[1]
+        select u
+        from GleambookUsers u
+        where edit_distance_check(u.name, "Suzanna Tilson", 2)[0];
 
-#### NGram Index usage case - 
[similarity-jaccard](functions.html#similarity-jaccard) ####
+#### NGram Index usage case - [contains()]((../sqlpp/builtins.html#contains)) 
####
 
-        use dataverse TinySocial;
+        use TinySocial;
 
-        for $user in dataset('FacebookUsers')
-        let $sim := similarity-jaccard($user.friend-ids, [1,5,9,10])
-        where $sim >= 0.6f
-        return $user
-
-#### NGram Index usage case - 
[similarity-jaccard-check](functions.html#similarity-jaccard-check) ####
-
-        use dataverse TinySocial;
-
-        for $user in dataset('FacebookUsers')
-        let $sim := similarity-jaccard-check($user.friend-ids, [1,5,9,10], 
0.6f)
-        where $sim[0]
-        return $user
-
-#### NGram Index usage case - [contains()]((functions.html#contains)) ####
-
-        use dataverse TinySocial;
-
-        for $i in dataset('FacebookMessages')
-        where contains($i.message, "phone")
-        return {"mid": $i.message-id, "message": $i.message}
+        select m
+        from GleambookMessages m
+        where contains(m.message, "phone");
 
 
 ### Keyword Index ###
 
-A "keyword index" is constructed on a set of strings or sets (e.g., 
OrderedList, UnorderedList). Instead of
+A "keyword index" is constructed on a set of strings or sets (e.g., array, 
multiset). Instead of
 generating grams as in an ngram index, we generate tokens (e.g., words) and 
for each token, construct an inverted list that includes the ids of the
 objects with this token.  The following two examples show how to create 
keyword index on two different types:
 
 
 #### Keyword Index on String Type ####
 
-        use dataverse TinySocial;
+        use TinySocial;
 
-        drop index FacebookMessages.fbMessageIdx if exists;
-        create index fbMessageIdx on FacebookMessages(message) type keyword;
+        drop index GleambookMessages.gbMessageIdx if exists;
+        create index gbMessageIdx on GleambookMessages(message) type keyword;
 
-        for $o in dataset('FacebookMessages')
-        let $jacc := similarity-jaccard-check(word-tokens($o.message), 
word-tokens("love like ccast"), 0.2f)
-        where $jacc[0]
-        return $o
+        select m
+        from GleambookMessages m
+        where similarity_jaccard_check(word_tokens(m.message), 
word_tokens("love like ccast"), 0.2f)[0];
 
 #### Keyword Index on UnorderedList Type ####
 
-        use dataverse TinySocial;
+        use TinySocial;
 
-        create index fbUserIdx_fids on FacebookUsers(friend-ids) type keyword;
+        create index gbUserIdxFIds on GleambookUsers(friendIds) type keyword;
 
-        for $c in dataset('FacebookUsers')
-        let $jacc := similarity-jaccard-check($c.friend-ids, {{3,10}}, 0.5f)
-        where $jacc[0]
-        return $c
+        select u
+        from GleambookUsers u
+        where similarity_jaccard_check(u.friendIds, {{3,10}}, 0.5f)[0];
 
 As shown above, keyword index can be used to optimize queries with token-based 
similarity predicates, including
-[similarity-jaccard](functions.html#similarity-jaccard) and
-[similarity-jaccard-check](functions.html#similarity-jaccard-check).
+[similarity_jaccard](../sqlpp/builtins.html#similarity_jaccard) and
+[similarity_jaccard_check](../sqlpp/builtins.html#similarity_jaccard_check).
+
+#### Keyword Index usage case - 
[similarity_jaccard](../sqlpp/builtins.html#similarity_jaccard) ####
+
+        use TinySocial;
+
+        select u
+        from GleambookUsers u
+        where similarity_jaccard(u.friendIds, [1,5,9,10]) >= 0.6f;
+
+#### Keyword Index usage case - 
[similarity_jaccard_check](../sqlpp/builtins.html#similarity_jaccard_check) ####
+
+        use TinySocial;
+
+        select u
+        from GleambookUsers u
+        where similarity_jaccard_check(u.friendIds, [1,5,9,10], 0.6f)[0];
 

-- 
To view, visit https://asterix-gerrit.ics.uci.edu/2567
To unsubscribe, visit https://asterix-gerrit.ics.uci.edu/settings

Gerrit-MessageType: newchange
Gerrit-Change-Id: Icd9c5bc6249feb03b4297bdc84b5f3aa0efcdc47
Gerrit-PatchSet: 1
Gerrit-Project: asterixdb
Gerrit-Branch: master
Gerrit-Owner: Taewoo Kim <[email protected]>

Change in asterixdb[master]: [ASTERIXDB-2349][SITE] Revise fulltext and similarity docume...

Reply via email to