Repository: asterixdb Updated Branches: refs/heads/master 30c5959d5 -> 10351a747
http://git-wip-us.apache.org/repos/asf/asterixdb/blob/10351a74/asterixdb/asterix-doc/src/site/markdown/aql/manual.md
----------------------------------------------------------------------
diff --git a/asterixdb/asterix-doc/src/site/markdown/aql/manual.md b/asterixdb/asterix-doc/src/site/markdown/aql/manual.md
index 393beec..ecdc715 100644
--- a/asterixdb/asterix-doc/src/site/markdown/aql/manual.md
+++ b/asterixdb/asterix-doc/src/site/markdown/aql/manual.md
@@ -66,7 +66,7 @@ Each will be detailed as we explore the full AQL grammar.
 | FunctionCallExpr
 | DatasetAccessExpression
 | ListConstructor
- | RecordConstructor
+ | ObjectConstructor
 The most basic building block for any AQL expression is the PrimaryExpr. This can be a simple literal (constant) value,
@@ -75,7 +75,7 @@ a parenthesized expression,
 a function call, an expression accessing the ADM contents of a dataset, a newly constructed list of ADM instances,
-or a newly constructed ADM record.
+or a newly constructed ADM object.
 #### Literals
@@ -168,7 +168,7 @@ The following example is a (built-in) function call expression whose value is 8.
 <SPECIALCHARS> ::= ["$", "_", "-"]
 Querying Big Data is the main point of AsterixDB and AQL.
-Data in AsterixDB reside in datasets (collections of ADM records),
+Data in AsterixDB reside in datasets (collections of ADM objects),
 each of which in turn resides in some namespace known as a dataverse (data universe). Data access in a query expression is accomplished via a DatasetAccessExpression. Dataset access expressions are most commonly used in FLWOR expressions, where variables
@@ -193,21 +193,21 @@ The third one does the same thing as the second but uses a slightly older AQL sy
 ListConstructor ::= ( OrderedListConstructor | UnorderedListConstructor )
 OrderedListConstructor ::= "[" ( Expression ( "," Expression )* )? "]"
 UnorderedListConstructor ::= "{{" ( Expression ( "," Expression )* )? "}}"
- RecordConstructor ::= "{" ( FieldBinding ( "," FieldBinding )* )? "}"
+ ObjectConstructor ::= "{" ( FieldBinding ( "," FieldBinding )* )? "}"
 FieldBinding ::= Expression ":" Expression
 A major feature of AQL is its ability to construct new ADM data instances. This is accomplished using its constructors for each of the major ADM complex object structures,
-namely lists (ordered or unordered) and records.
+namely lists (ordered or unordered) and objects.
 Ordered lists are like JSON arrays, while unordered lists have bag (multiset) semantics.
-Records are built from attributes that are field-name/field-value pairs, again like JSON.
+Objects are built from attributes that are field-name/field-value pairs, again like JSON.
 (See the AsterixDB Data Model document for more details on each.) The following examples illustrate how to construct a new ordered list with 3 items,
-a new unordered list with 4 items, and a new record with 2 fields, respectively.
+a new unordered list with 4 items, and a new object with 2 fields, respectively.
 List elements can be homogeneous (as in the first example), which is the common case, or they may be heterogeneous (as in the second example).
-The data values and field name values used to construct lists and records in constructors are all simply AQL expressions.
+The data values and field name values used to construct lists and objects in constructors are all simply AQL expressions.
 Thus the list elements, field names, and field values used in constructors can be simple literals (as in these three examples) or they can come from query variable references or even arbitrarily complex AQL expressions.
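Inline note, not part of the patch: the constructor semantics described in the hunk above can be modelled in a few lines of Python. This is only a sketch under stated assumptions. The names `ordered`, `bag_a`, `bag_b`, `prefix`, and `obj` are hypothetical, and nothing here calls an AsterixDB API. It illustrates that an ordered list behaves like a JSON array, that an unordered list behaves like a bag (multiset), and that both the field names and the field values of an object may be computed expressions.

```python
from collections import Counter

# Ordered list, as in the AQL constructor [ "a", "b", "c" ]:
# element order is significant, like a JSON array.
ordered = ["a", "b", "c"]

# Unordered list, as in the AQL constructor {{ 1, 2, 2, 3 }}:
# a bag (multiset), so duplicates count but order does not.
# Counter is used here purely as a stand-in for bag semantics.
bag_a = Counter([1, 2, 2, 3])
bag_b = Counter([3, 2, 1, 2])

# Object: field names and field values are both ordinary expressions,
# so either side may be computed rather than written as a literal.
prefix = "project"
obj = {prefix + "-name": "AsterixDB", "member-count": len(ordered)}

print(ordered == ["c", "b", "a"])  # compares position by position
print(bag_a == bag_b)              # compares element multiplicities only
print(obj)
```

Modelling the bag with `Counter` keeps duplicate counts while discarding order, which is the distinction the manual draws between `[ ... ]` and `{{ ... }}`.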
@@ -224,7 +224,7 @@ or they can come from query variable references or even arbitrarily complex AQL
 ##### Note
-When constructing nested records there needs to be a space between the closing braces to avoid confusion with the `}}` token that ends an unordered list constructor:
+When constructing nested objects there needs to be a space between the closing braces to avoid confusion with the `}}` token that ends an unordered list constructor:
 `{ "a" : { "b" : "c" }}` will fail to parse while `{ "a" : { "b" : "c" } }` will work.
 ### Path Expressions
@@ -234,13 +234,13 @@ When constructing nested records there needs to be a space between the closing b
 Index ::= "[" ( Expression | "?" ) "]"
 Components of complex types in ADM are accessed via path expressions.
-Path access can be applied to the result of an AQL expression that yields an instance of such a type, e.g., a record or list instance.
-For records, path access is based on field names.
+Path access can be applied to the result of an AQL expression that yields an instance of such a type, e.g., an object or list instance.
+For objects, path access is based on field names.
 For ordered lists, path access is based on (zero-based) array-style indexing. AQL also supports an "I'm feeling lucky" style index accessor, [?], for selecting an arbitrary element from an ordered list. Attempts to access non-existent fields or list elements produce a null (i.e., missing information) result as opposed to signaling a runtime error.
-The following examples illustrate field access for a record, index-based element access for an ordered list, and also a composition thereof.
+The following examples illustrate field access for an object, index-based element access for an ordered list, and also a composition thereof.
 ##### Examples
@@ -341,7 +341,7 @@ The following example shows a FLWOR expression that selects and returns one user
 The next example shows a FLWOR expression that joins two datasets, FacebookUsers and FacebookMessages, returning user/message pairs.
-The results contain one record per pair, with result records containing the user's name and an entire message.
+The results contain one object per pair, with result objects containing the user's name and an entire message.
 ##### Example
@@ -355,7 +355,7 @@ The results contain one record per pair, with result records containing the user
 };
 In the next example, a `let` clause is used to bind a variable to all of a user's FacebookMessages.
-The query returns one record per user, with result records containing the user's name and the set of all messages by that user.
+The query returns one object per user, with result objects containing the user's name and the set of all messages by that user.
 ##### Example
@@ -485,7 +485,7 @@ It is useful to note that if the set were instead the empty set, the first expre
 In addition to expresssions for queries, AQL supports a variety of statements for data definition and manipulation purposes as well as controlling the context to be used in
-evaluating AQL expressions. AQL supports record-level ACID transactions that begin and terminate implicitly for each record inserted, deleted, upserted, or searched while a given AQL statement is being executed.
+evaluating AQL expressions. AQL supports object-level ACID transactions that begin and terminate implicitly for each object inserted, deleted, upserted, or searched while a given AQL statement is being executed.
 This section details the statements supported in the AQL language.
@@ -564,9 +564,9 @@ The following example creates a dataverse named TinySocial.
 TypeSpecification ::= "type" FunctionOrTypeName IfNotExists "as" TypeExpr
 FunctionOrTypeName ::= QualifiedName
 IfNotExists ::= ( "if not exists" )?
- TypeExpr ::= RecordTypeDef | TypeReference | OrderedListTypeDef | UnorderedListTypeDef
- RecordTypeDef ::= ( "closed" | "open" )? "{" ( RecordField ( "," RecordField )* )? "}"
- RecordField ::= Identifier ":" ( TypeExpr ) ( "?" )?
+ TypeExpr ::= ObjectTypeDef | TypeReference | OrderedListTypeDef | UnorderedListTypeDef
+ ObjectTypeDef ::= ( "closed" | "open" )? "{" ( ObjectField ( "," ObjectField )* )? "}"
+ ObjectField ::= Identifier ":" ( TypeExpr ) ( "?" )?
 NestedField ::= Identifier ( "." Identifier )*
 IndexField ::= NestedField ( ":" TypeReference )?
 TypeReference ::= Identifier
@@ -576,16 +576,16 @@ The following example creates a dataverse named TinySocial.
 The create type statement is used to create a new named ADM datatype. This type can then be used to create datasets or utilized when defining one or more other ADM datatypes. Much more information about the Asterix Data Model (ADM) is available in the [data model reference guide](datamodel.html) to ADM.
-A new type can be a record type, a renaming of another type, an ordered list type, or an unordered list type.
-A record type can be defined as being either open or closed.
-Instances of a closed record type are not permitted to contain fields other than those specified in the create type statement.
-Instances of an open record type may carry additional fields, and open is the default for a new type (if neither option is specified).
+A new type can be an object type, a renaming of another type, an ordered list type, or an unordered list type.
+An object type can be defined as being either open or closed.
+Instances of a closed object type are not permitted to contain fields other than those specified in the create type statement.
+Instances of an open object type may carry additional fields, and open is the default for a new type (if neither option is specified).
-The following example creates a new ADM record type called FacebookUser type.
+The following example creates a new ADM object type called FacebookUser type.
 Since it is closed, its instances will contain only what is specified in the type definition. The first four fields are traditional typed name/value pairs. The friend-ids field is an unordered list of 32-bit integers.
-The employment field is an ordered list of instances of another named record type, EmploymentType.
+The employment field is an ordered list of instances of another named object type, EmploymentType.
 ##### Example
@@ -598,7 +598,7 @@ The employment field is an ordered list of instances of another named record typ
 "employment" : [ EmploymentType ]
 }
-The next example creates a new ADM record type called FbUserType. Note that the type of the id field is UUID. You need to use this field type if you want to have this field be an autogenerated-PK field. Refer to the Datasets section later for more details.
+The next example creates a new ADM object type called FbUserType. Note that the type of the id field is UUID. You need to use this field type if you want to have this field be an autogenerated-PK field. Refer to the Datasets section later for more details.
 ##### Example
@@ -628,12 +628,12 @@ The next example creates a new ADM record type called FbUserType. Note that the
 PrimaryKey ::= "primary" "key" Identifier ( "," Identifier )* ( "autogenerated ")?
 The create dataset statement is used to create a new dataset.
-Datasets are named, unordered collections of ADM record instances; they
+Datasets are named, unordered collections of ADM object instances; they
 are where data lives persistently and are the targets for queries in AsterixDB. Datasets are typed, and AsterixDB will ensure that their contents conform to their type definitions. An Internal dataset (the default) is a dataset that is stored in and managed by AsterixDB. It must have a specified unique primary key that can be used to partition data across nodes of an AsterixDB cluster.
-The primary key is also used in secondary indexes to uniquely identify the indexed primary data records. Random primary key (UUID) values can be auto-generated by declaring the field to be UUID and putting "autogenerated" after the "primary key" identifier. In this case, values for the auto-generated PK field should not be provided by the user since it will be auto-generated by AsterixDB.
+The primary key is also used in secondary indexes to uniquely identify the indexed primary data objects. Random primary key (UUID) values can be auto-generated by declaring the field to be UUID and putting "autogenerated" after the "primary key" identifier. In this case, values for the auto-generated PK field should not be provided by the user since it will be auto-generated by AsterixDB.
 Optionally, a filter can be created on a field to further optimize range queries with predicates on the filter's field. (Refer to [Filter-Based LSM Index Acceleration](filters.html) for more information about filters.)
@@ -667,19 +667,19 @@ associated with the same dataset.
 The default policy for AsterixDB is the prefix policy except when there is a filter on a dataset, where the preferred policy for filters is the correlated-prefix.
-The following example creates an internal dataset for storing FacefookUserType records.
+The following example creates an internal dataset for storing FacebookUserType objects.
 It specifies that their id field is their primary key.
 ##### Example
 create internal dataset FacebookUsers(FacebookUserType) primary key id;
-The following example creates an internal dataset for storing FbUserType records.
-It specifies that their id field is their primary key. It also specifies that the id field is an auto-generated field, meaning that a randomly generated UUID value will be assigned to each record by the system. (A user should therefore not proivde a value for this field.) Note that the id field should be UUID.
+The following example creates an internal dataset for storing FbUserType objects.
+It specifies that their id field is their primary key. It also specifies that the id field is an auto-generated field, meaning that a randomly generated UUID value will be assigned to each object by the system. (A user should therefore not provide a value for this field.) Note that the id field should be UUID.
 ##### Example
 create internal dataset FbMsgs(FbUserType) primary key id autogenerated;
-The next example creates an external dataset for storing LineitemType records.
+The next example creates an external dataset for storing LineitemType objects.
 The choice of the `hdfs` adapter means that its data will reside in HDFS. The create statement provides parameters used by the hdfs adapter: the URL and path needed to locate the data in HDFS and a description of the data format.
@@ -708,7 +708,7 @@ An index can be created on a nested field (or fields) by providing a valid path
 An index field is not required to be part of the datatype associated with a dataset if that datatype is declared as open and the field's type is provided along with its type and the `enforced` keyword is specified in the end of index definition. `Enforcing` an open field will introduce a check that will make sure that the actual type of an indexed
-field (if the field exists in the record) always matches this specified (open) field type.
+field (if the field exists in the object) always matches this specified (open) field type.
 The following example creates a btree index called fbAuthorIdx on the author-id field of the FacebookMessages dataset. This index can be useful for accelerating exact-match queries, range search queries, and joins involving the author-id field.
@@ -834,12 +834,11 @@ being the insertion of a single object plus its affiliated secondary index entri
 If the query part of an insert returns a single object, then the insert statement itself will be a single, atomic transaction.
 If the query part returns multiple objects, then each object inserted will be handled independently
-as a tranaction. If a dataset has an auto-generated primary key field, an insert statement should not include a value for that field in it. (The system will automatically extend the provided record with this additional field and a corresponding value.).
-The optional "as Variable" provides a variable binding for the inserted records, which can be used in the "returning" clause.
-The optional "returning Query" allows users to run simple queries/functions on the records returned by the insert.
+as a transaction. If a dataset has an auto-generated primary key field, an insert statement should not include a value for that field in it. (The system will automatically extend the provided object with this additional field and a corresponding value.)
+The optional "as Variable" provides a variable binding for the inserted objects, which can be used in the "returning" clause.
+The optional "returning Query" allows users to run simple queries/functions on the objects returned by the insert.
 This query cannot refer to any datasets.
-
 The following example illustrates a query-based insertion.
 ##### Example
http://git-wip-us.apache.org/repos/asf/asterixdb/blob/10351a74/asterixdb/asterix-doc/src/site/markdown/aql/primer.md
----------------------------------------------------------------------
diff --git a/asterixdb/asterix-doc/src/site/markdown/aql/primer.md b/asterixdb/asterix-doc/src/site/markdown/aql/primer.md
index e07edb6..d245158 100644
--- a/asterixdb/asterix-doc/src/site/markdown/aql/primer.md
+++ b/asterixdb/asterix-doc/src/site/markdown/aql/primer.md
@@ -132,12 +132,12 @@ some of the key features of AsterixDB. :-))
 The first three lines above tell AsterixDB to drop the old TinySocial dataverse, if one already exists, and then to create a brand new one and make it the focus of the statements that follow.
 The first _create type_ statement creates a datatype for holding information about Chirp users.
-It is a record type with a mix of integer and string data, very much like a (flat) relational tuple.
+It is an object type with a mix of integer and string data, very much like a (flat) relational tuple.
 The indicated fields are all mandatory, but because the type is open, additional fields are welcome. The second statement creates a datatype for Chirp messages; this shows how to specify a closed type. Interestingly (based on one of Chirp's APIs), each Chirp message actually embeds an instance of the sending user's information (current as of when the message was sent), so this is an example of a nested
-record in ADM.
+object in ADM.
 Chirp messages can optionally contain the sender's location, which is modeled via the senderLocation field of spatial type _point_; the question mark following the field type indicates its optionality. An optional field is like a nullable field in SQL---it may be present or missing, but when it's present,
@@ -147,11 +147,11 @@ Lastly, the referredTopics field illustrates another way that ADM is richer than
 this field holds a bag (*a.k.a.* an unordered list) of strings. Since the overall datatype definition for Chirp messages says "closed", the fields that it lists are the only fields that instances of this type will be allowed to contain.
-The next two _create type_ statements create a record type for holding information about one component of
-the employment history of a Gleambook user and then a record type for holding the user information itself.
+The next two _create type_ statements create an object type for holding information about one component of
+the employment history of a Gleambook user and then an object type for holding the user information itself.
 The Gleambook user type highlights a few additional ADM data model features.
 Its friendIds field is a bag of integers, presumably the Gleambook user ids for this user's friends,
-and its employment field is an ordered list of employment records.
+and its employment field is an ordered list of employment objects.
 The final _create type_ statement defines a type for handling the content of a Gleambook message in our hypothetical social data storage scenario.
@@ -243,7 +243,7 @@ was named in the most recently executed _use dataverse_ directive.
 ## Loading Data Into AsterixDB ##
 Okay, so far so good---AsterixDB is now ready for data, so let's give it some data to store. Our next task will be to load some sample data into the four datasets that we just defined.
-Here we will load a tiny set of records, defined in ADM format (a superset of JSON), into each dataset.
+Here we will load a tiny set of objects, defined in ADM format (a superset of JSON), into each dataset.
 In the boxes below you can see the actual data instances contained in each of the provided sample files. In order to load this data yourself, you should first store the four corresponding `.adm` files (whose URLs are indicated on top of each box below) into a filesystem directory accessible to your
@@ -307,7 +307,7 @@ of the data instances will be stored separately from their associated field name
 {"messageId":14,"authorId":9,"inResponseTo":12,"senderLocation":point("41.33,85.28"),"message":" love at&t its 3G is good:)"}
 {"messageId":15,"authorId":7,"inResponseTo":11,"senderLocation":point("44.47,67.11"),"message":" like iphone the voicemail-service is awesome"}
-It's loading time! We can use AQL _LOAD_ statements to populate our datasets with the sample records shown above.
+It's loading time! We can use AQL _LOAD_ statements to populate our datasets with the sample objects shown above.
 The following shows how loading can be done for data stored in `.adm` files in your local filesystem.
 *Note:* You _MUST_ replace the `<Host Name>` and `<Absolute File Path>` placeholders in each load statement below with valid values based on the host IP address (or host name) for the machine and
@@ -466,7 +466,7 @@ We could do this as follows in AQL:
 };
 The result of this query is a sequence of new ADM instances, one for each author/message pair.
-Each instance in the result will be an ADM record containing two fields, "uname" and "message",
+Each instance in the result will be an ADM object containing two fields, "uname" and "message",
 containing the user's name and the message text, respectively, for each author/message pair. (Note that "uname" and "message" are both simple AQL expressions themselves---so in the most general case, even the resulting field names can be computed as part of the query, making AQL
@@ -561,7 +561,7 @@ grouped by customer, without omitting those customers who haven't placed any ord
 The AQL language supports nesting, both of queries and of query results, and the combination allows for an arguably cleaner/more natural approach to such queries.
-As an example, supposed we wanted, for each Gleambook user, to produce a record that has his/her name
+As an example, suppose we wanted, for each Gleambook user, to produce an object that has his/her name
 plus a list of the messages written by that user. In SQL, this would involve a left outer join between users and messages, grouping by user, and having the user name repeated along side each message.
@@ -578,7 +578,7 @@ In AQL, this sort of use case can be handled (more naturally) as follows:
 };
 This AQL query binds the variable `$user` to the data instances in GleambookUsers;
-for each user, it constructs a result record containing a "uname" field with the user's
+for each user, it constructs a result object containing a "uname" field with the user's
 name and a "messages" field with a nested collection of all messages for that user.
 The nested collection for each user is specified by using a correlated subquery. (Note: While it looks like nested loops could be involved in computing the result,
@@ -678,7 +678,7 @@ The expected result for this query against our sample data is:
 The expressive power of AQL includes support for queries involving "some" (existentially quantified) and "all" (universally quantified) query semantics. As an example of an existential AQL query, here we show a query to list the Gleambook users who are currently employed.
-Such employees will have an employment history containing a record with the endDate value missing, which leads us to the
+Such employees will have an employment history containing an object with the endDate value missing, which leads us to the
 following AQL query:
 use dataverse TinySocial;
@@ -699,7 +699,7 @@ The expected result in this case is:
 ### Query 7 - Universal Quantification ###
 As an example of a universal AQL query, here we show a query to list the Gleambook users who are currently unemployed.
-Such employees will have an employment history containing no records that miss endDate values, leading us to the
+Such employees will have an employment history containing no objects that miss endDate values, leading us to the
 following AQL query:
 use dataverse TinySocial;
@@ -747,11 +747,11 @@ Thus, following the _group by_ clause, the _return_ clause in this query sees a
 with each such group having an associated $uid variable value (i.e., the chirping user's screen name). In the context of the return clause, due to "... with $cm ...", $uid is bound to the chirper's id and $cm is bound to the _set_ of chirps issued by that chirper.
-The return clause constructs a result record containing the chirper's user id and the count of the items
+The return clause constructs a result object containing the chirper's user id and the count of the items
 in the associated chirp set.
-The query result will contain one such record per screen name.
+The query result will contain one such object per screen name.
 This query also illustrates another feature of AQL; notice that each user's screen name is accessed via a
-path syntax that traverses each chirp's nested record structure.
+path syntax that traverses each chirp's nested object structure.
 Here is the expected result for this query over the sample data:
@@ -832,7 +832,7 @@ finds all of the chirps that are similar based on the topics that they refer to:
 This query illustrates several things worth knowing in order to write fuzzy queries in AQL. First, as mentioned earlier, AQL offers an operator-based syntax for seeing whether two values are "similar" to one another or not.
-Second, recall that the referredTopics field of records of datatype ChirpMessageType is a bag of strings.
+Second, recall that the referredTopics field of objects of datatype ChirpMessageType is a bag of strings.
 This query sets the context for its similarity join by requesting that Jaccard-based similarity semantics ([http://en.wikipedia.org/wiki/Jaccard_index](http://en.wikipedia.org/wiki/Jaccard_index)) be used for the query's similarity operator and that a similarity index of 0.3 be used as its similarity threshold.
@@ -881,7 +881,7 @@ have all gone up in the interim, although he appears not to have moved in the la
 In general, the data to be inserted may be specified using any valid AQL query expression. The insertion of a single object instance, as in this example, is just a special case where
-the query expression happens to be a record constructor involving only constants.
+the query expression happens to be an object constructor involving only constants.
 ### Deleting Existing Data ###
 In addition to inserting new data, AsterixDB supports deletion from datasets via the AQL _delete_ statement.
@@ -896,13 +896,13 @@ The following example deletes the chirp that we just added from user "NathanGies
 It should be noted that one form of data change not yet supported by AsterixDB is in-place data modification (_update_). Currently, only insert and delete operations are supported; update is not.
-To achieve the effect of an update, two statements are currently needed---one to delete the old record from the
-dataset where it resides, and another to insert the new replacement record (with the same primary key but with
+To achieve the effect of an update, two statements are currently needed---one to delete the old object from the
+dataset where it resides, and another to insert the new replacement object (with the same primary key but with
 different field values for some of the associated data content).
 ### Upserting Data ###
 In addition to loading, querying, inserting, and deleting data, AsterixDB supports upserting
-records using the AQL _upsert_ statement.
+objects using the AQL _upsert_ statement.
 The following example deletes the chirp with chirpId = 20 (if one exists) and inserts the new chirp with chirpId = 20 by user "SwanSmitty" to the ChirpMessages dataset. The two
@@ -948,11 +948,11 @@ For example, the following statement might be used to double the followers count
 Note that such an upsert operation is executed in two steps: The query is performed, after which the query's locks are released, and then its result is upserted into the dataset.
-This means that a record can be modified between computing the query result and performing the upsert.
+This means that an object can be modified between computing the query result and performing the upsert.
 ### Transaction Support
-AsterixDB supports record-level ACID transactions that begin and terminate implicitly for each record inserted, deleted, or searched while a given AQL statement is being executed. This is quite similar to the level of transaction support found in today's NoSQL stores. AsterixDB does not support multi-statement transactions, and in fact an AQL statement that involves multiple records can itself involve multiple independent record-level transactions. An example consequence of this is that, when an AQL statement attempts to insert 1000 records, it is possible that the first 800 records could end up being committed while the remaining 200 records fail to be inserted. This situation could happen, for example, if a duplicate key exception occurs as the 801st insertion is attempted. If this happens, AsterixDB will report the error (e.g., a duplicate key exception) as the result of the offending AQL insert statement, and the application logic above will need to take the appropriate action(s ) needed to assess the resulting state and to clean up and/or continue as appropriate.
+AsterixDB supports object-level ACID transactions that begin and terminate implicitly for each object inserted, deleted, or searched while a given AQL statement is being executed. This is quite similar to the level of transaction support found in today's NoSQL stores. AsterixDB does not support multi-statement transactions, and in fact an AQL statement that involves multiple objects can itself involve multiple independent object-level transactions. An example consequence of this is that, when an AQL statement attempts to insert 1000 objects, it is possible that the first 800 objects could end up being committed while the remaining 200 objects fail to be inserted. This situation could happen, for example, if a duplicate key exception occurs as the 801st insertion is attempted. If this happens, AsterixDB will report the error (e.g., a duplicate key exception) as the result of the offending AQL insert statement, and the application logic above will need to take the appropriate action(s) needed to assess the resulting state and to clean up and/or continue as appropriate.
 ## Further Help ##
 That's it!
 You are now armed and dangerous with respect to semistructured data management using AsterixDB and AQL.
http://git-wip-us.apache.org/repos/asf/asterixdb/blob/10351a74/asterixdb/asterix-doc/src/site/markdown/aql/similarity.md
----------------------------------------------------------------------
diff --git a/asterixdb/asterix-doc/src/site/markdown/aql/similarity.md b/asterixdb/asterix-doc/src/site/markdown/aql/similarity.md
index 9fa3d44..88ca8a5 100644
--- a/asterixdb/asterix-doc/src/site/markdown/aql/similarity.md
+++ b/asterixdb/asterix-doc/src/site/markdown/aql/similarity.md
@@ -30,7 +30,7 @@
 ## <a id="Motivation">Motivation</a> <font size="4"><a href="#toc">[Back to TOC]</a></font> ##
 Similarity queries are widely used in applications where users need to
-find records that satisfy a similarity predicate, while exact matching
+find objects that satisfy a similarity predicate, while exact matching
 is not sufficient. These queries are especially important for social and Web applications, where errors, abbreviations, and inconsistencies are common. As an example, we may want to find all the movies
@@ -214,7 +214,7 @@ lists of the grams in the query string.
 A "keyword index" is constructed on a set of strings or sets (e.g., OrderedList, UnorderedList). Instead of generating grams as in an ngram index, we generate tokens (e.g., words) and for each token, construct an inverted list that includes the ids of the
-records with this token. The following two examples show how to create keyword index on two different types:
+objects with this token. The following two examples show how to create keyword index on two different types:
 #### Keyword Index on String Type ####
http://git-wip-us.apache.org/repos/asf/asterixdb/blob/10351a74/asterixdb/asterix-doc/src/site/markdown/csv.md
----------------------------------------------------------------------
diff --git a/asterixdb/asterix-doc/src/site/markdown/csv.md b/asterixdb/asterix-doc/src/site/markdown/csv.md
index d761aaf..1ff03d9 100644
--- a/asterixdb/asterix-doc/src/site/markdown/csv.md
+++ b/asterixdb/asterix-doc/src/site/markdown/csv.md
@@ -23,15 +23,15 @@
 AsterixDB supports the CSV format for both data input and query result output. In both cases, the structure of the CSV data must be defined
-using a named ADM record datatype. The CSV format, limitations, and
+using a named ADM object datatype. The CSV format, limitations, and
 MIME type are defined by [RFC 4180](https://tools.ietf.org/html/rfc4180). CSV is not as expressive as the full Asterix Data Model, meaning that not all data which can be represented in ADM can also be represented as CSV. So the form of this datatype is limited. First, obviously it
-may not contain any nested records or lists, as CSV has no way to
-represent nested data structures. All fields in the record type must
+may not contain any nested objects or lists, as CSV has no way to
+represent nested data structures. All fields in the object type must
 be primitive. Second, the set of supported primitive types is limited to numerics (`int8`, `int16`, `int32`, `int64`, `float`, `double`) and `string`. On output, a few additional primitive types (`boolean`,
@@ -101,11 +101,11 @@ lines of data being skipped as well.
 ## CSV Output
 Any query may be rendered as CSV when using AsterixDB's HTTP
-interface. To do so, there are two steps required: specify the record
+interface. To do so, there are two steps required: specify the object
 type which defines the schema of your CSV, and request that Asterix use the CSV output format.
-#### Output Record Type +#### Output Object Type Background: The result of any AQL query is an unordered list of _instances_, where each _instance_ is an instance of an AQL @@ -113,15 +113,15 @@ datatype. When requesting CSV output, there are some restrictions on the legal datatypes in this unordered list due to the limited expressability of CSV: -1. Each instance must be of a record type. -2. Each instance must be of the _same_ record type. -3. The record type must conform to the content and type restrictions +1. Each instance must be of a object type. +2. Each instance must be of the _same_ object type. +3. The object type must conform to the content and type restrictions mentioned in the introduction. While it would be possible to structure your query to cast all result instances to a given type, it is not necessary. AQL offers a built-in feature which will automatically cast all top-level instances in the -result to a specified named ADM record type. To enable this feature, +result to a specified named ADM object type. To enable this feature, use a `set` statement prior to the query to set the parameter `output-record-type` to the name of an ADM type. This type must have already been defined in the current dataverse. @@ -142,7 +142,7 @@ from different underlying datasets, etc. Two notes about `output-record-type`: 1. This feature is not strictly related to CSV; it may be used with -any output formats (in which case, any record datatype may be +any output formats (in which case, any object datatype may be specified, not subject to the limitations specified in the introduction of this page). 2. When the CSV output format is requested, `output-record-type` is in @@ -230,16 +230,16 @@ HTTP Accept: header. This is consistent with RFC 4180. #### Issues with open datatypes and optional fields -As mentioned earlier, CSV is a rigid format. It cannot express records +As mentioned earlier, CSV is a rigid format. 
It cannot express objects with different numbers of fields, which ADM allows through both open datatypes and optional fields. -If your output record type contains optional fields, this will not +If your output object type contains optional fields, this will not result in any errors. If the output data of a query does not contain values for an optional field, this will be represented in CSV as `null`. -If your output record type is open, this will also not result in any +If your output object type is open, this will also not result in any errors. If the output data of a query contains any open fields, the corresponding rows in the resulting CSV will contain more comma-separated values than the others. On each such row, the data @@ -253,6 +253,6 @@ the file"). Hence it will likely not be handled consistently by all CSV processors. Some may throw a parsing error. If you attempt to load this data into AsterixDB later using `load dataset`, the extra fields will be silently ignored. For this reason it is recommended that you -use only closed datatypes as output record types. AsterixDB allows to -use an open record type only to support cases where the type already +use only closed datatypes as output object types. AsterixDB allows to +use an open object type only to support cases where the type already exists for other parts of your application. 
http://git-wip-us.apache.org/repos/asf/asterixdb/blob/10351a74/asterixdb/asterix-doc/src/site/markdown/datamodel.md ---------------------------------------------------------------------- diff --git a/asterixdb/asterix-doc/src/site/markdown/datamodel.md b/asterixdb/asterix-doc/src/site/markdown/datamodel.md index 5a5aced..cba0d18 100644 --- a/asterixdb/asterix-doc/src/site/markdown/datamodel.md +++ b/asterixdb/asterix-doc/src/site/markdown/datamodel.md @@ -43,7 +43,7 @@ * [Null](#IncompleteInformationTypesNull) * [Missing](#IncompleteInformationTypesMissing) * [Derived Types](#DerivedTypes) - * [Record](#DerivedTypesRecord) + * [Object](#DerivedTypesObject) * [Array](#DerivedTypesArray) * [Multiset](#DerivedTypesMultiset) @@ -350,12 +350,12 @@ For example, a user might not be able to know the value of a field and let it be ### <a id="IncompleteInformationTypesMissing">Missing</a> ### -`missing` represents a missing name-value pair in a record. +`missing` represents a missing name-value pair in a object. If the referenced field does not exist, an empty result value is returned by the query. As neither the data model nor the system enforces homogeneity for datasets or collections, items in a dataset or collection can be of heterogeneous types and -so a field can be present in one record and `missing` in another. +so a field can be present in one object and `missing` in another. * Example: @@ -366,12 +366,12 @@ so a field can be present in one record and `missing` in another. { } -Since a field with value `missing` means the field is absent, we get an empty record. +Since a field with value `missing` means the field is absent, we get an empty object. ## <a id="DerivedTypes">Derived Types</a> ## -### <a id="DerivedTypesRecord">Record</a>### -A `record` contains a set of fields, where each field is described by its name and type. A record type is either open or closed. Open records can contain fields that are not part of the type definition, while closed records cannot. 
Syntactically, record constructors are surrounded by curly braces "{...}". +### <a id="DerivedTypesObject">Object</a>### +A `object` contains a set of fields, where each field is described by its name and type. A object type is either open or closed. Open objects can contain fields that are not part of the type definition, while closed objects cannot. Syntactically, object constructors are surrounded by curly braces "{...}". An example would be http://git-wip-us.apache.org/repos/asf/asterixdb/blob/10351a74/asterixdb/asterix-doc/src/site/markdown/feeds/tutorial.md ---------------------------------------------------------------------- diff --git a/asterixdb/asterix-doc/src/site/markdown/feeds/tutorial.md b/asterixdb/asterix-doc/src/site/markdown/feeds/tutorial.md index fb06a92..948c7aa 100644 --- a/asterixdb/asterix-doc/src/site/markdown/feeds/tutorial.md +++ b/asterixdb/asterix-doc/src/site/markdown/feeds/tutorial.md @@ -34,7 +34,7 @@ used to live outside, and we show how it improves users' lives and system perfor ## <a name="FeedAdaptors">Feed Adaptors</a> ## The functionality of establishing a connection with a data source -and receiving, parsing and translating its data into ADM records +and receiving, parsing and translating its data into ADM objects (for storage inside AsterixDB) is contained in a feed adaptor. A feed adaptor is an implementation of an interface and its details are specific to a given data source. An adaptor may optionally be given @@ -54,7 +54,7 @@ to ingest data that is directed at a prescribed socket. In this tutorial, we shall describe building two example data ingestion pipelines that cover the popular scenario of ingesting data from (a) Twitter and (b) RSS Feed source. -####Ingesting Twitter Stream +####Ingesting Twitter Stream We shall use the built-in push-based Twitter adaptor. As a pre-requisite, we must define a Tweet using the AsterixDB Data Model (ADM) and the AsterixDB Query Language (AQL). 
Given below are the type definition in AQL that create a Tweet datatype which is representative of a real tweet as obtained from Twitter. @@ -229,17 +229,17 @@ policy that is expressed as a collection of parameters and associated values. An ingestion policy dictates the runtime behavior of the feed in response to resource bottlenecks and failures. AsterixDB provides a list of policy parameters that help customize the -system's runtime behavior when handling excess records. AsterixDB +system's runtime behavior when handling excess objects. AsterixDB provides a set of built-in policies, each constructed by setting appropriate value(s) for the policy parameter(s) from the table below. -####Policy Parameters +####Policy Parameters -- *excess.records.spill*: Set to true if records that cannot be processed by an operator for lack of resources (referred to as excess records hereafter) should be persisted to the local disk for deferred processing. (Default: false) +- *excess.records.spill*: Set to true if objects that cannot be processed by an operator for lack of resources (referred to as excess objects hereafter) should be persisted to the local disk for deferred processing. (Default: false) -- *excess.records.discard*: Set to true if excess records should be discarded. (Default: false) +- *excess.records.discard*: Set to true if excess objects should be discarded. (Default: false) -- *excess.records.throttle*: Set to true if rate of arrival of records is required to be reduced in an adaptive manner to prevent having any excess records (Default: false) +- *excess.records.throttle*: Set to true if rate of arrival of objects is required to be reduced in an adaptive manner to prevent having any excess objects (Default: false) - *excess.records.elastic*: Set to true if the system should attempt to resolve resource bottlenecks by re-structuring and/or rescheduling the feed ingestion pipeline. 
(Default: false) @@ -249,7 +249,7 @@ appropriate value(s) for the policy parameter(s) from the table below. Note that the end user may choose to form a custom policy. For example, it is possible in AsterixDB to create a custom policy that spills excess -records to disk and subsequently resorts to throttling if the +objects to disk and subsequently resorts to throttling if the spillage crosses a configured threshold. In all cases, the desired ingestion policy is specified as part of the `connect feed` statement or else the "Basic" policy will be chosen as the default. http://git-wip-us.apache.org/repos/asf/asterixdb/blob/10351a74/asterixdb/asterix-doc/src/site/markdown/sqlpp/primer-sqlpp.md ---------------------------------------------------------------------- diff --git a/asterixdb/asterix-doc/src/site/markdown/sqlpp/primer-sqlpp.md b/asterixdb/asterix-doc/src/site/markdown/sqlpp/primer-sqlpp.md index af63520..7dc1953 100644 --- a/asterixdb/asterix-doc/src/site/markdown/sqlpp/primer-sqlpp.md +++ b/asterixdb/asterix-doc/src/site/markdown/sqlpp/primer-sqlpp.md @@ -134,12 +134,12 @@ a number of possibilities. The first three lines above tell AsterixDB to drop the old TinySocial dataverse, if one already exists, and then to create a brand new one and make it the focus of the statements that follow. The first _CREATE TYPE_ statement creates a datatype for holding information about Chirp users. -It is a record type with a mix of integer and string data, very much like a (flat) relational tuple. +It is a object type with a mix of integer and string data, very much like a (flat) relational tuple. The indicated fields are all mandatory, but because the type is open, additional fields are welcome. The second statement creates a datatype for Chirp messages; this shows how to specify a closed type. 
Interestingly (based on one of Chirp's APIs), each Chirp message actually embeds an instance of the sending user's information (current as of when the message was sent), so this is an example of a nested -record in ADM. +object in ADM. Chirp messages can optionally contain the sender's location, which is modeled via the senderLocation field of spatial type _point_; the question mark following the field type indicates its optionality. An optional field is like a nullable field in SQL---it may be present or missing, but when it's present, @@ -149,11 +149,11 @@ Lastly, the referredTopics field illustrates another way that ADM is richer than this field holds a bag (*a.k.a.* an unordered list) of strings. Since the overall datatype definition for Chirp messages says "closed", the fields that it lists are the only fields that instances of this type will be allowed to contain. -The next two _CREATE TYPE_ statements create a record type for holding information about one component of -the employment history of a Gleambook user and then a record type for holding the user information itself. +The next two _CREATE TYPE_ statements create a object type for holding information about one component of +the employment history of a Gleambook user and then a object type for holding the user information itself. The Gleambook user type highlights a few additional ADM data model features. Its friendIds field is a bag of integers, presumably the Gleambook user ids for this user's friends, -and its employment field is an ordered list of employment records. +and its employment field is an ordered list of employment objects. The final _CREATE TYPE_ statement defines a type for handling the content of a Gleambook message in our hypothetical social data storage scenario. @@ -242,14 +242,14 @@ referenced in the most recently executed _USE_ directive. Second, they show how to escape SQL++ keywords (or other special names) in object names by using backquotes. 
Last but not least, they show that SQL++ supports a _SELECT VALUE_ variation of SQL's traditional _SELECT_ statement that returns a single value (or element) from a query instead of constructing a new -record as the query's result like _SELECT_ does; here, the returned value is an entire record from +object as the query's result like _SELECT_ does; here, the returned value is an entire object from the dataset being queried (e.g., _SELECT VALUE ds_ in the first statement returns the entire -record from the metadata dataset containing the descriptions of all datasets. +object from the metadata dataset containing the descriptions of all datasets. ## Loading Data Into AsterixDB ## Okay, so far so good---AsterixDB is now ready for data, so let's give it some data to store. Our next task will be to load some sample data into the four datasets that we just defined. -Here we will load a tiny set of records, defined in ADM format (a superset of JSON), into each dataset. +Here we will load a tiny set of objects, defined in ADM format (a superset of JSON), into each dataset. In the boxes below you can see the actual data instances contained in each of the provided sample files. In order to load this data yourself, you should first store the four corresponding `.adm` files (whose URLs are indicated on top of each box below) into a filesystem directory accessible to your @@ -313,7 +313,7 @@ of the data instances will be stored separately from their associated field name {"messageId":14,"authorId":9,"inResponseTo":12,"senderLocation":point("41.33,85.28"),"message":" love at&t its 3G is good:)"} {"messageId":15,"authorId":7,"inResponseTo":11,"senderLocation":point("44.47,67.11"),"message":" like iphone the voicemail-service is awesome"} -It's loading time! We can use SQL++ _LOAD_ statements to populate our datasets with the sample records shown above. +It's loading time! We can use SQL++ _LOAD_ statements to populate our datasets with the sample objects shown above. 
The following shows how loading can be done for data stored in `.adm` files in your local filesystem. *Note:* You _MUST_ replace the `<Host Name>` and `<Absolute File Path>` placeholders in each load statement below with valid values based on the host IP address (or host name) for the machine and @@ -384,7 +384,7 @@ Suppose the user we want is the user whose id is 8: As in SQL, the query's _FROM_ clause binds the variable `user` incrementally to the data instances residing in the dataset named GleambookUsers. Its _WHERE_ clause selects only those bindings having a user id of interest, filtering out the rest. -The _SELECT_ _VALUE_ clause returns the (entire) data value (a Gleambook user record in this case) +The _SELECT_ _VALUE_ clause returns the (entire) data value (a Gleambook user object in this case) for each binding that satisfies the predicate. Since this dataset is indexed on user id (its primary key), this query will be done via a quick index lookup. @@ -442,10 +442,10 @@ We could do this as follows in SQL++: WHERE msg.authorId = user.id; The result of this query is a sequence of new ADM instances, one for each author/message pair. -Each instance in the result will be an ADM record containing two fields, "uname" and "message", +Each instance in the result will be an ADM object containing two fields, "uname" and "message", containing the user's name and the message text, respectively, for each author/message pair. Notice how the use of a traditional SQL-style _SELECT_ clause, as opposed to the new SQL++ _SELECT VALUE_ -clause, automatically results in the construction of a new record value for each result. +clause, automatically results in the construction of a new object value for each result. 
The expected result of this example SQL++ join query for our sample data set is: @@ -473,9 +473,9 @@ If we were feeling lazy, we might use _SELECT *_ in SQL++ to return all of the m FROM GleambookUsers user, GleambookMessages msg WHERE msg.authorId = user.id; -In SQL++, this _SELECT *_ query will produce a new nested record for each user/message pair. -Each result record contains one field (named after the "user" variable) to hold the user record -and another field (named after the "msg" variable) to hold the matching message record. +In SQL++, this _SELECT *_ query will produce a new nested object for each user/message pair. +Each result object contains one field (named after the "user" variable) to hold the user object +and another field (named after the "msg" variable) to hold the matching message object. Note that the nested nature of this SQL++ _SELECT *_ result is different than traditional SQL, as SQL was not designed to handle the richer, nested data model that underlies the design of SQL++. @@ -505,7 +505,7 @@ Finally (for now :-)), another less lazy and more explicit SQL++ way of achievin FROM GleambookUsers user, GleambookMessages msg WHERE msg.authorId = user.id; -This version of the query uses an explicit record constructor to build each result record. +This version of the query uses an explicit object constructor to build each result object. (Note that "uname" and "message" are both simple SQL++ expressions themselves---so in the most general case, even the resulting field names can be computed as part of the query, making SQL++ a very powerful tool for slicing and dicing semistructured data.) @@ -532,7 +532,7 @@ that it should consider employing an index-based nested-loop join technique to p WHERE msg.authorId /*+ indexnl */ = user.id; In addition to illustrating the use of a hint, the query also shows how to achieve the same -result record format using _SELECT_ and _AS_ instead of using an explicit record constructor. 
+result object format using _SELECT_ and _AS_ instead of using an explicit object constructor. The expected result is (of course) the same as before, modulo the order of the instances. Result ordering is (intentionally) undefined in SQL++ in the absence of an _ORDER BY_ clause. The query result for our sample data in this case is: @@ -567,7 +567,7 @@ grouped by customer, without omitting those customers who haven't placed any ord The SQL++ language supports nesting, both of queries and of query results, and the combination allows for an arguably cleaner/more natural approach to such queries. -As an example, supposed we wanted, for each Gleambook user, to produce a record that has his/her name +As an example, supposed we wanted, for each Gleambook user, to produce a object that has his/her name plus a list of the messages written by that user. In SQL, this would involve a left outer join between users and messages, grouping by user, and having the user name repeated along side each message. @@ -582,7 +582,7 @@ In SQL++, this sort of use case can be handled (more naturally) as follows: FROM GleambookUsers user; This SQL++ query binds the variable `user` to the data instances in GleambookUsers; -for each user, it constructs a result record containing a "uname" field with the user's +for each user, it constructs a result object containing a "uname" field with the user's name and a "messages" field with a nested collection of all messages for that user. The nested collection for each user is specified by using a correlated subquery. (Note: While it looks like nested loops could be involved in computing the result, @@ -673,7 +673,7 @@ The expected result for this query against our sample data is: The expressive power of SQL++ includes support for queries involving "some" (existentially quantified) and "all" (universally quantified) query semantics. As an example of an existential SQL++ query, here we show a query to list the Gleambook users who are currently employed. 
-Such employees will have an employment history containing a record in which the end-date field is _MISSING_ +Such employees will have an employment history containing a object in which the end-date field is _MISSING_ (or it could be there but have the value _NULL_, as JSON unfortunately provides two ways to represent unknown values). This leads us to the following SQL++ query: @@ -695,7 +695,7 @@ The expected result in this case is: ### Query 7 - Universal Quantification ### As an example of a universal SQL++ query, here we show a query to list the Gleambook users who are currently unemployed. -Such employees will have an employment history containing no records with unknown end-date field values, leading us to the +Such employees will have an employment history containing no objects with unknown end-date field values, leading us to the following SQL++ query: USE TinySocial; @@ -759,11 +759,11 @@ Thus, due to the _GROUP BY_ clause, the _SELECT_ clause in this query sees a seq with each such group having an associated _uid_ variable value (i.e., the chirping user's screen name). In the context of the _SELECT_ clause, _uid_ is bound to the chirper's id and _cm_ is now re-bound (due to grouping) to the _set_ of chirps issued by that chirper. -The _SELECT_ clause yields a result record containing the chirper's user id and the count of the items +The _SELECT_ clause yields a result object containing the chirper's user id and the count of the items in the associated chirp set. -The query result will contain one such record per screen name. +The query result will contain one such object per screen name. This query also illustrates another feature of SQL++; notice how each user's screen name is accessed via a -path syntax that traverses each chirp's nested record structure. +path syntax that traverses each chirp's nested object structure. 
Here is the expected result for this query over the sample data: @@ -835,7 +835,7 @@ finds all of the chirps that are similar based on the topics that they refer to: This query illustrates several things worth knowing in order to write fuzzy queries in SQL++. First, as mentioned earlier, SQL++ offers an operator-based syntax (as well as a functional approach, not shown) for seeing whether two values are "similar" to one another or not. -Second, recall that the referredTopics field of records of datatype ChirpMessageType is a bag of strings. +Second, recall that the referredTopics field of objects of datatype ChirpMessageType is a bag of strings. This query sets the context for its similarity join by requesting that Jaccard-based similarity semantics ([http://en.wikipedia.org/wiki/Jaccard_index](http://en.wikipedia.org/wiki/Jaccard_index)) be used for the query's similarity operator and that a similarity index of 0.3 be used as its similarity threshold. @@ -884,7 +884,7 @@ have all gone up in the interim, although he appears not to have moved in the la In general, the data to be inserted may be specified using any valid SQL++ query expression. The insertion of a single object instance, as in this example, is just a special case where -the query expression happens to be a record constructor involving only constants. +the query expression happens to be a object constructor involving only constants. ### Deleting Existing Data ### In addition to inserting new data, AsterixDB supports deletion from datasets via the SQL++ _DELETE_ statement. @@ -898,16 +898,16 @@ The following example deletes the chirp that we just added from user "NathanGies It should be noted that one form of data change not yet supported by AsterixDB is in-place data modification (_update_). Currently, only insert and delete operations are supported in SQL++; updates are not. 
-To achieve the effect of an update, two SQL++ statements are currently needed---one to delete the old record from the -dataset where it resides, and another to insert the new replacement record (with the same primary key but with +To achieve the effect of an update, two SQL++ statements are currently needed---one to delete the old object from the +dataset where it resides, and another to insert the new replacement object (with the same primary key but with different field values for some of the associated data content). -AQL additionally supports an upsert operation to either insert a record, if no record with its primary key is currently -present in the dataset, or to replace the existing record if one already exists with the primary key value being upserted. +AQL additionally supports an upsert operation to either insert a object, if no object with its primary key is currently +present in the dataset, or to replace the existing object if one already exists with the primary key value being upserted. SQL++ will soon have _UPSERT_ as well. ### Transaction Support -AsterixDB supports record-level ACID transactions that begin and terminate implicitly for each record inserted, deleted, or searched while a given SQL++ statement is being executed. This is quite similar to the level of transaction support found in today's NoSQL stores. AsterixDB does not support multi-statement transactions, and in fact an SQL++ statement that involves multiple records can itself involve multiple independent record-level transactions. An example consequence of this is that, when an SQL++ statement attempts to insert 1000 records, it is possible that the first 800 records could end up being committed while the remaining 200 records fail to be inserted. This situation could happen, for example, if a duplicate key exception occurs as the 801st insertion is attempted. 
If this happens, AsterixDB will report the error (e.g., a duplicate key exception) as the result of the offending SQL++ _INSERT_ statement, and the application logic above will need to take the appropriate action(s) needed to assess the resulting state and to clean up and/or continue as appropriate. +AsterixDB supports object-level ACID transactions that begin and terminate implicitly for each object inserted, deleted, or searched while a given SQL++ statement is being executed. This is quite similar to the level of transaction support found in today's NoSQL stores. AsterixDB does not support multi-statement transactions, and in fact an SQL++ statement that involves multiple objects can itself involve multiple independent object-level transactions. An example consequence of this is that, when an SQL++ statement attempts to insert 1000 objects, it is possible that the first 800 objects could end up being committed while the remaining 200 objects fail to be inserted. This situation could happen, for example, if a duplicate key exception occurs as the 801st insertion is attempted. If this happens, AsterixDB will report the error (e.g., a duplicate key exception) as the result of the offending SQL++ _INSERT_ statement, and the application logic above will need to take the appropriate action(s) needed to assess the resulting state and to clean up and/or continue as appropriate. ## Further Help ## That's it! You are now armed and dangerous with respect to semistructured data management using AsterixDB via SQL++. 
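Editorial sketch (not part of the commit): the primer text in this hunk says that an update currently takes two SQL++ statements, a delete followed by an insert of the replacement object with the same primary key. A minimal illustration of that pattern is given below. It assumes the TinySocial `ChirpMessages` dataset from the primer, keyed on `chirpId`, with the closed `ChirpMessageType` fields as recalled from the primer; the dataset name, field names, and every value shown are assumptions for illustration and should be checked against your own schema.

```sql
USE TinySocial;

/* Step 1: delete the old version of the object.
   Assumes chirpId is the primary key of ChirpMessages. */
DELETE FROM ChirpMessages cm
WHERE cm.chirpId = "13";

/* Step 2: insert the replacement object under the same primary key.
   All field values are illustrative placeholders. */
INSERT INTO ChirpMessages
(
   {"chirpId": "13",
    "user":
        {"screenName": "ExampleUser@1",
         "lang": "en",
         "friendsCount": 10,
         "statusesCount": 20,
         "name": "Example User",
         "followersCount": 30
        },
    "senderLocation": point("47.44,80.65"),
    "sendTime": datetime("2008-04-26T10:10:35"),
    "referredTopics": {{"chirping"}},
    "messageText": "replacement text for the updated chirp"
   }
);
```

Because AsterixDB does not support multi-statement transactions, the two statements are not atomic as a pair: if the insert fails after the delete has succeeded, the old object is gone, and the application has to detect that and retry or repair.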
http://git-wip-us.apache.org/repos/asf/asterixdb/blob/10351a74/asterixdb/asterix-doc/src/site/markdown/udf.md ---------------------------------------------------------------------- diff --git a/asterixdb/asterix-doc/src/site/markdown/udf.md b/asterixdb/asterix-doc/src/site/markdown/udf.md index 0e1db87..b2ef2bc 100644 --- a/asterixdb/asterix-doc/src/site/markdown/udf.md +++ b/asterixdb/asterix-doc/src/site/markdown/udf.md @@ -78,12 +78,12 @@ Our library is now installed and is ready to be used. In the following we assume that you already created the `TwitterFeed` and its corresponding data types and dataset following the instruction explained in the [feeds tutorial](feeds/tutorial.html). A feed definition may optionally include the specification of a -user-defined function that is to be applied to each feed record prior +user-defined function that is to be applied to each feed object prior to persistence. Examples of pre-processing might include adding -attributes, filtering out records, sampling, sentiment analysis, feature +attributes, filtering out objects, sampling, sentiment analysis, feature extraction, etc. We can express a UDF, which can be defined in AQL or in a programming language such as Java, to perform such pre-processing. An AQL UDF is a good fit when -pre-processing a record requires the result of a query (join or aggregate) +pre-processing a object requires the result of a query (join or aggregate) over data contained in AsterixDB datasets. More sophisticated processing such as sentiment analysis of text is better handled by providing a Java UDF. A Java UDF has an initialization phase @@ -145,9 +145,9 @@ could provide data for multiple applications. To achieve this, we introduce the notion of primary and secondary feeds in AsterixDB. A feed in AsterixDB is considered to be a primary feed if it gets -its data from an external data source. The records contained in a +its data from an external data source. 
The objects contained in a feed (subsequent to any pre-processing) are directed to a designated -AsterixDB dataset. Alternatively or additionally, these records can +AsterixDB dataset. Alternatively or additionally, these objects can be used to derive other feeds known as secondary feeds. A secondary feed is similar to its parent feed in every other aspect; it can have an associated UDF to allow for any subsequent processing, @@ -167,7 +167,7 @@ respective parent feed (TwitterFeed). connect feed ProcessedTwitterFeed to dataset ProcessedTweets; -The `addHashTags` function is already provided in the example UDF.To see what records +The `addHashTags` function is already provided in the example UDF.To see what objects are being inserted into the dataset, we can perform a simple dataset scan after allowing a few moments for the feed to start ingesting data: http://git-wip-us.apache.org/repos/asf/asterixdb/blob/10351a74/asterixdb/asterix-lang-common/src/main/java/org/apache/asterix/lang/common/util/CommonFunctionMapUtil.java ---------------------------------------------------------------------- diff --git a/asterixdb/asterix-lang-common/src/main/java/org/apache/asterix/lang/common/util/CommonFunctionMapUtil.java b/asterixdb/asterix-lang-common/src/main/java/org/apache/asterix/lang/common/util/CommonFunctionMapUtil.java index ea373cb..f8b2050 100644 --- a/asterixdb/asterix-lang-common/src/main/java/org/apache/asterix/lang/common/util/CommonFunctionMapUtil.java +++ b/asterixdb/asterix-lang-common/src/main/java/org/apache/asterix/lang/common/util/CommonFunctionMapUtil.java @@ -63,8 +63,16 @@ public class CommonFunctionMapUtil { FUNCTION_NAME_MAP.put("isobject", "is-object"); // isobject, internal: is-object FUNCTION_NAME_MAP.put("isobj", "is-object"); // isobj, internal: is-object - // Record functions. 
-        FUNCTION_NAME_MAP.put("object_pairs", "record-pairs"); // object_pairs, internal: record-pairs
+        // Object functions
+        FUNCTION_NAME_MAP.put("record-merge", "object-merge"); // record-merge, internal: object-merge
+        // record-get-fields, internal: object-get-fields
+        FUNCTION_NAME_MAP.put("record-get-fields", "object-get-fields");
+        // record-get-field-value, internal: object-get-field-value
+        FUNCTION_NAME_MAP.put("record-get-field-value", "object-get-field-value");
+        // record-add-fields, internal: object-add-fields
+        FUNCTION_NAME_MAP.put("record-add-fields", "object-add-fields");
+        // record-remove-fields, internal: object-remove-fields
+        FUNCTION_NAME_MAP.put("record-remove-fields", "object-remove-fields");
     }
     private CommonFunctionMapUtil() {

http://git-wip-us.apache.org/repos/asf/asterixdb/blob/10351a74/asterixdb/asterix-om/src/main/java/org/apache/asterix/om/functions/AsterixBuiltinFunctions.java
----------------------------------------------------------------------
diff --git a/asterixdb/asterix-om/src/main/java/org/apache/asterix/om/functions/AsterixBuiltinFunctions.java b/asterixdb/asterix-om/src/main/java/org/apache/asterix/om/functions/AsterixBuiltinFunctions.java
index 29d6e88..66581c4 100644
--- a/asterixdb/asterix-om/src/main/java/org/apache/asterix/om/functions/AsterixBuiltinFunctions.java
+++ b/asterixdb/asterix-om/src/main/java/org/apache/asterix/om/functions/AsterixBuiltinFunctions.java
@@ -167,18 +167,18 @@ public class AsterixBuiltinFunctions {
     public static final FunctionIdentifier DEEP_EQUAL = new FunctionIdentifier(FunctionConstants.ASTERIX_NS,
             "deep-equal", 2);
-    // records
+    // objects
     public static final FunctionIdentifier RECORD_MERGE = new FunctionIdentifier(FunctionConstants.ASTERIX_NS,
-            "record-merge", 2);
+            "object-merge", 2);
     public static final FunctionIdentifier REMOVE_FIELDS = new FunctionIdentifier(FunctionConstants.ASTERIX_NS,
-            "record-remove-fields", 2);
+            "object-remove-fields", 2);
     public static final FunctionIdentifier ADD_FIELDS = new FunctionIdentifier(FunctionConstants.ASTERIX_NS,
-            "record-add-fields", 2);
+            "object-add-fields", 2);
     public static final FunctionIdentifier CLOSED_RECORD_CONSTRUCTOR = new FunctionIdentifier(
-            FunctionConstants.ASTERIX_NS, "closed-record-constructor", FunctionIdentifier.VARARGS);
+            FunctionConstants.ASTERIX_NS, "closed-object-constructor", FunctionIdentifier.VARARGS);
     public static final FunctionIdentifier OPEN_RECORD_CONSTRUCTOR = new FunctionIdentifier(
-            FunctionConstants.ASTERIX_NS, "open-record-constructor", FunctionIdentifier.VARARGS);
+            FunctionConstants.ASTERIX_NS, "open-object-constructor", FunctionIdentifier.VARARGS);
     public static final FunctionIdentifier FIELD_ACCESS_BY_INDEX = new FunctionIdentifier(FunctionConstants.ASTERIX_NS,
             "field-access-by-index", 2);
     public static final FunctionIdentifier FIELD_ACCESS_BY_NAME = new FunctionIdentifier(FunctionConstants.ASTERIX_NS,
@@ -186,11 +186,11 @@ public class AsterixBuiltinFunctions {
     public static final FunctionIdentifier FIELD_ACCESS_NESTED = new FunctionIdentifier(FunctionConstants.ASTERIX_NS,
             "field-access-nested", 2);
     public static final FunctionIdentifier GET_RECORD_FIELDS = new FunctionIdentifier(FunctionConstants.ASTERIX_NS,
-            "get-record-fields", 1);
+            "get-object-fields", 1);
     public static final FunctionIdentifier GET_RECORD_FIELD_VALUE = new FunctionIdentifier(FunctionConstants.ASTERIX_NS,
-            "get-record-field-value", 2);
+            "get-object-field-value", 2);
     public static final FunctionIdentifier RECORD_PAIRS = new FunctionIdentifier(FunctionConstants.ASTERIX_NS,
-            "record-pairs", FunctionIdentifier.VARARGS);
+            "object-pairs", FunctionIdentifier.VARARGS);
     // numeric
     public static final FunctionIdentifier NUMERIC_UNARY_MINUS = new FunctionIdentifier(FunctionConstants.ASTERIX_NS,
@@ -646,7 +646,7 @@ public class AsterixBuiltinFunctions {
     public static final FunctionIdentifier INJECT_FAILURE = new FunctionIdentifier(FunctionConstants.ASTERIX_NS,
             "inject-failure", 2);
     public static final FunctionIdentifier FLOW_RECORD = new FunctionIdentifier(FunctionConstants.ASTERIX_NS,
-            "flow-record", 1);
+            "flow-object", 1);
     public static final FunctionIdentifier CAST_TYPE = new FunctionIdentifier(FunctionConstants.ASTERIX_NS,
             "cast", 1);
@@ -1055,7 +1055,7 @@ public class AsterixBuiltinFunctions {
         addPrivateFunction(UNORDERED_LIST_CONSTRUCTOR, UnorderedListConstructorTypeComputer.INSTANCE, true);
         addFunction(WORD_TOKENS, OrderedListOfAStringTypeComputer.INSTANCE, true);
-        // records
+        // objects
         addFunction(RECORD_MERGE, RecordMergeTypeComputer.INSTANCE, true);
         addFunction(ADD_FIELDS, RecordAddFieldsTypeComputer.INSTANCE, true);
         addFunction(REMOVE_FIELDS, RecordRemoveFieldsTypeComputer.INSTANCE, true);