drill into gh-pages

tshiran Fri, 29 May 2015 22:04:00 -0700

Merge branch 'gh-pages' of https://github.com/tshiran/drill into gh-pages


Conflicts:
        _docs/data-sources-and-file-formats/050-json-data-model.md


Project: http://git-wip-us.apache.org/repos/asf/drill/repo
Commit: http://git-wip-us.apache.org/repos/asf/drill/commit/a15a7f11
Tree: http://git-wip-us.apache.org/repos/asf/drill/tree/a15a7f11
Diff: http://git-wip-us.apache.org/repos/asf/drill/diff/a15a7f11

Branch: refs/heads/gh-pages
Commit: a15a7f118d245f4bbec0a4d9aa951c9990ea0ae6
Parents: 446d71c d6f216a
Author: Tomer Shiran <[email protected]>
Authored: Fri May 29 21:33:15 2015 -0700
Committer: Tomer Shiran <[email protected]>
Committed: Fri May 29 21:33:15 2015 -0700

----------------------------------------------------------------------
 _data/docs.json                                 | 112 +++++++--------
 .../020-configuring-drill-memory.md             |   4 +-
 .../040-persistent-configuration-storage.md     |   2 +-
 .../010-connect-a-data-source-introduction.md   |   9 +-
 .../040-file-system-storage-plugin.md           | 105 ++++++++++++++
 _docs/connect-a-data-source/040-workspaces.md   |  76 ----------
 .../050-file-system-storage-plugin.md           |  64 ---------
 _docs/connect-a-data-source/050-workspaces.md   |  33 +++++
 .../connect-a-data-source/100-mapr-db-format.md |   3 +-
 .../050-json-data-model.md                      | 137 +++++++++----------
 .../020-develop-a-simple-function.md            |   4 +-
 .../030-developing-an-aggregate-function.md     |  18 +++
 .../060-custom-function-interfaces.md           |  14 +-
 .../010-apache-drill-contribution-guidelines.md |   2 +-
 .../design-docs/050-value-vectors.md            |   2 +-
 _docs/img/connect-plugin.png                    | Bin 36731 -> 41222 bytes
 .../020-using-jdbc-with-squirrel-on-windows.md  |   2 +-
 .../data-types/010-supported-data-types.md      |   7 +
 .../sql-functions/020-data-type-conversion.md   |  10 +-
 19 files changed, 313 insertions(+), 291 deletions(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/drill/blob/a15a7f11/_docs/data-sources-and-file-formats/050-json-data-model.md
----------------------------------------------------------------------
diff --cc _docs/data-sources-and-file-formats/050-json-data-model.md
index 1b1660d,1bc4cec..0c51757
--- a/_docs/data-sources-and-file-formats/050-json-data-model.md
+++ b/_docs/data-sources-and-file-formats/050-json-data-model.md
@@@ -10,9 -10,9 +10,9 @@@ Drill supports [JSON (JavaScript Objec
  
  Semi-structured JSON data often consists of complex, nested elements having 
schema-less fields that differ type-wise from row to row. The data can 
constantly evolve. Applications typically add and remove fields frequently to 
meet business requirements.
  
--Using Drill you can natively query dynamic JSON data sets using SQL. Drill 
treats a JSON object as a SQL record. One object equals one row in a Drill 
table. 
++Using Drill you can natively query dynamic JSON data sets using SQL. Drill 
treats a JSON object as a SQL record. One object equals one row in a Drill 
table.
  
--You can also [query compressed .gz files]({{ site.baseurl 
}}/docs/drill-default-input-format#querying-compressed-json) having JSON as 
well as uncompressed .json files. 
++You can also [query compressed .gz files]({{ site.baseurl 
}}/docs/drill-default-input-format#querying-compressed-json) having JSON as 
well as uncompressed .json files.
  
  In addition to the examples presented later in this section, see ["How to 
Analyze Highly Dynamic Datasets with Apache 
Drill"](https://www.mapr.com/blog/how-analyze-highly-dynamic-datasets-apache-drill)
 for information about how to analyze a JSON data set.
  
@@@ -21,14 -21,14 +21,14 @@@ JSON data consists of the following typ
  
  * Array: ordered values, separated by commas, enclosed in square brackets
  * Boolean: true or false
--* Number: double-precision floating point number, including exponential 
numbers. No octal, hexadecimal, NaN, or Infinity 
++* Number: double-precision floating point number, including exponential 
numbers. No octal, hexadecimal, NaN, or Infinity
  * null: empty value
  * Object: unordered key/value collection enclosed in curly braces
  * String: Unicode enclosed in double quotation marks
  * Value: a string, number, true, false, null
  * Whitespace: used between tokens
  
--The following table shows SQL-JSON data type mapping: 
++The following table shows SQL-JSON data type mapping:
  
  | SQL Type | JSON Type | Description                                          
                                         |
  
|----------|-----------|-----------------------------------------------------------------------------------------------|
@@@ -58,10 -58,10 +58,10 @@@ When you set this option, Drill reads a
  
  [âQuery Complex Dataâ]({{ site.baseurl 
}}/docs/querying-complex-data-introduction) show how to use [composite 
types]({{site.baseurl}}/docs/supported-data-types/#composite-types) to access 
nested arrays.
  
--Drill uses these types internally for reading complex and nested data 
structures from data sources such as JSON. 
++Drill uses these types internally for reading complex and nested data 
structures from data sources such as JSON.
  
  ## Reading JSON
--To read JSON data using Drill, use a [file system storage plugin]({{ 
site.baseurl }}/docs/file-system-storage-plugin/) that defines the JSON format. 
You can use the `dfs` storage plugin, which includes the definition. 
++To read JSON data using Drill, use a [file system storage plugin]({{ 
site.baseurl }}/docs/file-system-storage-plugin/) that defines the JSON format. 
You can use the `dfs` storage plugin, which includes the definition.
  
  JSON data is often complex. Data can be deeply nested and semi-structured. 
but you can use [workarounds ]({{ site.baseurl 
}}/docs/json-data-model/#limitations-and-workarounds) covered later.
  
@@@ -69,8 -69,8 +69,8 @@@ Drill reads tuples defined in single ob
  
      { name: "Apples", desc: "Delicious" }
      { name: "Oranges", desc: "Florida Navel" }
--    
--To read and [analyze complex JSON]({{ site.baseurl 
}}/docs/json-data-model#analyzing-json) files, use the FLATTEN and KVGEN 
functions. 
++
++To read and [analyze complex JSON]({{ site.baseurl 
}}/docs/json-data-model#analyzing-json) files, use the FLATTEN and KVGEN 
functions.
  
  ## Writing JSON
  You can write data from Drill to a JSON file. The following setup is required:
@@@ -90,7 -90,7 +90,7 @@@
  * Set the output format to JSON. For example:
  
          ALTER SESSION SET `store.format`='json';
--    
++
  * Use the path to the workspace location in a CTAS command. for example:
  
          USE myplugin.myworkspace;
@@@ -98,7 -98,7 +98,7 @@@
          SELECT my column from dfs.`<path_file_name>`;
  
  Drill performs the following actions, as shown in the complete [CTAS command 
example]({{ site.baseurl }}/docs/create-table-as-ctas/):
--   
++
  * Creates a directory using table name.
  * Writes the JSON data to the directory in the workspace location.
  
@@@ -109,11 -109,11 +109,11 @@@ Generally, you query JSON files using t
  * Dot notation to drill down into a JSON map.
  
          SELECT t.level1.level2. . . . leveln FROM <storage plugin 
location>`myfile.json` t
--        
++
  * Use square brackets, array-style notation to drill down into a JSON array.
  
          SELECT t.level1.level2[n][2] FROM <storage plugin 
location>`myfile.json` t;
--    
++
    The first index position of an array is 0.
  
  * Do not use a map, array or repeated scalar type in GROUP BY, ORDER BY or in 
a comparison operator.
@@@ -122,11 -122,11 +122,12 @@@ Drill returns null when a document doe
  
  Using the following techniques, you can query complex, nested JSON:
  
--* Flatten nested data 
++* Flatten nested data
  * Generate key/value pairs for loosely structured data
  
  ## Example: Flatten and Generate Key Values for Complex JSON
- This example uses the following data that represents unit sales of tickets to 
events that were sold over a period of for several days in December:
++
+ This example uses the following data that represents unit sales of tickets to 
events that were sold over a period of several days in December:
  
  ### ticket_sales.json Contents
  
@@@ -150,7 -150,7 +151,7 @@@
          "12-21": 857475
        }
      }
--    
++
  Take a look at the data in Drill:
  
      
+---------+---------+---------------------------------------------------------------+
@@@ -180,7 -180,7 +181,7 @@@ KVGEN allows queries against maps wher
  
  FLATTEN breaks the list of key-value pairs into separate rows on which you 
can apply analytic functions. FLATTEN takes a JSON array, such as the output 
from kvgen(sales), as an argument. Using the all (*) wildcard as the argument 
is not supported and returns an error. The following example continues using 
data from the [previous 
example]({{site.baseurl}}/docs/json-data-model/#example:-flatten-and-generate-key-values-for-complex-json):
  
--    SELECT FLATTEN(kvgen(sales)) Sales 
++    SELECT FLATTEN(kvgen(sales)) Sales
      FROM dfs.`/Users/drilluser/drill/ticket_sales.json`;
  
      +--------------------------------+
@@@ -198,10 -198,10 +199,10 @@@
      8 rows selected (0.171 seconds)
  
  ### Example: Aggregate Loosely Structured Data
--Use flatten and kvgen together to aggregate the data from the [previous 
example]({{site.baseurl}}/docs/json-data-model/#example:-flatten-and-generate-key-values-for-complex-json).
 Make sure all text mode is set to false to sum numbers. Drill returns an error 
if you attempt to sum data in all text mode. 
++Use flatten and kvgen together to aggregate the data from the [previous 
example]({{site.baseurl}}/docs/json-data-model/#example:-flatten-and-generate-key-values-for-complex-json).
 Make sure all text mode is set to false to sum numbers. Drill returns an error 
if you attempt to sum data in all text mode.
  
      ALTER SYSTEM SET `store.json.all_text_mode` = false;
--    
++
  Sum the ticket sales by combining the `SUM`, `FLATTEN`, and `KVGEN` functions 
in a single query.
  
      SELECT SUM(tkt.tot_sales.`value`) AS TicketSold FROM (SELECT 
flatten(kvgen(sales)) tot_sales FROM dfs.`/Users/drilluser/ticket_sales.json`) 
tkt;
@@@ -214,13 -214,13 +215,17 @@@
      1 row selected (0.244 seconds)
  
  ### Example: Aggregate and Sort Data
- Sum the ticket sales by state and group by day and sort in ascending order. 
 -Sum and group the ticket sales by date and sort in ascending order of total 
tickets sold. 
--
--    SELECT `right`(tkt.tot_sales.key,2) `December Date`, 
--    SUM(tkt.tot_sales.`value`) AS TotalSales 
--    FROM (SELECT FLATTEN(kvgen(sales)) tot_sales 
--    FROM dfs.`/Users/drilluser/ticket_sales.json`) tkt 
--    GROUP BY `right`(tkt.tot_sales.key,2) 
++<<<<<<< HEAD
++Sum the ticket sales by state and group by day and sort in ascending order.
++=======
++Sum and group the ticket sales by date and sort in ascending order of total 
tickets sold.
++>>>>>>> d6f216a60b04b5366a3f3905450988597a421118
++
++    SELECT `right`(tkt.tot_sales.key,2) `December Date`,
++    SUM(tkt.tot_sales.`value`) AS TotalSales
++    FROM (SELECT FLATTEN(kvgen(sales)) tot_sales
++    FROM dfs.`/Users/drilluser/ticket_sales.json`) tkt
++    GROUP BY `right`(tkt.tot_sales.key,2)
      ORDER BY TotalSales;
  
      +----------------+-------------+
@@@ -240,30 -240,30 +245,30 @@@ To access a map field in an array, use 
      {
        "type": "FeatureCollection",
        "features": [
--      { 
--        "type": "Feature", 
--        "properties": 
--        { 
--          "MAPBLKLOT": "0001001", 
--          "BLKLOT": "0001001", 
--          "BLOCK_NUM": "0001", 
--          "LOT_NUM": "001", 
--          "FROM_ST": "0", 
--          "TO_ST": "0", 
--          "STREET": "UNKNOWN", 
--          "ST_TYPE": null, 
--          "ODD_EVEN": "E" }, 
--          "geometry": 
--        { 
--            "type": "Polygon", 
--            "coordinates": 
--            [ [ 
--            [ -122.422003528252475, 37.808480096967251, 0.0 ], 
--            [ -122.422076013325281, 37.808835019815085, 0.0 ], 
--            [ -122.421102174348633, 37.808803534992904, 0.0 ], 
--            [ -122.421062569067274, 37.808601056818148, 0.0 ], 
--            [ -122.422003528252475, 37.808480096967251, 0.0 ] 
--            ] ] 
++      {
++        "type": "Feature",
++        "properties":
++        {
++          "MAPBLKLOT": "0001001",
++          "BLKLOT": "0001001",
++          "BLOCK_NUM": "0001",
++          "LOT_NUM": "001",
++          "FROM_ST": "0",
++          "TO_ST": "0",
++          "STREET": "UNKNOWN",
++          "ST_TYPE": null,
++          "ODD_EVEN": "E" },
++          "geometry":
++        {
++            "type": "Polygon",
++            "coordinates":
++            [ [
++            [ -122.422003528252475, 37.808480096967251, 0.0 ],
++            [ -122.422076013325281, 37.808835019815085, 0.0 ],
++            [ -122.421102174348633, 37.808803534992904, 0.0 ],
++            [ -122.421062569067274, 37.808601056818148, 0.0 ],
++            [ -122.422003528252475, 37.808480096967251, 0.0 ]
++            ] ]
          }
        },
      . . .
@@@ -281,7 -281,7 +286,7 @@@ This example shows how to drill down us
  
  To access the second geometry coordinate of the first city lot in the San 
Francisco city lots, use array indexing notation for the coordinates as well as 
the features:
  
--    SELECT features[0].geometry.coordinates[0][1] 
++    SELECT features[0].geometry.coordinates[0][1]
      FROM dfs.`/Users/drilluser/citylots.json`;
      +-------------------+
      |      EXPR$0       |
@@@ -290,10 -290,10 +295,10 @@@
      +-------------------+
      1 row selected (0.19 seconds)
  
--More examples of drilling down into an array are shown in ["Selecting Nested 
Data for a Column"]({{ site.baseurl 
}}/docs/selecting-nested-data-for-a-column). 
++More examples of drilling down into an array are shown in ["Selecting Nested 
Data for a Column"]({{ site.baseurl }}/docs/selecting-nested-data-for-a-column).
  
  ### Example: Flatten an Array of Maps using a Subquery
--By flattening the following JSON file, which contains an array of maps, you 
can evaluate the records of the flattened data. 
++By flattening the following JSON file, which contains an array of maps, you 
can evaluate the records of the flattened data.
  
      {"name":"classic","fillings":[ {"name":"sugar","cal":500} , 
{"name":"flour","cal":300} ] }
  
@@@ -346,8 -346,8 +351,8 @@@ This example uses a WHERE clause to dri
  
  Use dot notation, for example `t.birth.lastname` and `t.birth.bearer.max_hdl` 
to drill down to the nested level:
  
--    SELECT t.birth.lastname AS Name, t.birth.weight AS Weight 
--    FROM dfs.`Users/drilluser/vitalstat.json` t 
++    SELECT t.birth.lastname AS Name, t.birth.weight AS Weight
++    FROM dfs.`Users/drilluser/vitalstat.json` t
      WHERE t.birth.bearer.max_hdl < 160;
  
      +----------------+------------+
@@@ -367,7 -367,7 +372,7 @@@ In most cases, you can use a workaround
  * Complex JSON objects
  * Nested column names
  * Schema changes
--* Selecting all in a JSON directory query 
++* Selecting all in a JSON directory query
  
  ### Array at the root level
  Drill cannot read an array at the root level, outside an object.
@@@ -401,7 -401,7 +406,7 @@@ Workaround: To query n-level nested dat
      }
      . . .
  
--    SELECT dev_id, `date`, `time`, t.user_info.user_id, t.user_info.device, 
t.dev_info.prod_id 
++    SELECT dev_id, `date`, `time`, t.user_info.user_id, t.user_info.device, 
t.dev_info.prod_id
      FROM dfs.`/Users/mypath/example.json` t;
  
  ### Empty array
@@@ -409,7 -409,7 +414,7 @@@ Drill cannot read an empty array, show
  
          { "a":[] }
  
--Workaround: Remove empty arrays. 
++Workaround: Remove empty arrays.
  
  For example, you cannot query the [City Lots San Francisco in 
.json](https://github.com/zemirco/sf-city-lots-json) data unless you make the 
following modification.
  
@@@ -418,7 -418,7 +423,7 @@@
  After removing the extraneous square brackets in the coordinates array, you 
can drill down to query all the data for the lots.
  
  ### Lengthy JSON objects
--Currently, Drill cannot manage lengthy JSON objects, such as a gigabit JSON 
file. Finding the beginning and end of records can be time consuming and 
require scanning the whole file. 
++Currently, Drill cannot manage lengthy JSON objects, such as a gigabit JSON 
file. Finding the beginning and end of records can be time consuming and 
require scanning the whole file.
  
  Workaround: Use a tool to split the JSON file into smaller chunks of 64-128MB 
or 64-256MB initially until you know the total data size and node 
configuration. Keep the JSON objects intact in each file. A distributed file 
system, such as MapR-FS, is recommended over trying to manage file partitions.
  
@@@ -426,13 -426,13 +431,13 @@@
  Complex arrays and maps can be difficult or impossible to query.
  
  Workaround: Separate lengthy objects into objects delimited by curly braces 
using the following functions:
-- 
++
  * [FLATTEN]({{ site.baseurl }}/docs/json-data-model#flatten-json-data) 
separates a set of nested JSON objects into individual rows in a DRILL table.
  
  * [KVGEN]({{ site.baseurl }}/docs/kvgen/) separates objects having more 
elements than optimal for querying.
  
--  
--### Nested Column Names 
++
++### Nested Column Names
  
  You cannot use reserved words for nested column names because Drill returns 
null if you enclose n-level nested column names in back ticks. The previous 
example encloses the date and time column names in back ticks because the names 
are reserved words. The enclosure of column names in back ticks works because 
the date and time columns belong to the first level of the JSON object.
  
@@@ -457,8 -457,8 +462,8 @@@ Drill cannot read JSON files containin
  
  ![drill query flow]({{ site.baseurl }}/docs/img/data-sources-schemachg.png)
  
--Drill interprets numbers that do not have a decimal point as BigInt values. 
In this example, Drill recognizes the first two coordinates as doubles and the 
third coordinate as a BigInt, which causes an error. 
--                
++Drill interprets numbers that do not have a decimal point as BigInt values. 
In this example, Drill recognizes the first two coordinates as doubles and the 
third coordinate as a BigInt, which causes an error.
++
  Workaround: Set the `store.json.read_numbers_as_double` property, described 
earlier, to true.
  
      ALTER SYSTEM SET `store.json.read_numbers_as_double` = true;
@@@ -467,9 -467,9 +472,3 @@@
  Drill currently returns only fields common to all the files in a [directory 
query]({{ site.baseurl }}/docs/querying-directories) that selects all (SELECT 
*) JSON files.
  
  Workaround: Query each file individually.
--
--
--
--
--
--

[23/26] drill git commit: Merge branch 'gh-pages' of https://github.com/tshiran/drill into gh-pages

Reply via email to