Andries Engelbrecht created DRILL-2265:
------------------------------------------
Summary: Drill data exploration function for complex data types
Key: DRILL-2265
URL: https://issues.apache.org/jira/browse/DRILL-2265
Project: Apache Drill
Issue Type: Improvement
Components: Functions - Drill
Reporter: Andries Engelbrecht
Assignee: Daniel Barclay (Drill/MapR)
Drill data exploration function for complex data types
When dealing with complex data in large volumes it will be extremely useful to
have a function to collect metadata to provide a better view of the total data
set.
Taking JSON as an example, a data set can contain an extremely large number of
JSON objects. Each object can have multiple schemas and subschemas, with
further nested subschemas as well as arrays, and not every object will have
every schema or subschema. When exploring this data in Drill, SQL dot notation
is used to navigate the complex subschema structure, and it can become very
cumbersome to fully understand the total picture of all the data.
A function is needed that can explore the JSON objects in a data set (whether
a single file with multiple objects, or a single-level or multilevel directory
structure) and report the combined structure of all the JSON objects, showing
every schema, subschema and array that occurs in any of them. This way a data
analyst will be able to see all the schema data that is available within the
data set.
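As a rough illustration of the kind of output meant here, the sketch below
walks newline-delimited JSON objects and merges every dotted path it finds
into one combined structure. It is plain Python, not a Drill function; the
names collect_paths and merged_structure and the "[]" marker for arrays are
assumptions made for this example only.

```python
import json

def collect_paths(value, prefix=""):
    """Return the set of dotted paths found in one parsed JSON value."""
    paths = set()
    if isinstance(value, dict):
        for key, child in value.items():
            path = f"{prefix}.{key}" if prefix else key
            paths.add(path)
            paths |= collect_paths(child, path)
    elif isinstance(value, list):
        # Hypothetical convention: "[]" marks an array level in the path.
        array_path = prefix + "[]"
        paths.add(array_path)
        for item in value:
            paths |= collect_paths(item, array_path)
    return paths

def merged_structure(json_lines):
    """Union the paths of every JSON object, one object per input line."""
    structure = set()
    for line in json_lines:
        if line.strip():
            structure |= collect_paths(json.loads(line))
    return sorted(structure)

# Two objects with different, partially overlapping subschemas.
records = [
    '{"user": {"name": "a", "tags": ["x", "y"]}}',
    '{"user": {"geo": {"lat": 1.0}}, "retweets": 3}',
]
for path in merged_structure(records):
    print(path)
```

Neither object contains all seven paths that are printed, which is the
"total structure" view described above.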
Additionally, if the function can provide statistics showing how many of the
objects actually contain each of the schemas, subschemas and arrays (and data
in each), this may indicate to an analyst how valuable or important it may be
to explore any given subschema or array.
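The statistics described above could look something like the following
sketch, which counts how many objects contain each path and what fraction of
the data set that represents. Again this is plain Python under assumed names
(iter_paths, path_coverage), not an existing Drill API, and it counts only
the presence of a path, not whether the value holds data.

```python
from collections import Counter

def iter_paths(value, prefix=""):
    """Yield the dotted path of every key in one parsed JSON value."""
    if isinstance(value, dict):
        for key, child in value.items():
            path = f"{prefix}.{key}" if prefix else key
            yield path
            yield from iter_paths(child, path)
    elif isinstance(value, list):
        for item in value:
            yield from iter_paths(item, prefix + "[]")

def path_coverage(objects):
    """Map each path to (objects containing it, fraction of all objects)."""
    counts = Counter()
    total = 0
    for obj in objects:
        total += 1
        # set() so that a path repeated inside one object counts once.
        counts.update(set(iter_paths(obj)))
    return {path: (n, n / total) for path, n in counts.items()}

objects = [
    {"user": {"name": "a"}, "geo": None},
    {"user": {}},
    {"geo": {"lat": 1.0}},
]
coverage = path_coverage(objects)
for path in sorted(coverage):
    n, fraction = coverage[path]
    print(f"{path}: {n} objects ({fraction:.0%})")
```

A path that appears in only a small fraction of objects may be a candidate
to skip, while one present in nearly all of them is likely worth exploring.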
To speed up the collection of this data, the function may offer an option to
set a sample size, so that only a portion of the total volume is sampled and
the results are projected onto the total data set. This is a very common
operation in prominent RDBMSs today. Additionally, for data that changes or
grows, the metadata collection function will need to be run periodically to
update the statistics.
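One simple way to sample and project, sketched below, is to inspect each
object with a fixed probability and scale the sampled counts by the inverse
of that probability. This is only an assumed approach for illustration: the
function name estimate_key_counts and its parameters are hypothetical, and
for brevity it counts top-level keys only.

```python
import random
from collections import Counter

def estimate_key_counts(objects, fraction, seed=0):
    """Estimate per-key object counts from a random sample of the data.

    Each object is inspected with probability `fraction`; sampled counts
    are then scaled by 1 / fraction to project onto the full data set.
    """
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    rng = random.Random(seed)  # seeded so repeated runs are comparable
    sampled = Counter()
    for obj in objects:
        if rng.random() < fraction:
            sampled.update(obj.keys())
    return {key: round(n / fraction) for key, n in sampled.items()}

# 1000 objects; every fourth one also carries a "geo" subschema.
data = [{"id": i, **({"geo": {}} if i % 4 == 0 else {})} for i in range(1000)]

exact = estimate_key_counts(data, fraction=1.0)
approx = estimate_key_counts(data, fraction=0.1)
print("exact: ", exact)
print("approx:", approx)
```

With fraction=1.0 every object is inspected and the counts are exact; smaller
fractions trade accuracy for speed, and rare subschemas may be missed
entirely.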
To make the metadata more useful, consideration should be given to placing
the results in a Drill metadata structure, similar to INFORMATION_SCHEMA, but
holding statistics metadata only, for use by analysts in data exploration.
Security should also be designed so that this metadata is accessible only to
users with access to the base data.
In addition to their use by data analysts for data exploration, the metadata
and statistics could also be used by Drill internal functions in the future,
such as query optimization and creation of views.
This example focuses specifically on JSON data, but the same approach can be
applied to other complex data types that require a very detailed
understanding of the data set.