[
https://issues.apache.org/jira/browse/DRILL-2265?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Parth Chandra updated DRILL-2265:
---------------------------------
Fix Version/s: 0.9.0
> Drill data exploration function for complex data types
> ------------------------------------------------------
>
> Key: DRILL-2265
> URL: https://issues.apache.org/jira/browse/DRILL-2265
> Project: Apache Drill
> Issue Type: Improvement
> Components: Functions - Drill
> Reporter: Andries Engelbrecht
> Assignee: Daniel Barclay (Drill/MapR)
> Fix For: 0.9.0
>
>
> Drill data exploration function for complex data types
> When dealing with complex data in large volumes it will be extremely useful
> to have a function to collect metadata to provide a better view of the total
> data set.
> If JSON is used as an example a data set can have an extremely large volume
> of JSON objects. Each object can have multiple schemas and subschemas with
> multiple nested subschemas as well as arrays. Not all objects will have all
> of the schemas or subschemas. When exploring this data in Drill a SQL dot
> notation is used to navigate the complex subschema structure, and it can
> become very cumbersome to fully understand the total picture of all the data.
> A function could explore the JSON objects in a data set (whether a single
> file with multiple objects, or a single-level or multilevel directory
> structure) and provide the total structure of all the JSON objects, showing
> every schema, subschema and array available across them. This way a
> data analyst will be able to see all the schema data that is available
> within the data set. Additionally, if the function can provide statistics
> showing how many of the objects actually contain each of the
> schemas, subschemas and arrays (and data in each), this may indicate to an
> analyst how valuable or important it may be to explore any subschema or array.
> To speed up the collection of this data, the function may contain an option
> to set a sample size, so that only a portion of the total volume is sampled
> and the results are projected to the total data set. This is a very common
> operation in prominent RDBMSs today. Additionally, for data that changes or
> grows, the metadata collection function will need to be run periodically to
> update the statistics.
> To make the metadata more useful, consider placing the results in a Drill
> metadata structure similar to INFORMATION_SCHEMA, but dedicated to
> statistics metadata for analysts to use in data exploration. Security
> controls should also be designed so that only users with access to the base
> data can access this metadata.
> In addition to supporting data analysts and data exploration, the metadata
> and statistics can also be used for Drill internal functions in the future,
> such as query optimization and creation of views.
> This example specifically focusses on JSON data, but can similarly be applied
> to other complex data types that may require a very detailed understanding of
> the complex data set.
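
To illustrate the kind of output the description asks for, here is a minimal
sketch in Python. It does not use any Drill API and is not the proposed
implementation; the names iter_paths and explore, the sample_rate parameter
and the "[]" array marker are all hypothetical. It assumes the input is
newline-delimited JSON (one object per line), walks each object, records
every dot-notation path it finds, and counts how many sampled objects
contain each path.

```python
import json
import random
from collections import Counter


def iter_paths(value, prefix=""):
    """Yield every dot-notation path found in one parsed JSON value."""
    if isinstance(value, dict):
        for key, child in value.items():
            path = f"{prefix}.{key}" if prefix else key
            yield path
            yield from iter_paths(child, path)
    elif isinstance(value, list):
        # Hypothetical convention: arrays are marked with a trailing "[]".
        path = f"{prefix}[]"
        yield path
        for item in value:
            yield from iter_paths(item, path)


def explore(lines, sample_rate=1.0, seed=0):
    """Return (objects_sampled, Counter mapping path -> objects containing it).

    Assumes one JSON object per line. sample_rate is the probability of
    keeping each non-blank line, so 1.0 scans everything.
    """
    rng = random.Random(seed)
    counts = Counter()
    sampled = 0
    for line in lines:
        line = line.strip()
        if not line or rng.random() >= sample_rate:
            continue
        sampled += 1
        # A set ensures each path is counted at most once per object.
        counts.update(set(iter_paths(json.loads(line))))
    return sampled, counts


if __name__ == "__main__":
    data = [
        '{"user": {"name": "a", "tags": ["x", "y"]}}',
        '{"user": {"name": "b"}}',
        '{"id": 1}',
    ]
    total, stats = explore(data)
    for path in sorted(stats):
        print(f"{path}: {stats[path]} of {total} objects")
```

Dividing each count by the number of sampled objects gives the fraction of
objects containing that path, which is the figure one would project to the
full data set when sample_rate is below 1.0.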