[ 
https://issues.apache.org/jira/browse/DRILL-2265?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Parth Chandra updated DRILL-2265:
---------------------------------
    Fix Version/s:     (was: 0.9.0)
                   Future

> Drill data exploration function for complex data types
> ------------------------------------------------------
>
>                 Key: DRILL-2265
>                 URL: https://issues.apache.org/jira/browse/DRILL-2265
>             Project: Apache Drill
>          Issue Type: Improvement
>          Components: Functions - Drill
>            Reporter: Andries Engelbrecht
>            Assignee: Daniel Barclay (Drill)
>             Fix For: Future
>
>
> Drill data exploration function for complex data types
> When dealing with large volumes of complex data, it would be extremely useful 
> to have a function that collects metadata to provide a better view of the 
> total data set.
> Taking JSON as an example, a data set can contain an extremely large number 
> of JSON objects. Each object can have multiple schemas and subschemas, with 
> multiple nested subschemas as well as arrays. Not all objects will have all 
> of the schemas or subschemas. When exploring this data in Drill, SQL dot 
> notation is used to navigate the complex subschema structure, and it can 
> become very cumbersome to fully understand the total picture of all the data.
> The function should explore the JSON objects in a data set (whether a single 
> file with multiple objects, or a single-level or multilevel directory 
> structure) and provide the total structure of all the JSON objects, showing 
> every schema, subschema and array that is available across them. This way a 
> data analyst will be able to see all the schema data that is available within 
> the data set. Additionally, if the function can provide statistics showing 
> how many of the objects actually contain each of the schemas, subschemas and 
> arrays (and data in each), this may indicate to an analyst how valuable or 
> important it may be to explore any subschema or array.
> To speed up the collection of this data, the function may include an option 
> to set a sample size, so that only a portion of the total volume is sampled 
> and the results are projected onto the total data set. This is a very common 
> operation in prominent RDBMS systems today. Additionally, for data that 
> changes or grows, the metadata collection function will need to be run 
> periodically to update the statistics.
> To make the metadata more useful, consideration should be given to placing 
> the results in a Drill metadata structure, similar to INFORMATION_SCHEMA, but 
> specifically for statistics metadata to be used only by analysts for data 
> exploration. Security should also be designed so that access is allowed only 
> to users with access to the base data.
> In addition to their use by data analysts for data exploration, the metadata 
> and statistics can also be used for Drill internal functions in the future, 
> such as query optimization and creation of views.
> This example specifically focuses on JSON data, but the same approach can 
> be applied to other complex data types that may require a very detailed 
> understanding of the complex data set.
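
To make the request above more concrete, here is a minimal sketch in Python of the kind of output such a function might produce. It is not Drill code and does not use any Drill API. It assumes newline-delimited JSON input, and the names `iter_paths`, `profile`, `sample_rate` and `seed` are hypothetical, chosen only for illustration. It walks each object, records every dot-notation path (marking arrays with `[]`), counts how many objects contain each path, and optionally samples a fraction of the objects and scales the counts up to the full data set.

```python
import json
import random
from collections import Counter


def iter_paths(value, prefix=""):
    """Yield the dot-notation path of every field and array found in value."""
    if isinstance(value, dict):
        for key, child in value.items():
            path = f"{prefix}.{key}" if prefix else key
            yield path
            yield from iter_paths(child, path)
    elif isinstance(value, list):
        array_path = f"{prefix}[]"
        yield array_path  # marks that this path holds an array
        for item in value:
            yield from iter_paths(item, array_path)


def profile(lines, sample_rate=1.0, seed=0):
    """Count how many JSON objects contain each path.

    lines: iterable of strings, one JSON object per line (an assumption).
    sample_rate: fraction of objects to inspect; 1.0 inspects all of them.
    Returns {path: {"sampled": n, "estimated_total": m}}, sorted by path.
    """
    rng = random.Random(seed)
    counts = Counter()
    total = 0
    sampled = 0
    for line in lines:
        line = line.strip()
        if not line:
            continue
        total += 1
        if sample_rate < 1.0 and rng.random() >= sample_rate:
            continue  # object skipped by sampling
        sampled += 1
        # A set ensures each path is counted once per object.
        counts.update(set(iter_paths(json.loads(line))))
    # Project the sampled counts onto the whole data set.
    scale = total / sampled if sampled else 0.0
    return {
        path: {"sampled": n, "estimated_total": round(n * scale)}
        for path, n in sorted(counts.items())
    }


if __name__ == "__main__":
    records = [
        '{"user": {"name": "a", "address": {"city": "x"}}, "tags": ["p", "q"]}',
        '{"user": {"name": "b"}}',
        '{"items": [{"sku": 1}, {"sku": 2}]}',
    ]
    for path, stats in profile(records).items():
        print(path, stats)
```

A real implementation inside Drill would presumably run distributed across the directory structure and store its results in a metadata structure as the issue suggests; the sketch only illustrates the shape of the per-path statistics and the sampling projection.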



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
