Chris Mathews created DRILL-3866:
------------------------------------

             Summary: Parquet schema details not being utilised for metadata information
                 Key: DRILL-3866
                 URL: https://issues.apache.org/jira/browse/DRILL-3866
             Project: Apache Drill
          Issue Type: Improvement
          Components: Metadata, Storage - Parquet
    Affects Versions: 1.1.0, 1.2.0
         Environment: CentOS release 6.3 (Final)
Java jdk1.7.0_79
apache-drill-1.1.0
apache-hadoop-2.7.1
apache-zookeeper-3.4.6
            Reporter: Chris Mathews


To access Parquet files from Tableau, Drill must be configured with an 
individual view for each Parquet schema, with every column cast to a specific 
data type. Without these views Tableau cannot read the data correctly or, for 
that matter, even see the list of available tables.
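
As an illustration of the per-column casting described above, here is a minimal sketch that generates such a view definition. The workspace, view name, file path, and column list are all hypothetical, and the helper {{build_view_ddl}} is not part of Drill; it only shows the shape of the statement that currently has to be written by hand for each schema.

```python
def build_view_ddl(view_name, table_path, columns):
    """Return a CREATE VIEW statement that casts every column explicitly.

    columns: list of (column_name, sql_type) pairs, in output order.
    """
    if not columns:
        raise ValueError("at least one column is required")
    # One CAST per column, so the view exposes fixed types to the client tool.
    select_list = ",\n".join(
        f"  CAST(`{name}` AS {sql_type}) AS `{name}`" for name, sql_type in columns
    )
    return (
        f"CREATE OR REPLACE VIEW {view_name} AS\n"
        f"SELECT\n{select_list}\n"
        f"FROM {table_path}"
    )


# Hypothetical schema, workspace, and file path, for illustration only.
ddl = build_view_ddl(
    "dfs.tmp.`orders_view`",
    "dfs.`/data/orders.parquet`",
    [("order_id", "BIGINT"), ("customer", "VARCHAR(64)"), ("total", "DOUBLE")],
)
print(ddl)
```

Every type in that column list is something the Parquet file already records, which is why repeating it in a view feels redundant.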

This requirement is understandable for file formats that do not persist schema 
information, since Drill does not know the data types of any fields until the 
query is executed. But why is it necessary for Parquet files?

We defined Avro schemas for each Parquet file in the AvroParquetWriter phase, 
and the Parquet files store the schema as part of the data. Couldn't Drill 
leverage the information from the footers and make it available to reporting 
tools?
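
For context, the footer mentioned above sits at a fixed position in every Parquet file: the file ends with the serialized file metadata (which includes the schema), a 4-byte little-endian length of that metadata, and the 4-byte magic {{PAR1}}. The sketch below only locates the footer and does not decode the schema itself. The helper name {{footer_length}} and the demo file are hypothetical, and the demo file is a synthetic stand-in rather than real Parquet data.

```python
import os
import struct
import tempfile

MAGIC = b"PAR1"


def footer_length(path):
    """Return the byte length of the file metadata stored in the footer."""
    if os.path.getsize(path) < 12:  # leading magic + length field + trailing magic
        raise ValueError("file too small to be a Parquet file")
    with open(path, "rb") as f:
        f.seek(-8, os.SEEK_END)
        tail = f.read(8)
    if tail[4:] != MAGIC:
        raise ValueError("not a Parquet file: trailing magic missing")
    # The 4 bytes before the trailing magic hold the metadata length.
    return struct.unpack("<I", tail[:4])[0]


# Synthetic stand-in with the same framing as a Parquet file (not real data).
fake_metadata = b"x" * 37
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(MAGIC + b"row-group-bytes" + fake_metadata)
    tmp.write(struct.pack("<I", len(fake_metadata)) + MAGIC)
    demo_path = tmp.name

length = footer_length(demo_path)
os.remove(demo_path)
print(length)
```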

Also, as part of these investigations, some Parquet files were created using 
CTAS. The directory is created and the files contain the data, but the tables 
are not displayed when we run a {{SHOW TABLES}} command. Shouldn't the 
metadata also be available for these tables?

I understand that with the new *{{REFRESH TABLE METADATA}}* feature Drill 
collects all the information from the Parquet footers and stores it in a cache 
file, but even in this case Drill does not seem to leverage this information to 
provide metadata to reporting tools such as Tableau.

I know there have been discussions around this in the past but I could not find 
a Jira for this specific use-case.

_My thanks to *Rahul Challapalli of MapR Technologies* for his help here._




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
