Chris Mathews created DRILL-3866:
------------------------------------
Summary: Parquet schema details not being utilised for metadata information
Key: DRILL-3866
URL: https://issues.apache.org/jira/browse/DRILL-3866
Project: Apache Drill
Issue Type: Improvement
Components: Metadata, Storage - Parquet
Affects Versions: 1.1.0, 1.2.0
Environment: CentOS release 6.3 (Final)
Java jdk1.7.0_79
apache-drill-1.1.0
apache-hadoop-2.7.1
apache-zookeeper-3.4.6
Reporter: Chris Mathews
To access Parquet files from Tableau, Drill must be configured with an
individual view for each Parquet schema, with every column cast to a specific
data type, before Tableau can read the data correctly or, for that matter, even
see the list of available tables.
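For illustration, the kind of view that currently has to be maintained looks
roughly like the sketch below. The workspace {{dfs.tmp}}, the path
{{/data/orders}}, the view name, and the column names and types are all
hypothetical and only stand in for a real schema.
{code:sql}
-- Hypothetical example: one view per Parquet schema, every column cast explicitly
CREATE OR REPLACE VIEW dfs.tmp.orders_v AS
SELECT CAST(order_id AS INT)         AS order_id,
       CAST(customer AS VARCHAR(64)) AS customer,
       CAST(amount   AS DOUBLE)      AS amount
FROM dfs.`/data/orders`;
{code}
Each type in the casts repeats information that is already recorded in the
Parquet footer.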
Understandably, this is necessary for other file formats that do not persist
schema information, since Drill does not know the data types of any fields
until the query is executed, but why is it needed for Parquet files?
We define an Avro schema for each Parquet file in the AvroParquetWriter phase,
and the Parquet files store that schema as part of the data. Couldn't Drill
leverage the information from the footers and make it available to reporting
tools?
Also, as part of these investigations, some Parquet files were created using
CTAS. The directory is created and the files contain the data, but the tables
are not displayed when we run a {{SHOW TABLES}} command. Shouldn't the metadata
also be available for these tables?
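A minimal sketch of the sequence, assuming {{dfs.tmp}} is a writable workspace;
the table name and source path are hypothetical stand-ins:
{code:sql}
-- Hypothetical names; dfs.tmp is assumed to be a writable workspace
USE dfs.tmp;
ALTER SESSION SET `store.format` = 'parquet';

CREATE TABLE orders_ctas AS
SELECT * FROM dfs.`/data/orders`;

SHOW TABLES;  -- orders_ctas does not appear in this list
SHOW FILES;   -- the directory written by CTAS should be visible here
{code}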
I understand that with the new *{{REFRESH TABLE METADATA}}* feature Drill
collects all the information from the Parquet footers and stores it in a cache
file, but even in this case Drill does not seem to leverage this information to
provide metadata to reporting tools such as Tableau.
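For reference, a sketch of building the cache and then checking what is exposed
as metadata. The path and the schema name {{dfs.tmp}} are hypothetical, and
querying {{INFORMATION_SCHEMA}} is only one assumed way a client might look up
column types:
{code:sql}
-- Hypothetical path: build the Parquet metadata cache for a directory
REFRESH TABLE METADATA dfs.`/data/orders`;

-- Check which tables and column types are visible as metadata
SELECT TABLE_NAME, COLUMN_NAME, DATA_TYPE
FROM INFORMATION_SCHEMA.`COLUMNS`
WHERE TABLE_SCHEMA = 'dfs.tmp';
{code}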
I know there have been discussions around this in the past but I could not find
a Jira for this specific use-case.
_My thanks to *Rahul Challapalli of MapR Technologies* for his help here._