[jira] [Commented] (DRILL-7567) Metastore enhancements

Vova Vysotskyi (Jira) Tue, 04 Feb 2020 07:33:57 -0800


    [ 
https://issues.apache.org/jira/browse/DRILL-7567?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17029938#comment-17029938
 ]


Vova Vysotskyi commented on DRILL-7567:
---------------------------------------

+1 regarding providing a schema for Metastore. It will be implemented soon in 
the scope of DRILL-7477.

> Metastore enhancements
> ----------------------
>
>                 Key: DRILL-7567
>                 URL: https://issues.apache.org/jira/browse/DRILL-7567
>             Project: Apache Drill
>          Issue Type: Improvement
>            Reporter: Paul Rogers
>            Priority: Major
>
> The Metastore feature shipped as a Beta. Review of the documentation 
> identified a number of opportunities for improvement before the feature 
> leaves Beta.
> * Should the Metastore be configured in its own file? Does this push us in 
> the direction of each feature having its own set of config files? Or, should 
> config move into the normal Drill config files?
> * Provide a detailed schema and description of Metadata entities, like the 
> Hive metadata schema.
> * Provide an out-of-the-box sample Metastore for some of Drill's demo tables.
> * Provide a Metastore tutorial. Refer to the sample Metastore in the 
> tutorial. Many folks learn best by trying things hands-on.
> * Solve read/write consistency issues to avoid the need for the 
> error/recovery described for {{metastore.metadata.fallback_to_file_metadata}}.
> * Boot-time config is stored in the {{drill.metastore}} namespace. But, 
> Metastore SYSTEM/SESSION options are in the {{drill.exec}} namespace. This is 
> confusing. Let's be consistent.
> * {{drill.exec.storage.implicit.last_modified_time.column.label}} is a bug: 
> Drill internal names should never conflict with user-defined column names. 
> Figure out where they conflict and fix the issue. No user can ever guarantee that 
> some name will never be used in their tables. Nor can users easily fix the 
> issue if it occurs. (Note: this is a flaw with our implicit columns as well.)
> * Provide a form of ANALYZE TABLE that automatically reuses settings from any 
> previous run. Otherwise, it will be very unfriendly for the user to have 
> to find a place to store the ANALYZE TABLE command so that they can submit 
> exactly the same one each time. In fact, experience with Impala suggests that 
> end users will have no idea about schema; they just want the latest metadata. 
> Such users won't even know the details of a command some other user might 
> have submitted.
> * The Iceberg metastore requires atomic rename. But, the most common use case 
> for Drill today is the cloud. S3 does not support atomic rename. We need to 
> fix this.
> * The documentation says we use the "plugin name" as part of the table key. 
> But, for DFS, say, the user can have dozens of plugin configs, each with a 
> distinct name. Each can reuse the same workspace name of, say, "foo". Thus 
> "dfs/foo" is ambiguous. But "hdfs1/foo" and "local/foo" are unique if we 
> use storage plugin config names.
> * It is not clear if the Iceberg metastore supports HDFS security and 
> Kerberos tickets. If not, then it won't work in a production deployment.
> * The metastore is meant to store schema. A key use is when schema is 
> ambiguous. But, metastore gathers schema the same way that Drill queries 
> tables. If schema is ambiguous, the ANALYZE TABLE will fail. Thus we do not 
> actually solve the ambiguous schema problem. We need a solution.
> * Better partition support. Drill has a long-standing usability issue that 
> users must do their own partition coding. If I want data from 2018-11 through 
> 2019-01 (one quarter's worth of data), I have to write the very ugly
> {code:sql}
> WHERE (dir0 = 2018 AND dir1 >= 11)
>         OR (dir0 = 2019 AND dir1 <= 1)
> {code}
> With Hive/Impala/Presto I can just write:
> {code:sql}
> WHERE transDate BETWEEN '2018-11-01' AND '2019-01-31'
> {code}
> * Allow staged gathering of stats. Allow me to first gather stats and review 
> them for quality before I have my users start using them. As it is, there is 
> no ability to gather them, enable the option for a session for testing, 
> verify that things work right, then turn it on for everyone. That is, in a 
> shared system, all heck can break loose in the current implementation.
> * Review the internal Metastore tables. See many comments about the structure 
> in the Metastore documentation PR.
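
To illustrate the partition-coding burden raised in the description above, here is a minimal Python sketch of the kind of helper a user currently has to write by hand. It assumes a year/month directory layout exposed as dir0 and dir1, as in the quoted example. The function name month_range_predicate is hypothetical and is not part of Drill.

```python
def month_range_predicate(start, end):
    """Build a WHERE-clause fragment for an inclusive (year, month) range.

    Assumption: data is laid out as <year>/<month>, so Drill exposes the
    year as dir0 and the month as dir1. This is an illustrative helper,
    not a Drill API.
    """
    (y1, m1), (y2, m2) = start, end
    if (y1, m1) > (y2, m2):
        raise ValueError("start must not be after end")

    # Range within a single year: one conjunction is enough.
    if y1 == y2:
        return f"(dir0 = {y1} AND dir1 >= {m1} AND dir1 <= {m2})"

    # First (partial) year.
    parts = [f"(dir0 = {y1} AND dir1 >= {m1})"]
    # Any whole years strictly between the two endpoints.
    if y2 - y1 > 1:
        parts.append(f"(dir0 > {y1} AND dir0 < {y2})")
    # Last (partial) year.
    parts.append(f"(dir0 = {y2} AND dir1 <= {m2})")
    return " OR ".join(parts)


print("WHERE " + month_range_predicate((2018, 11), (2019, 1)))
```

For the range 2018-11 through 2019-01 this reproduces the hand-written predicate from the description. The fact that a helper like this is needed at all is the usability gap that better partition support would close.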



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
