[
https://issues.apache.org/jira/browse/DRILL-6552?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16626098#comment-16626098
]
Vitalii Diravka commented on DRILL-6552:
----------------------------------------
We are considering implementing sampling to allow queries of this kind:
{code:sql}
ANALYZE TABLE employees ESTIMATE STATISTICS SAMPLE 100 ROWS;
{code}
However, I think sampling isn't reliable, and it is difficult to use properly
in the planning stage for all cases. It could be considered an optional feature.
NDV (number of distinct values) and histograms, on the other hand, look very
interesting.
It looks like Hive supports only histogram UDFs for now; storing histograms in
the Hive Metastore is considered a future enhancement:
https://issues.apache.org/jira/browse/HIVE-3526
[https://cwiki.apache.org/confluence/display/Hive/Cost-based+optimization+in+Hive#Cost-basedoptimizationinHive-STATS]
For now, in Drill they can be stored in the way [~vvysotskyi] suggested.
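To make the two statistics above concrete, here is a minimal Python sketch of an exact NDV and an equi-width histogram for one numeric column. The function name {{column_stats}}, the bucket count and the sample values are assumptions chosen for illustration only; this is not Drill or Hive code, and a real implementation would more likely use an approximate NDV sketch rather than an exact set.

```python
def column_stats(values, num_buckets=4):
    """Return exact NDV and an equi-width histogram for a numeric column.

    Hypothetical helper for illustration; NULLs (None) are ignored.
    """
    non_null = [v for v in values if v is not None]
    if not non_null:
        return {"ndv": 0, "min": None, "max": None, "histogram": [0] * num_buckets}

    ndv = len(set(non_null))  # exact number of distinct values
    lo, hi = min(non_null), max(non_null)
    # Fall back to width 1 when all values are equal, to avoid dividing by zero.
    width = (hi - lo) / num_buckets or 1

    counts = [0] * num_buckets
    for v in non_null:
        # The maximum value would land one past the end, so clamp it
        # into the last bucket.
        idx = min(int((v - lo) / width), num_buckets - 1)
        counts[idx] += 1
    return {"ndv": ndv, "min": lo, "max": hi, "histogram": counts}


# Made-up column values, including one NULL.
salaries = [30, 30, 45, 50, 50, 50, 70, 110, None]
stats = column_stats(salaries)
```

With statistics like these, a planner could estimate the selectivity of an equality predicate as roughly 1/NDV and of a range predicate from the bucket counts.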
{quote}An even better approach is to keep track of which files have been
scanned and avoid scanning them again. (At least on HDFS and S3, files are
immutable.)
{quote}
We can use {{COMPUTE INCREMENTAL STATISTICS}} to update metadata for only newly
added files and partitions:
{code:sql}
ANALYZE TABLE employees COMPUTE INCREMENTAL STATISTICS;
{code}
Impala does something similar.
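The idea of remembering which files were already scanned can be sketched as follows. This is only an illustration under stated assumptions: {{incremental_analyze}}, {{table_row_count}}, the dictionary used as the statistics store and the file names are all hypothetical, and the per-file "scan" is faked with a lookup table instead of reading real files.

```python
def incremental_analyze(all_files, stats_store, compute_file_stats):
    """Compute statistics only for files that are not in stats_store yet.

    stats_store maps file path -> per-file stats (hypothetical layout).
    Returns the list of files that were newly scanned.
    """
    new_files = [f for f in sorted(all_files) if f not in stats_store]
    for path in new_files:
        stats_store[path] = compute_file_stats(path)
    return new_files


def table_row_count(stats_store):
    """Row counts are additive, so the table-level value is a plain sum."""
    return sum(s["row_count"] for s in stats_store.values())


# Fake per-file row counts stand in for actually reading the files.
fake_row_counts = {
    "employees/part-0.parquet": 100,
    "employees/part-1.parquet": 250,
    "employees/part-2.parquet": 75,
}
scanned = []  # records every file that was really "scanned"


def compute_file_stats(path):
    scanned.append(path)
    return {"row_count": fake_row_counts[path]}


store = {}
# First run: only two files exist.
first_run = incremental_analyze(
    ["employees/part-0.parquet", "employees/part-1.parquet"],
    store,
    compute_file_stats,
)
# Second run: a third file has arrived; only that file is scanned.
second_run = incremental_analyze(list(fake_row_counts), store, compute_file_stats)
total_rows = table_row_count(store)
```

This relies on files being immutable, as noted in the quote above. Row counts merge by simple addition, but NDV does not, so per-file NDV would need a mergeable representation to be combined at the table level.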
{quote}Would really be cool to scan files as they arrive, though this is beyond
the scope of Drill, would need some daemon that is triggered by file arrival in
HDFS.
{quote}
Possibly this can be considered a future enhancement of the Drill Metastore, or,
perhaps more reasonably, it could be implemented in the Hive Metastore by
scheduling ANALYZE TABLE commands:
[https://www.qubole.com/blog/automatic-statistics-collection-better-query-performance/]
> Drill Metadata management "Drill MetaStore"
> -------------------------------------------
>
> Key: DRILL-6552
> URL: https://issues.apache.org/jira/browse/DRILL-6552
> Project: Apache Drill
> Issue Type: New Feature
> Components: Metadata
> Affects Versions: 1.13.0
> Reporter: Vitalii Diravka
> Assignee: Vitalii Diravka
> Priority: Major
> Fix For: 2.0.0
>
>
> It would be useful for Drill to have some sort of metastore which would
> enable Drill to remember previously defined schemata so Drill doesn’t have to
> do the same work over and over again.
> It allows storing schema and statistics, which will speed up query
> validation, planning and execution. It also increases the stability
> of Drill and helps avoid different kinds of issues: "schema change
> Exceptions", "limit 0" optimization and so on.
> One of the main candidates is Hive Metastore.
> Starting from version 3.0, Hive Metastore can run as a service separate from
> Hive server:
> [https://cwiki.apache.org/confluence/display/Hive/AdminManual+Metastore+3.0+Administration]
> An optional enhancement is storing Drill's profiles, UDFs and plugin configs
> in some kind of metastore as well.
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)