[SIP] Proposal for ensuring Superset accurately represents the underlying data warehouse

## Motivation

Over time a discrepancy can develop between the actual metadata of a Druid datasource or SQLA table and the corresponding Superset representation. Columns or metrics can be added, updated, or deleted, and thus creep grows between the two systems over time. This proposal is to add functionality to better ensure consistency between Superset and the underlying data warehouses, which should help boost user confidence.

## Proposed Change

Currently there is a mechanism to refresh Druid metadata per the following menu 
item:

![screen shot 2018-09-07 at 10 48 34 
am](https://user-images.githubusercontent.com/4567245/45234785-9d0eef00-b28b-11e8-961b-0de4daf961ba.png)

thus the plan would be to also include an option for refreshing SQLA table metadata. We could optionally also provide a mechanism to scan for new SQLA tables, though we should be aware that the scale of datasources (and their corresponding columns/metrics) could negatively impact the performance of Superset.
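
For illustration, here's a minimal sketch of what such a scan could look like using SQLAlchemy's inspector; the `known_tables` set and the registration step are placeholders rather than Superset's actual API:

```python
# Sketch: discover tables present in the warehouse but unknown to Superset.
from sqlalchemy import create_engine, inspect

engine = create_engine("postgresql://user:pass@host/db")  # example DSN
inspector = inspect(engine)

# Hypothetical: the set of table names Superset has already registered.
known_tables = {"sales", "users"}
warehouse_tables = set(inspector.get_table_names(schema="public"))

for table_name in sorted(warehouse_tables - known_tables):
    # Registration with Superset's metadata store is elided here.
    print(f"Discovered new table: {table_name}")
```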

### Refreshing

Regarding the term "refreshing", I propose the following be implemented for both Druid and SQLA datasources.

#### Columns/Metrics

- New entities are added
- Existing entities are updated
- Obsolete entities are deleted†

†Note derived entities which reference an obsolete entity will also be deleted.
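
As a rough sketch of this refresh step for a SQLA table, the diff could be computed via SQLAlchemy's inspector; `superset_columns` here is a hypothetical stand-in for Superset's stored column metadata, not the real model:

```python
# Sketch: diff warehouse columns against Superset's stored column metadata.
from sqlalchemy import create_engine, inspect

engine = create_engine("postgresql://user:pass@host/db")  # example DSN
inspector = inspect(engine)

# Hypothetical stored metadata: column name -> type string.
superset_columns = {"id": "INTEGER", "amount": "FLOAT"}
warehouse_columns = {
    col["name"]: str(col["type"]) for col in inspector.get_columns("sales")
}

to_add = warehouse_columns.keys() - superset_columns.keys()
to_delete = superset_columns.keys() - warehouse_columns.keys()  # cascades to derived entities
to_update = {
    name
    for name in warehouse_columns.keys() & superset_columns.keys()
    if warehouse_columns[name] != superset_columns[name]
}
```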

#### Datasources

An unanswered question remains about what should happen when a Druid or SQLA datasource is defined in Superset but no longer resides in the underlying data warehouse. One school of thought is that if the underlying Druid datasource or SQLA table no longer exists, we should:

- Delete all slices which reference the datasource
- Delete all empty dashboards (if appropriate)†
- Delete the datasource

†Note I'm unsure what the current logic is for having a dashboard with no 
slices.
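
To make the discussion concrete, here's a minimal sketch of that destructive path using hypothetical ORM models (`Slice`, dashboards via `slc.dashboards`, and the datasource) that only approximate Superset's schema:

```python
# Sketch: cascade-delete a datasource, its slices, and newly empty dashboards.
def delete_datasource_cascade(session, datasource):
    slices = session.query(Slice).filter_by(datasource_id=datasource.id).all()
    doomed = set(slices)
    for slc in slices:
        for dash in slc.dashboards:
            # If every slice on the dashboard is being deleted, the dashboard
            # would be left empty, so delete it too (if appropriate).
            if set(dash.slices) <= doomed:
                session.delete(dash)
        session.delete(slc)
    session.delete(datasource)
    session.commit()
```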

The concern with this approach is that it is a fairly destructive process, i.e., significant IP is potentially lost when one deletes a slice. Would it make more sense for the underlying institution to control when/how datasources are deleted? For example, one could define a policy that if said datasource has been deleted (and not restored) for _n_ consecutive days, then it's probably safe to delete it from Superset.
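
Such a policy could be as simple as the following sketch, which assumes a hypothetical `missing_since` timestamp recorded on the datasource the first time a refresh finds it gone:

```python
# Sketch: only delete a datasource after it has been missing for n days.
from datetime import datetime, timedelta

GRACE_PERIOD = timedelta(days=7)  # n = 7; set by institutional policy

def handle_missing_datasource(session, datasource, now=None):
    now = now or datetime.utcnow()
    if datasource.missing_since is None:
        # First time the refresh notices it's gone: start the clock.
        datasource.missing_since = now
        session.commit()
    elif now - datasource.missing_since >= GRACE_PERIOD:
        delete_datasource_cascade(session, datasource)  # from the sketch above

# A later refresh that finds the datasource restored would reset
# `missing_since` to None before the grace period lapses.
```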

## New dependencies

The following existing PRs are required, which ensure uniqueness at the datasource/column/metric level:

- https://github.com/apache/incubator-superset/pull/5445
- https://github.com/apache/incubator-superset/pull/5449
- https://github.com/apache/incubator-superset/pull/5451
- https://github.com/apache/incubator-superset/pull/5452
- https://github.com/apache/incubator-superset/pull/5453

## Open Questions

1. Should we delete datasources which no longer exist in the data warehouse?
2. How do we deal with custom SQL definitions when the underlying datasource 
changes? I suspect this is out-of-scope.

to: @betodealmeida @michellethomas @mistercrunch @timifasubaa  
