john-bodley opened a new issue #7656: [SIP-21] Proposal for inferring temporal formatting and parsing
URL: https://github.com/apache/incubator-superset/issues/7656

## [SIP-21] Proposal for inferring temporal formatting and parsing

### Motivation

In [SIP-15](https://github.com/apache/incubator-superset/issues/6360) we surfaced examples where lexicographical sorting could result in incorrect time intervals. Initially the proposal was to ensure that all temporal fields were cast to a timestamp and that filter comparisons were between the appropriate timestamp (or similar) types, i.e.,

    DATE_PARSE(ds, '%Y-%m-%d') >= TIMESTAMP '2018-01-01 00:00:00'

Additionally, if all temporal fields were transformed to a timestamp, the time grains would work correctly (the current logic assumes that said type can be successfully cast to a timestamp).

@mistercrunch correctly pointed out that not all query planners can rewrite these filters to take advantage of indexes (if present), resulting in sub-performant queries. Hence the status quo: the _left-hand side_ (LHS) of the filter comparison remains unchanged and the _right-hand side_ (RHS) formats the Python `datetime` object appropriately, i.e.,

    ds >= '2018-01-01'

There are two problems with the current functionality:

1. One must _explicitly_ define the format of the temporal column for non date/date-time like types, i.e., strings and numbers, as the latter support multiple temporal encodings, e.g. `YYYY-MM-DD`, `YYYY-DD-MM`, epoch timestamp (in seconds), epoch timestamp (in milliseconds), etc.
2. The time grains incorrectly assume that the temporal column can simply be cast to a timestamp (or equivalent) type.

Really this can be seen as two conversions:

1. Converting a Python `datetime` object into the appropriate database type.
2. Converting a database type into a timestamp (or equivalent), which is necessary for the time grain transformations.
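The two conversions can be sketched as follows. This is a minimal illustration, not Superset code: the helper names are hypothetical, and Presto's `DATE_PARSE` is used only as an example target expression.

```python
from datetime import datetime


def datetime_to_column_value(dttm: datetime, python_date_format: str) -> str:
    """Conversion 1: render a Python datetime as a literal comparable to
    the column, e.g. for the RHS of a filter predicate."""
    return dttm.strftime(python_date_format)


def column_to_timestamp_expr(col: str, fmt: str) -> str:
    """Conversion 2: wrap the column in a SQL expression yielding a
    timestamp (Presto syntax shown for illustration)."""
    return f"DATE_PARSE({col}, '{fmt}')"


print(datetime_to_column_value(datetime(2018, 1, 1), "%Y-%m-%d"))
# 2018-01-01
print(column_to_timestamp_expr("ds", "%Y-%m-%d"))
# DATE_PARSE(ds, '%Y-%m-%d')
```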
Note that `convert_dttm` already handles both of these for date/date-time like types, so the problem really lies with string-like and numeric types which have a temporal encoding.

### Proposed Change

We propose the following to address the two problems outlined above:

#### Format Inferencing

Rather than having to explicitly define the format for all non date/date-time temporal columns, it would be great to infer it from a sample. There are a few Python libraries (`arrow`, `dateutil`, `maya`, etc.) which can parse date-time strings without an a priori defined format, i.e.,

    >>> from dateutil.parser import parse
    >>> parse('2019-05-31')
    datetime.datetime(2019, 5, 31, 0, 0)

Sadly none of these libraries will provide the underlying format, though there are [ways](https://stackoverflow.com/questions/53892450/get-the-format-in-dateutil-parse) of inferring it. Note that a single value could be expressed by multiple formats.

Epoch timestamps can be encoded as integers (representing seconds or milliseconds) or floats. To differentiate between these one could use basic logic (which generally works) like [www.epochconverter.com](https://www.epochconverter.com) does: if there are fewer than 12 digits the timestamp is assumed to be in seconds, otherwise in milliseconds (see [here](https://stackoverflow.com/questions/23929145/how-to-test-if-a-given-time-stamp-is-in-seconds-or-milliseconds) for details).

We propose the following solution:

1. Add a database-specific column name/type mapping for non-explicit temporal columns, i.e., we use the `ds` VARCHAR column to represent a date-stamp.
2. Whenever a SQLAlchemy column is:
   - marked as temporal and the column type is not explicitly a temporal type
   - and the column name/type isn't in the mapping or is non-temporal

   we sample the column (using say 100 values) and try to infer the best format.
3.
Rather than using a free-form text box, the `python_date_format` field should become a selector listing the various formats, with either the mapped or the best-inferred format pre-selected. This allows users to override the format if the inferencing was incorrect.

Why a drop-down and not a pre-populated free-form text box? Mostly because not all formats can be represented by Python's `strftime` [formats](https://docs.python.org/2/library/datetime.html#strftime-strptime-behavior), i.e., one cannot express an epoch timestamp (`%s` only works for seconds and should be avoided as it is time-zone specific).

#### Time Grains

Given that the format of the type is already inferred, we simply need to provide, at the engine level in [db_engine_specs.py](https://github.com/apache/incubator-superset/blob/master/superset/db_engine_specs.py), functionality (by way of a SQL expression) to map from a string or numeric type to a timestamp, which will then be wrapped inside the SQL expressions representing the `time_grain_functions`. For example, for a date encoded as a string, the Presto engine would use the `date_parse(string, format)` UDF.

There is still an open question about which format encoding we should use, as we postulate that not every database engine uses the same formatting.

### New or Changed Public Interfaces

N/A.

### New dependencies

Depends on what Python package we use for inferring the date-time format.

### Migration Plan and Compatibility

A migration would be needed to:

1. Add support for the database-specific column name/type mapping.
2. Re-encode the `table_columns.python_date_format` column.

### Rejected Alternatives

See SIP-15.

to: @betodealmeida @mistercrunch @michellethomas @villebro
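As a closing illustration of the two mechanisms proposed above, the epoch-digit heuristic and the time-grain wrapping can be sketched as follows. This is a hypothetical sketch: the function names and the grain template are invented for illustration, and Presto's `date_parse`/`date_trunc` syntax is assumed.

```python
def epoch_unit(value: int) -> str:
    """Fewer than 12 digits -> seconds, otherwise milliseconds
    (the epochconverter.com heuristic described above)."""
    return "seconds" if len(str(abs(value))) < 12 else "milliseconds"


def grain_expr(col: str, fmt: str, grain_template: str) -> str:
    """Wrap a string column in date_parse() so that a time-grain
    expression (here a hypothetical template mirroring
    time_grain_functions) operates on a proper timestamp."""
    return grain_template.format(col=f"date_parse({col}, '{fmt}')")


print(epoch_unit(1559260800))       # seconds (10 digits)
print(epoch_unit(1559260800000))    # milliseconds (13 digits)
print(grain_expr("ds", "%Y-%m-%d", "date_trunc('month', {col})"))
# date_trunc('month', date_parse(ds, '%Y-%m-%d'))
```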
