john-bodley opened a new issue #7656: [SIP-21] Proposal for inferring temporal 
formatting and parsing
URL: https://github.com/apache/incubator-superset/issues/7656
 
 
   ## [SIP-21] Proposal for inferring temporal formatting and parsing
   
   ### Motivation
   
   In [SIP-15](https://github.com/apache/incubator-superset/issues/6360) we 
surfaced examples were lexicographical sorting could result in incorrect time 
intervals. Initially the proposal was to ensure all temporal fields were cast 
to a timestamp and filter comparisons were between the appropriate timestamp 
(or similar) types, i.e., 
   
       DATE_PARSE(ds, '%Y-%m-%d') >= TIMESTAMP '2018-01-01 00:00:00'
   
   Additionally if all temporal fields were transformed to a timestamp the time 
grains would work correctly (the current logic assumes that said type can be 
successfully cast to a timestamp). 
   
   @mistercrunch correctly pointed out out that not all query planners would be 
able to rewrite these filters to take advantage of indexes (if present) 
resulting in sub-performant queries and thus the status quo of ensuring the 
_left-hand-side_ (LHS) of the filter comparison remains unchanged and that the 
_right-hand-side_ (RHS) formats the Python `datetime` object` appropriately, 
i.e., 
   
       ds >= '2018-01-01'
   
   Currently there are two problems with the current functionality: 
   
   1. One must _explicitly_ define the format of the temporal column for non 
date/date-time like types, i.e., strings and numbers as the later supports 
multiple temporal encodings, e.g. `YYYY-MM-DD`, `YYYY-DD-MM`, epoch timestamp 
(in seconds), epoch timestamp (in milliseconds), etc. 
   
   2. The time grains incorrectly assume that the temporal column can simply be 
cast to timestamp (or equivalent) type.  
   
   Really this can be seen as two conversions: 
   
   1. Converting a Python `datetime` object into the appropriate database type. 
   2. Converting a database type into a time-stamp (or equivalent) which is 
necessary for the time grain transformations. 
   
   Note that the `convert_dttm` handles both of these already for 
date/date-time like types so the problem really lies with string like and 
numeric types which have temporal encoding.
   
   ### Proposed Change
   
   We propose the following to address the two problems outlined above:
   
   #### Format Inferencing 
   
   Rather than having to explicitly define the format for all non 
date/date-time temporal columns it would be great to infer this from a sample. 
There are a few Python libraries (`arrow`, `datetutil`, `maya`, etc.) which can 
parse non a priori defined date-time formats, i.e., 
   
       >>> from dateutil.parser import parse
       >>> parse('2019-05-31')
       datetime.datetime(2019, 5, 31, 0, 0)
   
   Sadly none of these libraries will provide the underlying format though 
there are 
[ways](https://stackoverflow.com/questions/53892450/get-the-format-in-dateutil-parse)
 of inferring it. Note a single value could be expressed by multiple formats.
   
   Regarding epoch timestamps which can be defined via integers (representing 
seconds or milliseconds) or floats, To differentiate between these one could 
use basis logic (which generally works) like 
[www.epochconverter.com](https://www.epochconverter.com) where if there are 
less than 12 digits the timestamp is assumed to be in seconds, otherwise it is 
milliseconds (see 
[here](https://stackoverflow.com/questions/23929145/how-to-test-if-a-given-time-stamp-is-in-seconds-or-milliseconds)
 for detail). 
   
   We propose the following solution: 
   
   1. Add a database specific column name/type mapping for non-explicit 
temporal columns, i.e., we use the `ds` VARCHAR column to represent a 
date-stamp.
   
   2. Whenever a SQLAlchemy column is: 
       - marked as temporal and the column type is not explicitly a temporal 
type 
       - and the column name/type isn't in the mapping or is no-temporal
   
        we sample the column (using say 100 values) and try to infer the best 
format. 
   
   3. Rather than using a free-form text box the `python_date_format` field 
should represent a selector with the various types where either the mapped or 
best format is selected. This allows users to override the format if the 
inferencing was incorrect. Why a drop-down and not pre-populated free-form 
text? Mostly because not all formats can be represented by Python's `strftime` 
[formats](https://docs.python.org/2/library/datetime.html#strftime-strptime-behavior),
 i.e., one cannot express an epoch timestamp (`%s` only works for seconds and 
should be avoided as it's time zone specific).
   
   #### Time Grains 
   
   Given that the format of the type is already inferred we simply need to 
provide at the engine level in 
[db_engine_specs.py](https://github.com/apache/incubator-superset/blob/master/superset/db_engine_specs.py)
 functionality (by ways of a SQL expression) to map from a string or numerical 
type to a time-stamp which will then be wrapped inside of the SQL expressions 
representing the `time_grain_functions`, i.e., for the example of a date 
encoded as a string, for the Presto engine we would use the `date_parse(string, 
format)` UDF.
   
   There is still an open question about which format encoding we should use as 
we postulate that not every database engine uses the same formatting. 
   
   ### New or Changed Public Interfaces
   
   N/A.
   
   ### New dependencies
   
   Depends on what Python package we use for inferring the date-time format.
   
   ### Migration Plan and Compatibility
   
   A migration would be needed to: 
   
   1. Add support for the database specific column name/type mapping. 
   2. Re-encoding the `table_columns.python_date_format` column.
   
   ### Rejected Alternatives
   
   See SIP-15. 
   
   to: @betodealmeida @mistercrunch @michellethomas @villebro 
   

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[email protected]


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to