john-bodley opened a new issue #7656: [SIP-21] Proposal for inferring temporal formatting and parsing
URL: https://github.com/apache/incubator-superset/issues/7656

## [SIP-21] Proposal for inferring temporal formatting and parsing

### Motivation

In [SIP-15](https://github.com/apache/incubator-superset/issues/6360) we surfaced examples where lexicographical sorting could result in incorrect time intervals. Initially the proposal was to ensure that all temporal fields were cast to a timestamp and that filter comparisons were between the appropriate timestamp (or similar) types, i.e.,

    DATE_PARSE(ds, '%Y-%m-%d') >= TIMESTAMP '2018-01-01 00:00:00'

Additionally, if all temporal fields were transformed to a timestamp, the time grains would work correctly (the current logic assumes that said type can be successfully cast to a timestamp).

@mistercrunch correctly pointed out that not all query planners can rewrite these filters to take advantage of indexes (if present), resulting in sub-performant queries. Hence the status quo: the _left-hand side_ (LHS) of the filter comparison remains unchanged and the _right-hand side_ (RHS) formats the Python `datetime` object appropriately, i.e.,

    ds >= '2018-01-01'

There are two problems with the current functionality:

1. One must _explicitly_ define the format of the temporal column for non date/date-time like types, i.e., strings and numbers, as the latter support multiple temporal encodings, e.g. `YYYY-MM-DD`, `YYYY-DD-MM`, epoch timestamp (in seconds), epoch timestamp (in milliseconds), etc.
2. The time grains incorrectly assume that the temporal column can simply be cast to a timestamp (or equivalent) type.

Really this can be seen as two conversions:

1. Converting a Python `datetime` object into the appropriate database type.
2. Converting a database type into a timestamp (or equivalent), which is necessary for the time grain transformations.
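The two conversions can be sketched as follows. This is a minimal illustration, not Superset code: the helper names are hypothetical, and Presto's `DATE_PARSE` is used only as an example target expression.

```python
from datetime import datetime


def datetime_to_column_value(dttm: datetime, python_date_format: str) -> str:
    """Conversion 1: render a Python datetime as a literal comparable to
    the column, e.g. for the RHS of a filter predicate."""
    return dttm.strftime(python_date_format)


def column_to_timestamp_expr(col: str, fmt: str) -> str:
    """Conversion 2: wrap the column in a SQL expression yielding a
    timestamp (Presto syntax shown for illustration)."""
    return f"DATE_PARSE({col}, '{fmt}')"


print(datetime_to_column_value(datetime(2018, 1, 1), "%Y-%m-%d"))
# 2018-01-01
print(column_to_timestamp_expr("ds", "%Y-%m-%d"))
# DATE_PARSE(ds, '%Y-%m-%d')
```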
Note that `convert_dttm` already handles both of these for date/date-time like types, so the problem really lies with string-like and numeric types which have a temporal encoding.

### Proposed Change

We propose the following to address the two problems outlined above:

#### Format Inferencing

Rather than having to explicitly define the format for all non date/date-time temporal columns, it would be great to infer it from a sample. There are a few Python libraries (`arrow`, `dateutil`, `maya`, etc.) which can parse date-time strings without an a priori defined format, i.e.,

    >>> from dateutil.parser import parse
    >>> parse('2019-05-31')
    datetime.datetime(2019, 5, 31, 0, 0)

Sadly none of these libraries will provide the underlying format, though there are [ways](https://stackoverflow.com/questions/53892450/get-the-format-in-dateutil-parse) of inferring it. Note that a single value could be expressed by multiple formats.

Epoch timestamps can be encoded as integers (representing seconds or milliseconds) or floats. To differentiate between these one could use basic logic (which generally works) like [www.epochconverter.com](https://www.epochconverter.com) does: if there are fewer than 12 digits the timestamp is assumed to be in seconds, otherwise in milliseconds (see [here](https://stackoverflow.com/questions/23929145/how-to-test-if-a-given-time-stamp-is-in-seconds-or-milliseconds) for details).

We propose the following solution:

1. Add a database-specific column name/type mapping for non-explicit temporal columns, i.e., we use the `ds` VARCHAR column to represent a date-stamp.
2. Whenever a SQLAlchemy column is:
   - marked as temporal and the column type is not explicitly a temporal type
   - and the column name/type isn't in the mapping or is non-temporal

   we sample the column (using say 100 values) and try to infer the best format.
3.
Rather than using a free-form text box, the `python_date_format` field should become a selector listing the various formats, with either the mapped or the best-inferred format pre-selected. This allows users to override the format if the inferencing was incorrect.

Why a drop-down and not a pre-populated free-form text box? Mostly because not all formats can be represented by Python's `strftime` [formats](https://docs.python.org/2/library/datetime.html#strftime-strptime-behavior), i.e., one cannot express an epoch timestamp (`%s` only works for seconds and should be avoided as it is time-zone specific).

#### Time Grains

Given that the format of the type is already inferred, we simply need to provide, at the engine level in [db_engine_specs.py](https://github.com/apache/incubator-superset/blob/master/superset/db_engine_specs.py), functionality (by way of a SQL expression) to map from a string or numeric type to a timestamp, which will then be wrapped inside the SQL expressions representing the `time_grain_functions`. For example, for a date encoded as a string, the Presto engine would use the `date_parse(string, format)` UDF.

There is still an open question about which format encoding we should use, as we postulate that not every database engine uses the same formatting.

### New or Changed Public Interfaces

N/A.

### New dependencies

Depends on what Python package we use for inferring the date-time format.

### Migration Plan and Compatibility

A migration would be needed to:

1. Add support for the database-specific column name/type mapping.
2. Re-encode the `table_columns.python_date_format` column.

### Rejected Alternatives

See SIP-15.

to: @betodealmeida @mistercrunch @michellethomas @villebro
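As a closing illustration of the two mechanisms proposed above, the epoch-digit heuristic and the time-grain wrapping can be sketched as follows. This is a hypothetical sketch: the function names and the grain template are invented for illustration, and Presto's `date_parse`/`date_trunc` syntax is assumed.

```python
def epoch_unit(value: int) -> str:
    """Fewer than 12 digits -> seconds, otherwise milliseconds
    (the epochconverter.com heuristic described above)."""
    return "seconds" if len(str(abs(value))) < 12 else "milliseconds"


def grain_expr(col: str, fmt: str, grain_template: str) -> str:
    """Wrap a string column in date_parse() so that a time-grain
    expression (here a hypothetical template mirroring
    time_grain_functions) operates on a proper timestamp."""
    return grain_template.format(col=f"date_parse({col}, '{fmt}')")


print(epoch_unit(1559260800))       # seconds (10 digits)
print(epoch_unit(1559260800000))    # milliseconds (13 digits)
print(grain_expr("ds", "%Y-%m-%d", "date_trunc('month', {col})"))
# date_trunc('month', date_parse(ds, '%Y-%m-%d'))
```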
