[ https://issues.apache.org/jira/browse/DRILL-4765?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15365189#comment-15365189 ]
Paul Rogers commented on DRILL-4765: ------------------------------------ Suggestion on solution: for Parquet, the Date logical type declares the meaning of the int32 field. The Parquet reader should do a Parquet-to-Drill conversion step for each field. For dates, that conversion means to change the Parquet date format to Drill's (by converting units and/or 0-point.) Ideally, the solution should be generic so it will work for Parquet interval types as well in the future. Also, Parquet logical types. That is, maybe a table of conversion functions, keyed by Parquet logical type. An advantage of this approach is that we can then easily support the "no-op" Parquet logical types of int_8, int_16, int_32, uint_8, and uint_16. > Missing, incorrect information in Drill data types page > ------------------------------------------------------- > > Key: DRILL-4765 > URL: https://issues.apache.org/jira/browse/DRILL-4765 > Project: Apache Drill > Issue Type: Improvement > Components: Documentation > Affects Versions: 1.6.0 > Reporter: Paul Rogers > Assignee: Bridget Bevens > > Consider the Drill Supported Types page: > https://drill.apache.org/docs/supported-data-types/ > A number of issues can be seen. > For BIGINT, it would be clearer to express the range as: -2^63 to 2^63-1. > For INTEGER, it would be clearer to express the range as: -2^31 to 2^31-1. > DATE: The statement "in YYYY-MM-DD format" is wrong. The internal > representation has no format, it is just a number representing the day count. > The format is applied only on output and varies depending on the tool used. > Perhaps for the Drill web UI it is in ISO format. > DATE: Presumably the date is not time-zone specific. That is, 2016-07-01 is > the first of July in both the US and India, though a given absolute time may > be on two different dates in these locations. > DATE: We use 4713 BC as a 0-point. But, the calendar system has changed many > times since that date. (Indeed, the current system did not even exist on that > date.) Is this a simple projection of the current system back in time, or > does it adjust for the discontinuties in the Gregorian calendar? This should > be stated as it is important for any data files that contain historical > dates. (And is why choosing a 20th-century 0-point would have been better...) > FLOAT, DOUBLE: presumably these are in the standard IEEE Standard 754 format? > If so, let's state that. > INTERVAL: there are many ways that intervals have been represented in DB > systems. Parquet represents data as a triple: months, days and > (milli)seconds. Does Drill use a similar format? If not, what is the format? > A normal DB can declare the interval as part of the data declaration. How > does Drill infer the format? How does the user access the parts of the range? > INTERVAL: the footnote says, "Internally, INTERVAL is represented as > INTERVALDAY or INTERVALYEAR." But, if so, then INTERVAL can't represent a > time interval: a serious limitation. Also, we can't convert a Parquet > Interval to a Drill interval since there is no mapping to Drill that includes > months, days and seconds. This is a huge limitation and should be explained. > SMALLINT: This is a supported types table, but the footnotes say SMALLINT is > not supported. We also do not list the many internal Value Vector types we > don't support (int8, uint8, int16, uint16, uint32 and so on.) Should we list > SMALLINT if we don't actually support it? > TIME: the format is acutally number of seconds since 2001-01-01. The "24-hour > based time ... in hours, minutes, seconds format" confuses display format > with internal representation. See DATE above. > TIME: Presumably the time is in local time, not UTC. That is, the time is > 12:34:56 with the time zone left unspecified. > TIME: The example for TIME is, "22:55:55.23". But, note that the example > shows milliseconds, but the description says the time unit is seconds. Which > is right? > TIME: The example shows just a time (seconds since midnight), but the > description says that this is a timestamp: number of seconds since > 2010-01-01. If so, then is TIME like TIMESTAMP (with a different basis)? Or > is really a time-only value (so that the description is wrong?) > TIMESTAMP: The description says "JDBC timestamp", but this is not accurate. > JDBC is a layer on top of a DB. So, we could say, "JDBC timestamp format". > TIMESTAMP: Explain the basis. A JDBC timestamp > (https://docs.oracle.com/javase/8/docs/api/java/sql/Timestamp.html) expresses > time in nanoseconds since 1970-01-01. So, does Drill also have nanosecond > precision? The docs say, "optional milliseconds", so presumably Drill only > keeps milliseconds, As a result, the Drill timestamp is NOT a JDBC timestamp. > TIMESTAMP: JDBC timestamps are vague. They are based on a Java Date which is > defined as milliseconds since 1970-01-01T00:00:00 UTC. But, it seems a JDBC > timestamp is local (it has no implied timezone). Does Drill assume that a > TIMESTAMP is UTC (like java.util.Date) or local (like java.sql.Date)? > TIMESTAMP & DATE/TIME: We've created an incompatibility between the date & > time format on the one hand, and TIMESTAMP on the other. We should explain > how to convert between the two since it is non-obvious how this would be done > without noodling out the conversion factor. Or, is a Drill timestamp also > based on 2001-01-01 like a Drill date? > TIMESTAMP: "format: yyyy-MM-dd HH:mm:ss.SSS". Again, the timestamp does not > have a format, it is just a count of millis (or nanos, see above.) As > explained above, formatting is done by each tool and can be whatever the user > wants. > CHAR: The description says, "The default limit is 1 character. The maximum > character limit is 2,147,483,647." which seems to apply to the CHAR type: > CHAR(1) to CHAR(2,147,483,647). But, the footnote says, "Currently, Drill > supports only variable-length strings." So, the "default limit..." stuff does > not actually apply, does it? > General: In a full DB, types are important because the user declares columns > of the given type. Thus, I can specify DECIMAL(10,2) or CHAR(5) and it means > something. But, Drill is a query-only engine. So, how are the types used? In > general, types have to be inferred from data (or defined as casts in SQL or > in views). So, we need to describe the type inference for each input source. > And, the semantic rules that apply when converting data inside a view or > query. As an example, what happens when we convert the incompatible Parquet > INTERVAL to a Drill Interval? > General: Each issue should be discussed and resolved with development as some > of the above may be more than just a documentation issue. -- This message was sent by Atlassian JIRA (v6.3.4#6332)