Paul Rogers created DRILL-4765:
----------------------------------

             Summary: Missing, incorrect information in Drill data types page
                 Key: DRILL-4765
                 URL: https://issues.apache.org/jira/browse/DRILL-4765
             Project: Apache Drill
          Issue Type: Improvement
          Components: Documentation
    Affects Versions: 1.6.0
            Reporter: Paul Rogers
            Priority: Minor


Consider the Drill Supported Types page: 
https://drill.apache.org/docs/supported-data-types/

A number of issues can be seen.

For BIGINT, it would be clearer to express the range as: -2^63 to 2^63-1.

For INTEGER, it would be clearer to express the range as: -2^31 to 2^31-1.

DATE: The statement "in YYYY-MM-DD format" is wrong. The internal 
representation has no format, it is just a number representing the day count. 
The format is applied only on output and varies depending on the tool used. 
Perhaps for the Drill web UI it is in ISO format.

DATE: Presumably the date is not time-zone specific. That is, 2016-07-01 is the 
first of July in both the US and India, though a given absolute time may be on 
two different dates in these locations.

DATE: We use 4713 BC as a 0-point. But, the calendar system has changed many 
times since that date. (Indeed, the current system did not even exist on that 
date.) Is this a simple projection of the current system back in time, or does 
it adjust for the discontinuties in the Gregorian calendar? This should be 
stated as it is important for any data files that contain historical dates. 
(And is why choosing a 20th-century 0-point would have been better...)

FLOAT, DOUBLE: presumably these are in the standard IEEE Standard 754 format? 
If so, let's state that.

INTERVAL: there are many ways that intervals have been represented in DB 
systems. Parquet represents data as a triple: months, days and (milli)seconds. 
Does Drill use a similar format? If not, what is the format? A normal DB can 
declare the interval as part of the data declaration. How does Drill infer the 
format? How does the user access the parts of the range?

INTERVAL: the footnote says, "Internally, INTERVAL is represented as 
INTERVALDAY or INTERVALYEAR." But, if so, then INTERVAL can't represent a time 
interval: a serious limitation. Also, we can't convert a Parquet Interval to a 
Drill interval since there is no mapping to Drill that includes months, days 
and seconds. This is a huge limitation and should be explained.

SMALLINT: This is a supported types table, but the footnotes say SMALLINT is 
not supported. We also do not list the many internal Value Vector types we 
don't support (int8, uint8, int16, uint16, uint32 and so on.) Should we list 
SMALLINT if we don't actually support it?

TIME: the format is acutally number of seconds since 2001-01-01. The "24-hour 
based time ... in hours, minutes, seconds format" confuses display format with 
internal representation. See DATE above.

TIME: Presumably the time is in local time, not UTC. That is, the time is 
12:34:56 with the time zone left unspecified.

TIME: The example for TIME is, "22:55:55.23". But, note that the example shows 
milliseconds, but the description says the time unit is seconds. Which is right?

TIME: The example shows just a time (seconds since midnight), but the 
description says that this is a timestamp: number of seconds since 2010-01-01. 
If so, then is TIME like TIMESTAMP (with a different basis)? Or is really a 
time-only value (so that the description is wrong?)

TIMESTAMP: The description says "JDBC timestamp", but this is not accurate. 
JDBC is a layer on top of a DB. So, we could say, "JDBC timestamp format".

TIMESTAMP: Explain the basis. A JDBC timestamp 
(https://docs.oracle.com/javase/8/docs/api/java/sql/Timestamp.html) expresses 
time in nanoseconds since 1970-01-01. So, does Drill also have nanosecond 
precision? The docs say, "optional milliseconds", so presumably Drill only 
keeps milliseconds, As a result, the Drill timestamp is NOT a JDBC timestamp.

TIMESTAMP: JDBC timestamps are vague. They are based on a Java Date which is 
defined as milliseconds since 1970-01-01T00:00:00 UTC. But, it seems a JDBC 
timestamp is local (it has no implied timezone). Does Drill assume that a 
TIMESTAMP is UTC (like java.util.Date) or local (like java.sql.Date)?

TIMESTAMP & DATE/TIME: We've created an incompatibility between the date & time 
format on the one hand, and TIMESTAMP on the other. We should explain how to 
convert between the two since it is non-obvious how this would be done without 
noodling out the conversion factor. Or, is a Drill timestamp also based on 
2001-01-01 like a Drill date?

TIMESTAMP: "format: yyyy-MM-dd HH:mm:ss.SSS". Again, the timestamp does not 
have a format, it is just a count of millis (or nanos, see above.) As explained 
above, formatting is done by each tool and can be whatever the user wants.

CHAR: The description says, "The default limit is 1 character. The maximum 
character limit is 2,147,483,647." which seems to apply to the CHAR type: 
CHAR(1) to CHAR(2,147,483,647). But, the footnote says, "Currently, Drill 
supports only variable-length strings." So, the "default limit..." stuff does 
not actually apply, does it?

General: In a full DB, types are important because the user declares columns of 
the given type. Thus, I can specify DECIMAL(10,2) or CHAR(5) and it means 
something. But, Drill is a query-only engine. So, how are the types used? In 
general, types have to be inferred from data (or defined as casts in SQL or in 
views). So, we need to describe the type inference for each input source. And, 
the semantic rules that apply when converting data inside a view or query. As 
an example, what happens when we convert the incompatible Parquet INTERVAL to a 
Drill Interval?

General: Each issue should be discussed and resolved with development as some 
of the above may be more than just a documentation issue.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Reply via email to