[ 
https://issues.apache.org/jira/browse/DRILL-4765?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15365240#comment-15365240
 ] 

Paul Rogers commented on DRILL-4765:
------------------------------------

I've not personally tested Drill's Parquet file write support. However, if a 
date conversion is needed on write from Drill format to Parquet format, then 
that change should also be handled by a fix for this issue.

> Missing, incorrect information in Drill data types page
> -------------------------------------------------------
>
>                 Key: DRILL-4765
>                 URL: https://issues.apache.org/jira/browse/DRILL-4765
>             Project: Apache Drill
>          Issue Type: Improvement
>          Components: Documentation
>    Affects Versions: 1.6.0
>            Reporter: Paul Rogers
>            Assignee: Bridget Bevens
>
> Consider the Drill Supported Types page: 
> https://drill.apache.org/docs/supported-data-types/
> A number of issues can be seen.
> For BIGINT, it would be clearer to express the range as: -2^63 to 2^63-1.
> For INTEGER, it would be clearer to express the range as: -2^31 to 2^31-1.
> DATE: The statement "in YYYY-MM-DD format" is wrong. The internal 
> representation has no format, it is just a number representing the day count. 
> The format is applied only on output and varies depending on the tool used. 
> Perhaps for the Drill web UI it is in ISO format.
> DATE: Presumably the date is not time-zone specific. That is, 2016-07-01 is 
> the first of July in both the US and India, though a given absolute time may 
> be on two different dates in these locations.
> DATE: We use 4713 BC as a 0-point. But, the calendar system has changed many 
> times since that date. (Indeed, the current system did not even exist on that 
> date.) Is this a simple projection of the current system back in time, or 
> does it adjust for the discontinuties in the Gregorian calendar? This should 
> be stated as it is important for any data files that contain historical 
> dates. (And is why choosing a 20th-century 0-point would have been better...)
> FLOAT, DOUBLE: presumably these are in the standard IEEE Standard 754 format? 
> If so, let's state that.
> INTERVAL: there are many ways that intervals have been represented in DB 
> systems. Parquet represents data as a triple: months, days and 
> (milli)seconds. Does Drill use a similar format? If not, what is the format? 
> A normal DB can declare the interval as part of the data declaration. How 
> does Drill infer the format? How does the user access the parts of the range?
> INTERVAL: the footnote says, "Internally, INTERVAL is represented as 
> INTERVALDAY or INTERVALYEAR." But, if so, then INTERVAL can't represent a 
> time interval: a serious limitation. Also, we can't convert a Parquet 
> Interval to a Drill interval since there is no mapping to Drill that includes 
> months, days and seconds. This is a huge limitation and should be explained.
> SMALLINT: This is a supported types table, but the footnotes say SMALLINT is 
> not supported. We also do not list the many internal Value Vector types we 
> don't support (int8, uint8, int16, uint16, uint32 and so on.) Should we list 
> SMALLINT if we don't actually support it?
> TIME: the format is acutally number of seconds since 2001-01-01. The "24-hour 
> based time ... in hours, minutes, seconds format" confuses display format 
> with internal representation. See DATE above.
> TIME: Presumably the time is in local time, not UTC. That is, the time is 
> 12:34:56 with the time zone left unspecified.
> TIME: The example for TIME is, "22:55:55.23". But, note that the example 
> shows milliseconds, but the description says the time unit is seconds. Which 
> is right?
> TIME: The example shows just a time (seconds since midnight), but the 
> description says that this is a timestamp: number of seconds since 
> 2010-01-01. If so, then is TIME like TIMESTAMP (with a different basis)? Or 
> is really a time-only value (so that the description is wrong?)
> TIMESTAMP: The description says "JDBC timestamp", but this is not accurate. 
> JDBC is a layer on top of a DB. So, we could say, "JDBC timestamp format".
> TIMESTAMP: Explain the basis. A JDBC timestamp 
> (https://docs.oracle.com/javase/8/docs/api/java/sql/Timestamp.html) expresses 
> time in nanoseconds since 1970-01-01. So, does Drill also have nanosecond 
> precision? The docs say, "optional milliseconds", so presumably Drill only 
> keeps milliseconds, As a result, the Drill timestamp is NOT a JDBC timestamp.
> TIMESTAMP: JDBC timestamps are vague. They are based on a Java Date which is 
> defined as milliseconds since 1970-01-01T00:00:00 UTC. But, it seems a JDBC 
> timestamp is local (it has no implied timezone). Does Drill assume that a 
> TIMESTAMP is UTC (like java.util.Date) or local (like java.sql.Date)?
> TIMESTAMP & DATE/TIME: We've created an incompatibility between the date & 
> time format on the one hand, and TIMESTAMP on the other. We should explain 
> how to convert between the two since it is non-obvious how this would be done 
> without noodling out the conversion factor. Or, is a Drill timestamp also 
> based on 2001-01-01 like a Drill date?
> TIMESTAMP: "format: yyyy-MM-dd HH:mm:ss.SSS". Again, the timestamp does not 
> have a format, it is just a count of millis (or nanos, see above.) As 
> explained above, formatting is done by each tool and can be whatever the user 
> wants.
> CHAR: The description says, "The default limit is 1 character. The maximum 
> character limit is 2,147,483,647." which seems to apply to the CHAR type: 
> CHAR(1) to CHAR(2,147,483,647). But, the footnote says, "Currently, Drill 
> supports only variable-length strings." So, the "default limit..." stuff does 
> not actually apply, does it?
> General: In a full DB, types are important because the user declares columns 
> of the given type. Thus, I can specify DECIMAL(10,2) or CHAR(5) and it means 
> something. But, Drill is a query-only engine. So, how are the types used? In 
> general, types have to be inferred from data (or defined as casts in SQL or 
> in views). So, we need to describe the type inference for each input source. 
> And, the semantic rules that apply when converting data inside a view or 
> query. As an example, what happens when we convert the incompatible Parquet 
> INTERVAL to a Drill Interval?
> General: Each issue should be discussed and resolved with development as some 
> of the above may be more than just a documentation issue.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Reply via email to