Hi! I agree with replacing the types of these columns with DATE. Several queries will probably become a bit faster, but there is also a chance for regression: functions like YEAR() can be potentially faster on YYYY-MM-DD string than DATEs.
> My assumption is that none of the queries have to be changed just the SQL that creates the table. tpch/tpcds databases are also used in the dev environment - do you want to change these tables too, or only the ones used in performance tests? If yes, then there are some potential complications: - I am not sure if all file formats support date - If we change the test tables, then the result type will surely change is several tests, e.g. https://github.com/apache/impala/blob/4fa6b5260d9e28dee63a87de3bea1434706a9d05/testdata/workloads/tpch/queries/tpch-q3.test#L4 - It is also possible that there are some tests that use these date columns to test string specific behaviour. Csaba On Tue, Jan 28, 2020 at 2:32 PM Laszlo Gaal <laszlo.g...@cloudera.com> wrote: > Sounds like a good idea to me. > > A quick scan of the current TPC-H spec at > http://www.tpc.org/tpc_documents_current_versions/pdf/tpc-h_v2.18.0.pdf > suggests to me that such a change is conformant to the spec. Specifically: > > In 1.3 Data Types: > "Date is a value whose external representation can be expressed as > YYYY-MM-DD, where all characters are numeric. A date must be able to > express any day within at least 14 consecutive years. There is no > requirement specific to the internal representation of a date." > > In 2.1.3 Substitution Parameters and Output Data: > "Comment 1: When dates are part of the substitution parameters, they must > be expressed in a format that includes > the year, month and day in integer form, in that order (e.g., YYYY-MM-DD). > The delimiter between the year, > month and day is not specified. Other date representations, for example the > number of days since 1970-01-01, are > specifically not allowed." > > In 2.2.3 Minor Query Modifications: > "Date expressions - For queries that include an expression involving > manipulation of dates (e.g., > adding/subtracting days/months/years, or extracting years from dates), > vendor-specific syntax may be used > TPC BenchmarkTM H Standard Specification Revision 2.18.0 Page 26 > instead of the specified SQL-92 syntax. Replacement syntax must have > equivalent semantic behavior. > Examples of acceptable implementations include "YEAR(<column>)" to extract > the year from a date > column or "DATE(<date>) + 3 MONTHS" to add 3 months to a date." > > In 4.2 DBGEN and Database Population > 4.2.2 Definition Of Terms: > "4.2.2.8 The term date represents a string of numeric characters separated > by hyphens and comprised of a 4 digit year, 2 digit > month and 2 digit day of the month." > > I believe that our DATE implementation conforms to all these requirements, > so we should be OK. > > Thanks, > > - Laszlo > > On Tue, Jan 28, 2020 at 1:59 PM Gabor Kaszab <gaborkas...@apache.org> > wrote: > > > Hey, > > > > Recently I have been running some perf tests using the TPCH database and > I > > observed that the date columns are still stored as string even though we > > have the date type implemented recently. I'm wondering if there are any > > objections against changing the type of these columns to date. My > > assumption is that none of the queries have to be changed just the SQL > that > > creates the table. > > > > Cheers, > > Gabor > > >