Hi!

I agree with replacing the types of these columns with DATE.
Several queries will probably become a bit faster, but there is also a
chance for regression: functions like YEAR() can be potentially faster on
YYYY-MM-DD string than DATEs.

> My assumption is that none of the queries have to be changed just the SQL
that creates the table.
tpch/tpcds databases are also used in the dev environment - do you want to
change these tables too, or only the ones used in performance tests?
If yes, then there are some potential complications:
- I am not sure if all file formats support date
- If we change the test tables, then the result type will surely change is
several tests, e.g.
https://github.com/apache/impala/blob/4fa6b5260d9e28dee63a87de3bea1434706a9d05/testdata/workloads/tpch/queries/tpch-q3.test#L4
- It is also possible that there are some tests that use these date columns
to test string specific behaviour.

Csaba

On Tue, Jan 28, 2020 at 2:32 PM Laszlo Gaal <laszlo.g...@cloudera.com>
wrote:

> Sounds like a good idea to me.
>
> A quick scan of the current  TPC-H spec at
> http://www.tpc.org/tpc_documents_current_versions/pdf/tpc-h_v2.18.0.pdf
> suggests to me that such a change is conformant to the spec. Specifically:
>
> In 1.3 Data Types:
> "Date is a value whose external representation can be expressed as
> YYYY-MM-DD, where all characters are numeric. A date must be able to
> express any day within at least 14 consecutive years. There is no
> requirement specific to the internal representation of a date."
>
> In 2.1.3 Substitution Parameters and Output Data:
> "Comment 1: When dates are part of the substitution parameters, they must
> be expressed in a format that includes
> the year, month and day in integer form, in that order (e.g., YYYY-MM-DD).
> The delimiter between the year,
> month and day is not specified. Other date representations, for example the
> number of days since 1970-01-01, are
> specifically not allowed."
>
> In 2.2.3 Minor Query Modifications:
> "Date expressions - For queries that include an expression involving
> manipulation of dates (e.g.,
> adding/subtracting days/months/years, or extracting years from dates),
> vendor-specific syntax may be used
> TPC BenchmarkTM H Standard Specification Revision 2.18.0 Page 26
> instead of the specified SQL-92 syntax. Replacement syntax must have
> equivalent semantic behavior.
> Examples of acceptable implementations include "YEAR(<column>)" to extract
> the year from a date
> column or "DATE(<date>) + 3 MONTHS" to add 3 months to a date."
>
> In 4.2 DBGEN and Database Population
> 4.2.2 Definition Of Terms:
> "4.2.2.8 The term date represents a string of numeric characters separated
> by hyphens and comprised of a 4 digit year, 2 digit
> month and 2 digit day of the month."
>
> I believe that our DATE implementation conforms to all these requirements,
> so we should be OK.
>
> Thanks,
>
>   - Laszlo
>
> On Tue, Jan 28, 2020 at 1:59 PM Gabor Kaszab <gaborkas...@apache.org>
> wrote:
>
> > Hey,
> >
> > Recently I have been running some perf tests using the TPCH database and
> I
> > observed that the date columns are still stored as string even though we
> > have the date type implemented recently. I'm wondering if there are any
> > objections against changing the type of these columns to date. My
> > assumption is that none of the queries have to be changed just the SQL
> that
> > creates the table.
> >
> > Cheers,
> > Gabor
> >
>

Reply via email to