Hello Mich,

Thank you for the code, but it does not help in the cases that I described
above. No doubt the TIME type can be emulated via STRING, as can other
types. Let me highlight the cases where direct support of the new type in
Spark SQL would be useful for users:
1. Load TIME values from data sources that support such a type, for
instance Parquet (see the sketch after this list).
2. Save values to existing tables with TIME columns via JDBC. Such tables
can be created by many DBMSs.
3. Migrate existing SQL code that operates over TIME values from other
systems to Spark SQL. For example:
===
SELECT make_timestamp(date'2025-02-17', open_time +
holiday_shift_interval), storeid
FROM store_info
WHERE time_trunc('HOUR', open_time) - time'06:00' >= interval '3' HOUR
===
When the target system doesn't support the TIME type natively and the user
has tons of such code, that alone might be a blocker for starting the
migration.
4. Even if a user doesn't use the TIME column of a remote table, they
still cannot manage or access such a table from Spark, because it is
currently impossible to create a table with a TIME column in Spark SQL.
5. Additional optimizations such as predicate pushdown and dynamic
partition pruning over TIME values in data sources might speed up user
queries.
6. Automatic TIME type inference from data sources simplifies the use of
Spark SQL.
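
As a concrete illustration of point 1, here is a minimal sketch that
writes Parquet's TIME logical type with PyArrow (the file path and column
name are made up); today Spark SQL has no data type to map such a column
to:
===
import datetime

import pyarrow as pa
import pyarrow.parquet as pq

# Write a column with Parquet's TIME logical type,
# annotated as TIME(isAdjustedToUTC=false, unit=MICROS).
table = pa.table({
    "open_time": pa.array(
        [datetime.time(9, 0), datetime.time(17, 30)],
        type=pa.time64("us"),
    )
})
pq.write_table(table, "/tmp/store_hours.parquet")

# Reading this file with spark.read.parquet() fails today, because
# Spark SQL has no TIME type to represent the column.
===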

Yours faithfully,
Max Gekk


On Fri, Feb 14, 2025 at 1:53 AM Mich Talebzadeh <mich.talebza...@gmail.com>
wrote:

> hm, I tried the attached code. It simulates handling TIME data in Spark
> using Parquet files. Since Spark does not support a TIME data type
> directly, it follows these steps (a rough reconstruction follows the
> list):
>
>    - Stores time as a STRING in a Parquet file using PyArrow.
>    - Reads the Parquet file using PyArrow, Pandas, and Spark to verify
>    the data.
>    - Converts STRING to TIMESTAMP in Spark.
>    - Extracts hour, minute, and second from the TIMESTAMP column.
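>
> The attachment is not shown here, but a minimal reconstruction of these
> steps (the file name is taken from the run output below; everything else
> is assumed) might look like:
> ===
> import pyarrow as pa
> import pyarrow.parquet as pq
> from pyspark.sql import SparkSession, functions as F
>
> # Step 1: store time-of-day values as STRING in a Parquet file.
> table = pa.table({"event_time": ["12:34:56.789000", "23:45:43.210000"]})
> pq.write_table(table, "time_example_fixed.parquet")
>
> # Step 2: read the file back with Spark (the PyArrow and Pandas checks
> # are omitted for brevity).
> spark = SparkSession.builder.getOrCreate()
> df = spark.read.parquet("time_example_fixed.parquet")
>
> # Step 3: convert STRING to TIMESTAMP by anchoring to a dummy date.
> df = df.withColumn(
>     "ts", F.to_timestamp(F.concat(F.lit("1970-01-01 "), "event_time"))
> )
>
> # Step 4: extract hour, minute, and second from the TIMESTAMP column.
> df = (df.withColumn("hour", F.hour("ts"))
>         .withColumn("minute", F.minute("ts"))
>         .withColumn("second", F.second("ts")))
> df.show(truncate=False)
> ===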
>
> If you run it, you will get this
>
> Parquet file 'time_example_fixed.parquet' created successfully.
> Checking if Parquet file exists...
> Parquet file 'time_example_fixed.parquet' exists.
> File size: 1686 bytes
> Parquet file is not empty.
> Reading with PyArrow
> PyArrow Data Preview:
>         event_time
> 0  12:34:56.789000
> 1  23:45:43.210000
>
> Reading with Pandas
> Pandas Data Preview:
>         event_time
> 0  12:34:56.789000
> 1  23:45:43.210000
>
> Reading with Spark
>
> Read the Parquet file using Spark
>
> Data loaded from Parquet file:
> +----------+
> |event_time|
> +----------+
> +----------+
>
> root
>  |-- event_time: string (nullable = true)
>
> Convert STRING time to TIMESTAMP HH:mm:ss.SSSSSS format
> Extract hours, minutes, and seconds
> Data with extracted time components
> +----------+----+------+------+
> |event_time|hour|minute|second|
> +----------+----+------+------+
> +----------+----+------+------+
>
> Spark fails to display the data, while PyArrow and Pandas work fine.
> This suggests that Spark is failing to interpret the Parquet file
> correctly.
> The problem is likely caused by one or more of the following issues (a
> quick check follows the list):
>
>    1. PyArrow-written Parquet metadata is not fully compatible with Spark
>    2. Spark reads the schema but fails to detect rows
>    3. Schema mismatch between PyArrow and Spark
>    4. Corrupted or empty Parquet file
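>
> One quick check to narrow this down (a sketch; it assumes the file sits
> in the current working directory) is to inspect the footer metadata that
> PyArrow wrote:
> ===
> import pyarrow.parquet as pq
>
> # The Parquet footer records the row count: a non-zero value rules out
> # cases 2 and 4 and points to a path or compatibility problem instead.
> meta = pq.read_metadata("time_example_fixed.parquet")
> print(meta.num_rows)
> print(meta.schema)
> ===
> If num_rows is non-zero, it may also be worth pointing Spark at an
> absolute file:// URI, since Spark and PyArrow can resolve relative paths
> against different working directories.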
>
> Feel free to take the program and make it work
>
> HTH
>
> Dr Mich Talebzadeh,
> Architect | Data Science | Financial Crime | Forensic Analysis | GDPR
>
>    view my Linkedin profile
> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>
>
>
>
>
> On Wed, 12 Feb 2025 at 19:53, Max Gekk <max.g...@gmail.com> wrote:
>
>> Hello Mich,
>>
>> > However, if you only need to work with time, you can do one of the following
>>
>> 1. Let's say a Spark SQL user would like to load TIME values stored in
>> files in the Parquet format, which supports the TIME logical type:
>> https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#time.
>> None of your proposed workarounds allows loading such data.
>> 2. One of the goals of introducing the new type is to simplify migrations
>> to Spark SQL from systems that support the TIME type, such as PostgreSQL,
>> Snowflake, Google SQL, Amazon Redshift, Teradata, and DB2. Rewriting their
>> existing SQL code to store TIME values as DATE or INTERVAL DAY TO SECOND
>> looks ugly, IMHO, and can lead to error-prone code.
>>
>> Yours faithfully,
>> Max Gekk
>>
>> On Wed, Feb 12, 2025 at 8:13 PM Mich Talebzadeh <
>> mich.talebza...@gmail.com> wrote:
>>
>>> Not entirely convinced we need it!
>>>
>>> For example, Oracle does not have it. Oracle treats date and time as a
>>> single entity, as they are often used together in real-world
>>> applications. This approach simplifies many operations, such as sorting,
>>> filtering, and calculations involving both date and time. However, if
>>> you only need to work with time, you can do one of the following (a
>>> rough Spark sketch of option 1 follows the list):
>>>
>>>    1. Use DATE or TIMESTAMP to store time, and extract the time portion
>>>    using TO_CHAR.
>>>    2. Use INTERVAL DAY TO SECOND for durations or time intervals.
>>>    3. If you only care about time, you can ignore the date portion or
>>>    set it to a default value.
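>>>
>>> A rough Spark translation of option 1 (the column name and sample value
>>> are made up here) could look like:
>>> ===
>>> from pyspark.sql import SparkSession, functions as F
>>>
>>> spark = SparkSession.builder.getOrCreate()
>>> df = spark.sql("SELECT timestamp'2025-02-12 12:34:56.789' AS event_ts")
>>>
>>> # Keep a full TIMESTAMP and format out only the time-of-day portion,
>>> # analogous to Oracle's TO_CHAR(ts, 'HH24:MI:SS').
>>> df = df.withColumn("time_only", F.date_format("event_ts", "HH:mm:ss.SSS"))
>>> df.show(truncate=False)
>>> ===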
>>>
>>>  HTH
>>>
>>> Dr Mich Talebzadeh,
>>> Architect | Data Science | Financial Crime | Forensic Analysis | GDPR
>>>
>>>    view my Linkedin profile
>>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>>
>>> On Wed, 12 Feb 2025 at 18:56, Sakthi <sak...@apache.org> wrote:
>>>
>>>> Thanks for the proposal, Max. This looks very promising. I'd also be
>>>> happy to contribute if it helps with task completion!
>>>>
>>>> Regards,
>>>> Sakthi
>>>>
>>>> On Wed, Feb 12, 2025 at 10:36 AM Max Gekk <max.g...@gmail.com> wrote:
>>>>
>>>>> Hi Dongjoon,
>>>>>
>>>>> > According to SPIP, is this targeting Apache Spark 4.2.0?
>>>>>
>>>>> Some tasks could be done in parallel, but if only one person works on
>>>>> this sequentially, in the worst case it might be finished close to
>>>>> 4.2.0.
>>>>>
>>>>> Best regards,
>>>>> Max Gekk
>>>>>
>>>>> On Wed, Feb 12, 2025 at 5:48 PM Dongjoon Hyun <dongj...@apache.org>
>>>>> wrote:
>>>>>
>>>>>> According to SPIP, is this targeting Apache Spark 4.2.0?
>>>>>>
>>>>>> > Q7. How long will it take?
>>>>>> > In total it might take around 9 months.
>>>>>>
>>>>>> Dongjoon.
>>>>>>
>>>>>> On 2025/02/12 09:38:56 Max Gekk wrote:
>>>>>> > Hi All,
>>>>>> >
>>>>>> > I would like to propose a new data type TIME, which represents only
>>>>>> > time values without the date part, in contrast to TIMESTAMP_NTZ. The
>>>>>> > new type should improve:
>>>>>> > - migration of SQL code from other DBMSs where such a type is
>>>>>> >   supported
>>>>>> > - reading/writing it from/to data sources such as Parquet
>>>>>> > - conformance to the SQL standard
>>>>>> >
>>>>>> > SPIP: https://issues.apache.org/jira/browse/SPARK-51162
>>>>>> >
>>>>>> > Your comments and feedback would be greatly appreciated.
>>>>>> >
>>>>>> > Yours faithfully,
>>>>>> > Max Gekk
>>>>>> >
>>>>>>
>>>>>> ---------------------------------------------------------------------
>>>>>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>>>>>
>>>>>>
