Implementing the type seems like a good proposal to handle the mentioned use
cases, mainly the migration of data.
A lot of circuitous code can be written to handle such scenarios, but nothing
beats a straightforward type implementation IMO.

Thanks,
Subhasis Mukherjee

On Mon, Feb 17, 2025, 9:37 PM Max Gekk <max.g...@gmail.com> wrote:

> Hello Mich,
>
> Thank you for the provided code, but it does not help in the cases that I
> described above. No doubt you can emulate the TIME type via STRING as
> well as other types. Let me highlight the cases where direct support of the
> new type in Spark SQL could be useful for users:
> 1. Load TIME values from datasources that support such a type, for
> instance parquet.
> 2. Save values to existing tables with TIME columns via JDBC. Such tables
> can be created by many DBMSs.
> 3. Migrate existing SQL code which operates over TIME columns from other
> systems to Spark SQL. For example:
> ===
> SELECT make_timestamp(date'2025-02-17', open_time +
> holiday_shift_interval), storeid
> FROM store_info
> WHERE time_trunc('HOUR', open_time) - time'06:00' >= interval '3' HOUR
> ===
> When the target system doesn't support the TIME type natively and the
> user has tons of such code, that can be a blocker for starting the
> migration.
> 4. Even if a user doesn't use the TIME column of a remote table, they
> cannot manage or access such a table from Spark today because it is
> impossible to create a table with a TIME column in Spark SQL.
> 5. Additional optimizations, such as predicate pushdown and dynamic
> partition pruning over TIME values in datasources, might speed up user
> queries.
> 6. Automatic TIME type inference from datasources simplifies usage of
> Spark SQL.
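>
> As a hypothetical illustration of cases 3 and 4, a sketch in PySpark (the
> time'...' literal and the TIME column type follow the SQL standard and the
> SPIP; none of this works in Spark today):
> ===
> # Hypothetical once TIME lands in Spark SQL; today these statements fail.
> spark.sql("CREATE TABLE store_info (storeid INT, open_time TIME)")
> spark.sql("INSERT INTO store_info VALUES (1, time'09:30:00')")
> spark.sql(
>     "SELECT storeid FROM store_info WHERE open_time >= time'06:00:00'"
> ).show()
> ===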
>
> Yours faithfully,
> Max Gekk
>
>
> On Fri, Feb 14, 2025 at 1:53 AM Mich Talebzadeh <mich.talebza...@gmail.com>
> wrote:
>
>> hm, I tried the attached code. It simulates handling TIME
>> data in Spark using Parquet files. Since Spark does not have a native
>> TIME data type, it follows these steps:
>>
>>    - Stores time as a STRING in a Parquet file using PyArrow.
>>    - Reads the Parquet file using PyArrow, Pandas, and Spark to verify
>>    the data.
>>    - Converts STRING to TIMESTAMP in Spark.
>>    - Extracts hour, minute, and second from the TIMESTAMP column.
>>
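>> A minimal sketch of what such a script might look like, reconstructed from
>> the printed output below (the original attachment is not included here, so
>> treat this as an approximation rather than the actual code):
>> ===
>> import pyarrow as pa
>> import pyarrow.parquet as pq
>> from pyspark.sql import SparkSession
>> from pyspark.sql.functions import to_timestamp, hour, minute, second
>>
>> path = "time_example_fixed.parquet"
>> # Store time-of-day values as STRING, since Spark has no TIME type.
>> table = pa.table({"event_time": pa.array(
>>     ["12:34:56.789000", "23:45:43.210000"], type=pa.string())})
>> pq.write_table(table, path)
>>
>> spark = SparkSession.builder.getOrCreate()
>> df = spark.read.parquet(path)
>> # Convert the STRING to TIMESTAMP and extract the time components.
>> df = df.withColumn("ts", to_timestamp("event_time", "HH:mm:ss.SSSSSS"))
>> df.select("event_time",
>>           hour("ts").alias("hour"),
>>           minute("ts").alias("minute"),
>>           second("ts").alias("second")).show()
>> ===
>>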
>> If you run it, you will get this
>>
>> Parquet file 'time_example_fixed.parquet' created successfully.
>> Checking if Parquet file exists...
>> Parquet file 'time_example_fixed.parquet' exists.
>> File size: 1686 bytes
>> Parquet file is not empty.
>> Reading with PyArrow
>> PyArrow Data Preview:
>>         event_time
>> 0  12:34:56.789000
>> 1  23:45:43.210000
>>
>> Reading with Pandas
>> Pandas Data Preview:
>>         event_time
>> 0  12:34:56.789000
>> 1  23:45:43.210000
>>
>> Reading with Spark
>>
>> Read the Parquet file using Spark
>>
>> Data loaded from Parquet file:
>> +----------+
>> |event_time|
>> +----------+
>> +----------+
>>
>> root
>>  |-- event_time: string (nullable = true)
>>
>> Convert STRING time to TIMESTAMP HH:mm:ss.SSSSSS format
>> Extract hours, minutes, and seconds
>> Data with extracted time components
>> +----------+----+------+------+
>> |event_time|hour|minute|second|
>> +----------+----+------+------+
>> +----------+----+------+------+
>>
>>  It fails to display data in Spark, while PyArrow and Pandas work fine,
>> which suggests that Spark is failing to interpret the Parquet file
>> correctly. The problem is likely caused by one or more of the following
>> issues:
>>
>>    1. PyArrow-written Parquet metadata is not fully compatible with Spark
>>    2. Spark reads the schema but fails to detect rows
>>    3. Schema mismatch between PyArrow and Spark
>>    4. Corrupted or empty Parquet file
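>>
>> One quick way to narrow this down (assuming the same local file and a
>> running Spark session named spark) is to compare the footer PyArrow wrote
>> against what Spark counts:
>> ===
>> import pyarrow.parquet as pq
>>
>> meta = pq.read_metadata("time_example_fixed.parquet")
>> print(meta.num_rows)  # row count according to the parquet footer
>> print(meta.schema)    # physical and logical types of each column
>> print(spark.read.parquet("time_example_fixed.parquet").count())
>> ===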
>>
>> Feel free to take the program and make it work.
>>
>> HTH
>>
>> Dr Mich Talebzadeh,
>> Architect | Data Science | Financial Crime | Forensic Analysis | GDPR
>>
>>    view my Linkedin profile
>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>
>>
>>
>>
>>
>> On Wed, 12 Feb 2025 at 19:53, Max Gekk <max.g...@gmail.com> wrote:
>>
>>> Hello Mich,
>>>
>>> > However, if you only need to work with time, you can do the following
>>>
>>> 1. Let's say a Spark SQL user would like to load TIME values stored in
>>> files in the parquet format which supports the TIME logical type
>>> https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#time.
>>> None of your proposed workarounds allow loading such data.
>>> 2. One of the goals of introducing the new type is to simplify migrations
>>> from systems that support the TIME type, such as PostgreSQL, Snowflake,
>>> Google SQL, Amazon Redshift, Teradata, and DB2, to Spark SQL. Rewriting
>>> their existing SQL code to store TIME values as DATE or INTERVAL DAY TO
>>> SECOND looks ugly, IMHO, and can lead to error-prone code.
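>>>
>>> For the first case, a minimal way to produce such a file is PyArrow's
>>> time64 type, which is written with the parquet TIME logical type (file
>>> and column names here are made up for illustration):
>>> ===
>>> from datetime import time
>>>
>>> import pyarrow as pa
>>> import pyarrow.parquet as pq
>>>
>>> # open_time is stored with the parquet TIME(MICROS) logical type.
>>> table = pa.table({"open_time": pa.array(
>>>     [time(9, 30), time(17, 45, 30)], type=pa.time64("us"))})
>>> pq.write_table(table, "time_values.parquet")
>>> # As of today, spark.read.parquet("time_values.parquet") cannot load
>>> # this column.
>>> ===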
>>>
>>> Yours faithfully,
>>> Max Gekk
>>>
>>> On Wed, Feb 12, 2025 at 8:13 PM Mich Talebzadeh <
>>> mich.talebza...@gmail.com> wrote:
>>>
>>>> Not entirely convinced we need it!
>>>>
>>>> For example, Oracle does not have it. Oracle treats date and time as a
>>>> single entity, as they are often used together in real-world applications.
>>>> This approach simplifies many operations, such as sorting, filtering, and
>>>> calculations involving both date and time. However, if you only need to
>>>> work with time, you can do the following:
>>>>
>>>>    1. Use DATE or TIMESTAMP to store time, and extract the time
>>>>    portion using TO_CHAR.
>>>>    2. Use INTERVAL DAY TO SECOND for durations or time intervals.
>>>>    3. If you only care about time, you can ignore the date portion or
>>>>    set it to a default value.
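>>>>
>>>> A rough sketch of points 1 and 2 in Spark terms (the data is made up,
>>>> and Spark's date_format plays the role of Oracle's TO_CHAR):
>>>> ===
>>>> from pyspark.sql.functions import date_format, expr
>>>>
>>>> df = spark.createDataFrame([("2025-02-12 09:30:00",)], ["event_ts"])
>>>> df = df.withColumn("event_ts", expr("CAST(event_ts AS TIMESTAMP)"))
>>>> # 1. store a full TIMESTAMP and extract the time portion as text
>>>> df = df.withColumn("time_only", date_format("event_ts", "HH:mm:ss"))
>>>> # 2. use an INTERVAL DAY TO SECOND value for durations
>>>> df = df.withColumn("shift", expr("INTERVAL '8' HOUR"))
>>>> df.show(truncate=False)
>>>> ===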
>>>>
>>>>  HTH
>>>>
>>>> Dr Mich Talebzadeh,
>>>> Architect | Data Science | Financial Crime | Forensic Analysis | GDPR
>>>>
>>>>    view my Linkedin profile
>>>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>>>
>>>> On Wed, 12 Feb 2025 at 18:56, Sakthi <sak...@apache.org> wrote:
>>>>
>>>>> Thanks for the proposal, Max. This looks very promising. I'd also be
>>>>> happy to contribute if it helps with task completion!
>>>>>
>>>>> Regards,
>>>>> Sakthi
>>>>>
>>>>> On Wed, Feb 12, 2025 at 10:36 AM Max Gekk <max.g...@gmail.com> wrote:
>>>>>
>>>>>> Hi Dongjoon,
>>>>>>
>>>>>> > According to SPIP, is this targeting Apache Spark 4.2.0?
>>>>>>
>>>>>> Some tasks could be done in parallel, but if only one person works on
>>>>>> this sequentially, in the worst case it might be finished close to
>>>>>> 4.2.0.
>>>>>>
>>>>>> Best regards,
>>>>>> Max Gekk
>>>>>>
>>>>>> On Wed, Feb 12, 2025 at 5:48 PM Dongjoon Hyun <dongj...@apache.org>
>>>>>> wrote:
>>>>>>
>>>>>>> According to SPIP, is this targeting Apache Spark 4.2.0?
>>>>>>>
>>>>>>> > Q7. How long will it take?
>>>>>>> > In total it might take around 9 months.
>>>>>>>
>>>>>>> Dongjoon.
>>>>>>>
>>>>>>> On 2025/02/12 09:38:56 Max Gekk wrote:
>>>>>>> > Hi All,
>>>>>>> >
>>>>>>> > I would like to propose a new data type TIME which represents only
>>>>>>> > time values without the date part, as compared to TIMESTAMP_NTZ. The
>>>>>>> > new type should improve:
>>>>>>> > - migrations of SQL code from other DBMS where such a type is
>>>>>>> > supported
>>>>>>> > - reading/writing from/to data sources such as parquet
>>>>>>> > - conformance to the SQL standard
>>>>>>> >
>>>>>>> > SPIP: https://issues.apache.org/jira/browse/SPARK-51162
>>>>>>> >
>>>>>>> > Your comments and feedback would be greatly appreciated.
>>>>>>> >
>>>>>>> > Yours faithfully,
>>>>>>> > Max Gekk
>>>>>>> >
>>>>>>>
>>>>>>> ---------------------------------------------------------------------
>>>>>>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>>>>>>
>>>>>>>
