Implementing the type seems a good proposal to handle the mentioned use cases, mainly the migration of data. Plenty of circuitous code can be written to handle such scenarios, but nothing beats a straightforward type implementation IMO.
Thanks,
Subhasis Mukherjee

On Mon, Feb 17, 2025, 9:37 PM Max Gekk <max.g...@gmail.com> wrote:

> Hello Mich,
>
> Thank you for the provided code, but it seems useless in the cases that I
> described above. No doubt you can emulate the TIME type via STRING as
> well as via other types. Let me highlight the cases where direct support
> of the new type by Spark SQL could be useful for users:
> 1. Load TIME values from datasources that support such a type, for
> instance from Parquet.
> 2. Save values to existing tables with TIME columns via JDBC. Such tables
> can be created by many DBMSs.
> 3. Migrate existing SQL code which operates over TIME values from other
> systems to Spark SQL. For example:
> ===
> SELECT make_timestamp(date'2025-02-17', open_time + holiday_shift_interval),
>        storeid
> FROM store_info
> WHERE time_trunc('HOUR', open_time) - time'06:00' >= interval '3' HOUR
> ===
> When the target system doesn't support the TIME type natively, and the
> user has tons of such code, it might be a stopping factor for starting
> the migration.
> 4. Even if a user doesn't use a TIME type column of a remote table, she/he
> cannot manage or get access to such a table from Spark for now, because
> this user just cannot create a table with a TIME column in Spark SQL.
> 5. Additional optimizations like predicate pushdown and dynamic partition
> pruning over TIME values in datasources might speed up user queries.
> 6. Automatic TIME type inference from datasources simplifies usage of
> Spark SQL.
>
> Yours faithfully,
> Max Gekk
>
>
> On Fri, Feb 14, 2025 at 1:53 AM Mich Talebzadeh <mich.talebza...@gmail.com>
> wrote:
>
>> hm, I tried the attached code. It simulates handling TIME data in Spark
>> using Parquet files. Since Spark does not support a direct TIME datatype,
>> it follows these steps:
>>
>>    - Stores time as a STRING in a Parquet file using PyArrow.
>>    - Reads the Parquet file using PyArrow, Pandas, and Spark to verify
>>      the data.
>>    - Converts STRING to TIMESTAMP in Spark.
>>    - Extracts hour, minute, and second from the TIMESTAMP column.
>>
>> If you run it, you will get this:
>>
>> Parquet file 'time_example_fixed.parquet' created successfully.
>> Checking if Parquet file exists...
>> Parquet file 'time_example_fixed.parquet' exists.
>> File size: 1686 bytes
>> Parquet file is not empty.
>> Reading with PyArrow
>> PyArrow Data Preview:
>>         event_time
>> 0  12:34:56.789000
>> 1  23:45:43.210000
>>
>> Reading with Pandas
>> Pandas Data Preview:
>>         event_time
>> 0  12:34:56.789000
>> 1  23:45:43.210000
>>
>> Reading with Spark
>>
>> Read the Parquet file using Spark
>>
>> Data loaded from Parquet file:
>> +----------+
>> |event_time|
>> +----------+
>> +----------+
>>
>> root
>>  |-- event_time: string (nullable = true)
>>
>> Convert STRING time to TIMESTAMP HH:mm:ss.SSSSSS format
>> Extract hours, minutes, and seconds
>> Data with extracted time components
>> +----------+----+------+------+
>> |event_time|hour|minute|second|
>> +----------+----+------+------+
>> +----------+----+------+------+
>>
>> It fails to display data in Spark while PyArrow and Pandas work fine.
>> This suggests that Spark is failing to interpret the Parquet file
>> correctly. The problem is likely caused by one or more of the following
>> issues (a condensed sketch of the intended round trip follows the list):
>>
>> 1. PyArrow-written Parquet metadata is not fully compatible with Spark
>> 2. Spark reads the schema but fails to detect rows
>> 3. Schema mismatch between PyArrow and Spark
>> 4. Corrupted or empty Parquet file
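>>
>> A condensed, untested sketch of what the round trip is meant to do (same
>> file and column names as in the output above; the to_timestamp pattern
>> and reading the exact file path rather than a directory are assumptions):
>>
>> import pyarrow as pa
>> import pyarrow.parquet as pq
>> from pyspark.sql import SparkSession, functions as F
>>
>> # Store the TIME values as plain strings, as in the attached program.
>> table = pa.table({"event_time": ["12:34:56.789000", "23:45:43.210000"]})
>> pq.write_table(table, "time_example_fixed.parquet")
>>
>> spark = SparkSession.builder.appName("time-as-string").getOrCreate()
>>
>> # Point Spark at the exact file that was written; a file/directory
>> # mix-up is one plausible cause of the empty result above.
>> df = spark.read.parquet("time_example_fixed.parquet")
>>
>> # Emulate a TIME column: parse the string into a TIMESTAMP (the date
>> # part defaults to 1970-01-01) and extract the components.
>> df = (df
>>       .withColumn("event_ts", F.to_timestamp("event_time", "HH:mm:ss.SSSSSS"))
>>       .withColumn("hour", F.hour("event_ts"))
>>       .withColumn("minute", F.minute("event_ts"))
>>       .withColumn("second", F.second("event_ts")))
>> df.show(truncate=False)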
>>
>> Feel free to take the program and make it work.
>>
>> HTH
>>
>> Dr Mich Talebzadeh,
>> Architect | Data Science | Financial Crime | Forensic Analysis | GDPR
>>
>> view my Linkedin profile
>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>
>>
>> On Wed, 12 Feb 2025 at 19:53, Max Gekk <max.g...@gmail.com> wrote:
>>
>>> Hello Mich,
>>>
>>> > However, if you only need to work with time, you can do like below
>>>
>>> 1. Let's say a Spark SQL user would like to load TIME values stored in
>>> files in the Parquet format, which supports the TIME logical type:
>>> https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#time.
>>> None of your proposed workarounds allows loading such data.
>>> 2. One of the goals of introducing the new type is simplifying migrations
>>> from systems that support the TIME type, like PostgreSQL, Snowflake,
>>> Google SQL, Amazon Redshift, Teradata, and DB2, to Spark SQL. Rewriting
>>> their existing SQL code to store TIME values as DATE or INTERVAL DAY TO
>>> SECOND looks ugly, IMHO, and can lead to error-prone code.
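>>>
>>> To make point 1 concrete, here is a short, untested sketch (PyArrow's
>>> time64 is written as the Parquet TIME logical type; the file and column
>>> names are made up):
>>>
>>> import datetime
>>> import pyarrow as pa
>>> import pyarrow.parquet as pq
>>>
>>> # time64("us") is stored as the Parquet TIME(MICROS) logical type.
>>> table = pa.table({
>>>     "open_time": pa.array([datetime.time(6, 0), datetime.time(9, 30)],
>>>                           type=pa.time64("us"))
>>> })
>>> pq.write_table(table, "store_info.parquet")
>>>
>>> # Without a TIME type, Spark has no Catalyst type to map this column to,
>>> # so spark.read.parquet("store_info.parquet") fails instead of loading
>>> # the data (the exact error depends on the Spark version).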
>>>
>>> Yours faithfully,
>>> Max Gekk
>>>
>>> On Wed, Feb 12, 2025 at 8:13 PM Mich Talebzadeh <mich.talebza...@gmail.com>
>>> wrote:
>>>
>>>> Not entirely convinced we need it!
>>>>
>>>> For example, Oracle does not have it. Oracle treats date and time as a
>>>> single entity, as they are often used together in real-world
>>>> applications. This approach simplifies many operations, such as sorting,
>>>> filtering, and calculations involving both date and time. However, if
>>>> you only need to work with time, you can do like below (a rough Spark
>>>> sketch of these workarounds appears at the end of this thread):
>>>>
>>>>    1. Use DATE or TIMESTAMP to store time, and extract the time
>>>>       portion using TO_CHAR.
>>>>    2. Use INTERVAL DAY TO SECOND for durations or time intervals.
>>>>    3. If you only care about time, you can ignore the date portion or
>>>>       set it to a default value.
>>>>
>>>> HTH
>>>>
>>>> Dr Mich Talebzadeh,
>>>> Architect | Data Science | Financial Crime | Forensic Analysis | GDPR
>>>>
>>>> view my Linkedin profile
>>>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>>>
>>>> On Wed, 12 Feb 2025 at 18:56, Sakthi <sak...@apache.org> wrote:
>>>>
>>>>> Thanks for the proposal, Max. This looks very promising. I'd also be
>>>>> happy to contribute if it helps with task completion!
>>>>>
>>>>> Regards,
>>>>> Sakthi
>>>>>
>>>>> On Wed, Feb 12, 2025 at 10:36 AM Max Gekk <max.g...@gmail.com> wrote:
>>>>>
>>>>>> Hi Dongjoon,
>>>>>>
>>>>>> > According to SPIP, is this targeting Apache Spark 4.2.0?
>>>>>>
>>>>>> Some tasks could be done in parallel, but if only one person works on
>>>>>> this sequentially, in the worst case it might be finished close to
>>>>>> 4.2.0.
>>>>>>
>>>>>> Best regards,
>>>>>> Max Gekk
>>>>>>
>>>>>> On Wed, Feb 12, 2025 at 5:48 PM Dongjoon Hyun <dongj...@apache.org>
>>>>>> wrote:
>>>>>>
>>>>>>> According to SPIP, is this targeting Apache Spark 4.2.0?
>>>>>>>
>>>>>>> > Q7. How long will it take?
>>>>>>> > In total it might take around 9 months.
>>>>>>>
>>>>>>> Dongjoon.
>>>>>>>
>>>>>>> On 2025/02/12 09:38:56 Max Gekk wrote:
>>>>>>> > Hi All,
>>>>>>> >
>>>>>>> > I would like to propose a new data type TIME, which represents
>>>>>>> > only time values without the date part, in contrast to
>>>>>>> > TIMESTAMP_NTZ. The new type should improve:
>>>>>>> > - migrations of SQL code from other DBMSs where such a type is
>>>>>>> >   supported
>>>>>>> > - reading/writing it from/to data sources such as Parquet
>>>>>>> > - conformance to the SQL standard
>>>>>>> >
>>>>>>> > SPIP: https://issues.apache.org/jira/browse/SPARK-51162
>>>>>>> >
>>>>>>> > Your comments and feedback would be greatly appreciated.
>>>>>>> >
>>>>>>> > Yours faithfully,
>>>>>>> > Max Gekk
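
For reference, a rough sketch of the workaround in Mich's points 1-3 above,
translated to Spark (hypothetical table and column names; Spark's
date_format stands in for Oracle's TO_CHAR):

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("time-workaround").getOrCreate()

# Point 3: keep the time of day inside a full TIMESTAMP whose date part is
# a fixed default (1970-01-01).
df = spark.createDataFrame(
    [("store-1", "1970-01-01 06:00:00"), ("store-2", "1970-01-01 09:30:00")],
    ["storeid", "open_ts"],
).withColumn("open_ts", F.to_timestamp("open_ts"))

# Point 1: extract the time portion as text.
df = df.withColumn("open_time", F.date_format("open_ts", "HH:mm:ss"))

# Point 2: use day-time interval arithmetic for shifts and durations.
df = df.withColumn("shifted_ts", F.expr("open_ts + INTERVAL '3' HOUR"))

df.show(truncate=False)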