Hi,

I would like to add a side note regarding the discussion process and the
current title of the proposal. The title '[DISCUSS] SPARK-46122: Set
spark.sql.legacy.createHiveTableByDefault to false' focuses on a specific
configuration parameter, which might lead some participants to overlook
its broader implications (as I and others have raised). I believe that a
more descriptive title, one encompassing the broader discussion of default
behaviours for creating Hive tables in Spark SQL, could encourage greater
engagement within the community. This is an important topic that deserves
thorough consideration.

HTH

Mich Talebzadeh,
Technologist | Architect | Data Engineer | Generative AI | FinCrime
London
United Kingdom


   view my Linkedin profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>


 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* The information provided is correct to the best of my
knowledge but cannot be guaranteed. It is essential to note that, as with
any advice: "one test result is worth one-thousand expert opinions"
(Wernher von Braun <https://en.wikipedia.org/wiki/Wernher_von_Braun>).


On Fri, 26 Apr 2024 at 07:13, L. C. Hsieh <vii...@gmail.com> wrote:

> +1
>
> On Thu, Apr 25, 2024 at 8:16 PM Yuming Wang <yumw...@apache.org> wrote:
>
>> +1
>>
>> On Fri, Apr 26, 2024 at 8:25 AM Nimrod Ofek <ofek.nim...@gmail.com>
>> wrote:
>>
>>> Of course, I can't think of a scenario with thousands of tables on a
>>> single in-memory Spark cluster with an in-memory catalog.
>>> Thanks for the help!
>>>
>>> On Thu, 25 Apr 2024 at 23:56, Mich Talebzadeh <
>>> mich.talebza...@gmail.com> wrote:
>>>
>>>>
>>>>
>>>> Agreed. In scenarios where most of the interactions with the catalog
>>>> relate to query planning, saving, and metadata management, the choice
>>>> of catalog implementation may have less impact on query runtime
>>>> performance. This is because the time spent on metadata operations is
>>>> generally minimal compared to the time spent on actual data fetching,
>>>> processing, and computation.
>>>>
>>>> However, scalability and reliability become real concerns as the size
>>>> and complexity of the data and query workload grow. While an in-memory
>>>> catalog may offer excellent performance for smaller workloads, it will
>>>> face limitations in handling larger-scale deployments with thousands
>>>> of tables, partitions, and users. Additionally, durability and
>>>> persistence are crucial considerations, particularly in production
>>>> environments where data integrity and availability matter. In-memory
>>>> catalog implementations may lack durability, meaning that metadata
>>>> changes could be lost in the event of a system failure or restart.
>>>> Therefore, while in-memory catalog implementations can provide speed
>>>> and efficiency for certain use cases, we ought to weigh the
>>>> requirements for scalability, reliability, and data durability when
>>>> choosing a catalog solution for production deployments. In many cases,
>>>> a combination of in-memory and disk-based catalog solutions may offer
>>>> the best balance of performance and resilience for demanding
>>>> large-scale workloads.
>>>>
>>>>
>>>> HTH
>>>>
>>>>
>>>> On Thu, 25 Apr 2024 at 16:32, Nimrod Ofek <ofek.nim...@gmail.com>
>>>> wrote:
>>>>
>>>>> Of course, but it's in-memory and not persisted, which is much
>>>>> faster. And as I said, I believe that most of the interaction with it
>>>>> happens during planning and save rather than during the actual query
>>>>> run, and those operations are short and minimal compared to data
>>>>> fetching and manipulation, so I don't believe it will have a big
>>>>> impact on query runtime...
>>>>>
>>>>> On Thu, 25 Apr 2024 at 17:52, Mich Talebzadeh <
>>>>> mich.talebza...@gmail.com> wrote:
>>>>>
>>>>>> Well, I would be surprised, because the embedded Derby database is
>>>>>> effectively single-user and won't be of much use here.
>>>>>>
>>>>>> Most Hive metastores in the commercial world use PostgreSQL or
>>>>>> Oracle as the backing database, since those are battle-proven,
>>>>>> replicated, and backed up.
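>>>>>>
>>>>>> For instance (a sketch under assumptions: the host, database name,
>>>>>> and credentials below are hypothetical), the metastore can be
>>>>>> pointed at PostgreSQL through the standard Hive JDO properties,
>>>>>> passed via Spark's spark.hadoop.* prefix:
>>>>>>
>>>>>>   import org.apache.spark.sql.SparkSession
>>>>>>
>>>>>>   val spark = SparkSession.builder()
>>>>>>     .enableHiveSupport()
>>>>>>     // Back the Hive metastore with PostgreSQL instead of Derby
>>>>>>     .config("spark.hadoop.javax.jdo.option.ConnectionURL",
>>>>>>             "jdbc:postgresql://metastore-host:5432/metastore")
>>>>>>     .config("spark.hadoop.javax.jdo.option.ConnectionDriverName",
>>>>>>             "org.postgresql.Driver")
>>>>>>     .config("spark.hadoop.javax.jdo.option.ConnectionUserName", "hive")
>>>>>>     .config("spark.hadoop.javax.jdo.option.ConnectionPassword", "...")
>>>>>>     .getOrCreate()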
>>>>>>
>>>>>>
>>>>>> On Thu, 25 Apr 2024 at 15:39, Nimrod Ofek <ofek.nim...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Yes, an in-memory Hive catalog backed by a local Derby DB.
>>>>>>> And again, I presume that most metadata-related work happens during
>>>>>>> planning and not during the actual run, so I don't see why it
>>>>>>> should strongly affect query performance.
>>>>>>>
>>>>>>> Thanks,
>>>>>>>
>>>>>>>
>>>>>>> On Thu, 25 Apr 2024 at 17:29, Mich Talebzadeh <
>>>>>>> mich.talebza...@gmail.com> wrote:
>>>>>>>
>>>>>>>> With regard to your point below
>>>>>>>>
>>>>>>>> "The thing I'm missing is this: let's say that the output format I
>>>>>>>> choose is delta lake or iceberg or whatever format that uses parquet. 
>>>>>>>> Where
>>>>>>>> does the catalog implementation (which holds metadata afaik, same 
>>>>>>>> metadata
>>>>>>>> that iceberg and delta lake save for their tables about their columns)
>>>>>>>> comes into play and why should it affect performance? "
>>>>>>>>
>>>>>>>> The catalog implementation comes into play regardless of the
>>>>>>>> output format chosen (Delta Lake, Iceberg, Parquet, etc.), because
>>>>>>>> it is responsible for managing metadata about the datasets,
>>>>>>>> tables, schemas, and other objects stored in those formats. Even
>>>>>>>> though Delta Lake and Iceberg have their own internal metadata
>>>>>>>> management mechanisms, they still rely on the catalog to provide a
>>>>>>>> unified interface for accessing and manipulating metadata across
>>>>>>>> different storage formats.
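>>>>>>>>
>>>>>>>> As an example (a sketch following Iceberg's documented Spark
>>>>>>>> integration; the catalog name and warehouse path are made up), a
>>>>>>>> table-format catalog is plugged in via spark.sql.catalog.<name>:
>>>>>>>>
>>>>>>>>   import org.apache.spark.sql.SparkSession
>>>>>>>>
>>>>>>>>   // Requires the iceberg-spark-runtime jar on the classpath
>>>>>>>>   val spark = SparkSession.builder()
>>>>>>>>     // Register an Iceberg catalog named "demo"; Spark then routes
>>>>>>>>     // metadata calls for demo.* tables through this plugin
>>>>>>>>     .config("spark.sql.catalog.demo",
>>>>>>>>             "org.apache.iceberg.spark.SparkCatalog")
>>>>>>>>     .config("spark.sql.catalog.demo.type", "hadoop")
>>>>>>>>     .config("spark.sql.catalog.demo.warehouse", "/tmp/warehouse")
>>>>>>>>     .getOrCreate()
>>>>>>>>
>>>>>>>>   spark.sql("CREATE TABLE demo.db.events (id BIGINT) USING iceberg")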
>>>>>>>>
>>>>>>>> "Another thing is that if I understand correctly, and I might be
>>>>>>>> totally wrong here, the internal spark catalog is a local installation 
>>>>>>>> of
>>>>>>>> hive metastore anyway, so I'm not sure what the catalog has to do with
>>>>>>>> anything"
>>>>>>>>
>>>>>>>> .I don't understand this. Do you mean a Derby database?
>>>>>>>>
>>>>>>>> HTH
>>>>>>>>
>>>>>>>>
>>>>>>>> On Thu, 25 Apr 2024 at 14:38, Nimrod Ofek <ofek.nim...@gmail.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Thanks for the detailed answer.
>>>>>>>>> The thing I'm missing is this: let's say that the output format I
>>>>>>>>> choose is Delta Lake or Iceberg or whatever format that uses
>>>>>>>>> Parquet. Where does the catalog implementation (which holds
>>>>>>>>> metadata, AFAIK the same metadata that Iceberg and Delta Lake save
>>>>>>>>> for their tables about their columns) come into play, and why
>>>>>>>>> should it affect performance?
>>>>>>>>> Another thing is that, if I understand correctly, and I might be
>>>>>>>>> totally wrong here, the internal Spark catalog is a local
>>>>>>>>> installation of a Hive metastore anyway, so I'm not sure what the
>>>>>>>>> catalog has to do with anything.
>>>>>>>>>
>>>>>>>>> Thanks!
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Thu, 25 Apr 2024 at 16:14, Mich Talebzadeh <
>>>>>>>>> mich.talebza...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> My take regarding your question is that your mileage varies, so
>>>>>>>>>> to speak.
>>>>>>>>>>
>>>>>>>>>> 1) Hive provides a more mature and widely adopted catalog
>>>>>>>>>> solution that integrates well with other components in the Hadoop
>>>>>>>>>> ecosystem, such as HDFS, HBase, and YARN. If you are
>>>>>>>>>> Hadoop-centric (say, on-premise), using Hive may offer better
>>>>>>>>>> compatibility and interoperability.
>>>>>>>>>> 2) Hive provides a SQL-like interface that is familiar to users
>>>>>>>>>> who are accustomed to traditional RDBMSs. If your use case
>>>>>>>>>> involves complex SQL queries or existing SQL-based workflows,
>>>>>>>>>> using Hive may be advantageous.
>>>>>>>>>> 3) If you are looking for performance, Spark's native catalog
>>>>>>>>>> tends to offer better performance for certain workloads,
>>>>>>>>>> particularly those that involve iterative processing or complex
>>>>>>>>>> data transformations (my understanding). Spark's in-memory
>>>>>>>>>> processing capabilities and optimizations make it well suited for
>>>>>>>>>> interactive analytics and machine learning tasks (my favourite).
>>>>>>>>>> 4) Integration with Spark workflows: if you primarily use Spark
>>>>>>>>>> for data processing and analytics, using Spark's native catalog
>>>>>>>>>> may simplify workflow management and reduce overhead. Spark's
>>>>>>>>>> tight integration with its catalog allows for seamless
>>>>>>>>>> interaction with Spark applications and libraries.
>>>>>>>>>> 5) There seems to be some similarity between the Spark catalog
>>>>>>>>>> and Databricks Unity Catalog, so that may favour the choice.
>>>>>>>>>>
>>>>>>>>>> A quick way to see the two table flavours side by side is
>>>>>>>>>> sketched below.
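>>>>>>>>>>
>>>>>>>>>> (Illustration only; table names are made up, and this assumes a
>>>>>>>>>> Hive-enabled SparkSession.)
>>>>>>>>>>
>>>>>>>>>>   // Hive-format table, managed through the Hive metastore SerDes
>>>>>>>>>>   spark.sql("CREATE TABLE hive_tbl (id INT, name STRING) STORED AS ORC")
>>>>>>>>>>
>>>>>>>>>>   // Spark native datasource table
>>>>>>>>>>   spark.sql("CREATE TABLE native_tbl (id INT, name STRING) USING parquet")
>>>>>>>>>>
>>>>>>>>>>   // "Provider" in the output shows which flavour each table is
>>>>>>>>>>   spark.sql("DESCRIBE TABLE EXTENDED native_tbl").show(truncate = false)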
>>>>>>>>>>
>>>>>>>>>> HTH
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Thu, 25 Apr 2024 at 12:30, Nimrod Ofek <ofek.nim...@gmail.com>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>> I would also appreciate some material that describes the
>>>>>>>>>>> differences between Spark native tables and Hive tables, and
>>>>>>>>>>> when each should be used...
>>>>>>>>>>>
>>>>>>>>>>> Thanks
>>>>>>>>>>> Nimrod
>>>>>>>>>>>
>>>>>>>>>>> On Thu, 25 Apr 2024 at 14:27, Mich Talebzadeh <
>>>>>>>>>>> mich.talebza...@gmail.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> I see a statement made below, and I quote:
>>>>>>>>>>>>
>>>>>>>>>>>> "The proposal of SPARK-46122 is to switch the default value of
>>>>>>>>>>>> this configuration from `true` to `false` to use Spark native
>>>>>>>>>>>> tables because we support better."
>>>>>>>>>>>>
>>>>>>>>>>>> Can you please elaborate on the above, specifically with regard
>>>>>>>>>>>> to the phrase "... because we support better"?
>>>>>>>>>>>>
>>>>>>>>>>>> Are you referring to the performance of the Spark catalog (I
>>>>>>>>>>>> believe it is internal) or to its integration with Spark?
>>>>>>>>>>>>
>>>>>>>>>>>> HTH
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On Thu, 25 Apr 2024 at 11:17, Wenchen Fan <cloud0...@gmail.com>
>>>>>>>>>>>> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> +1
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Thu, Apr 25, 2024 at 2:46 PM Kent Yao <y...@apache.org>
>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> +1
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Nit: the umbrella ticket is SPARK-44111, not SPARK-44444.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>> Kent Yao
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Thu, 25 Apr 2024 at 14:39, Dongjoon Hyun
>>>>>>>>>>>>>> <dongjoon.h...@gmail.com> wrote:
>>>>>>>>>>>>>> >
>>>>>>>>>>>>>> > Hi, All.
>>>>>>>>>>>>>> >
>>>>>>>>>>>>>> > It's great to see community activities to polish 4.0.0 more
>>>>>>>>>>>>>> and more.
>>>>>>>>>>>>>> > Thank you all.
>>>>>>>>>>>>>> >
>>>>>>>>>>>>>> > I'd like to bring SPARK-46122 (another SQL topic) to you
>>>>>>>>>>>>>> from the subtasks
>>>>>>>>>>>>>> > of SPARK-44444 (Prepare Apache Spark 4.0.0),
>>>>>>>>>>>>>> >
>>>>>>>>>>>>>> > - https://issues.apache.org/jira/browse/SPARK-46122
>>>>>>>>>>>>>> >    Set `spark.sql.legacy.createHiveTableByDefault` to
>>>>>>>>>>>>>> `false` by default
>>>>>>>>>>>>>> >
>>>>>>>>>>>>>> > This legacy configuration is about the `CREATE TABLE` SQL
>>>>>>>>>>>>>> > syntax without `USING` and `STORED AS`, which is currently
>>>>>>>>>>>>>> > mapped to a `Hive` table.
>>>>>>>>>>>>>> > The proposal of SPARK-46122 is to switch the default value
>>>>>>>>>>>>>> of this
>>>>>>>>>>>>>> > configuration from `true` to `false` to use Spark native
>>>>>>>>>>>>>> tables because
>>>>>>>>>>>>>> > we support better.
>>>>>>>>>>>>>> >
>>>>>>>>>>>>>> > In other words, Spark will use the value of
>>>>>>>>>>>>>> `spark.sql.sources.default`
>>>>>>>>>>>>>> > as the table provider instead of `Hive` like the other
>>>>>>>>>>>>>> Spark APIs. Of course,
>>>>>>>>>>>>>> > the users can get all the legacy behavior by setting back
>>>>>>>>>>>>>> to `true`.
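>>>>>>>>>>>>>> >
>>>>>>>>>>>>>> > For illustration (a minimal sketch; table names are made
>>>>>>>>>>>>>> > up, assuming the config is adjustable per session):
>>>>>>>>>>>>>> >
>>>>>>>>>>>>>> >   // Legacy default (true): a plain CREATE TABLE becomes a
>>>>>>>>>>>>>> >   // Hive text table
>>>>>>>>>>>>>> >   spark.conf.set("spark.sql.legacy.createHiveTableByDefault", true)
>>>>>>>>>>>>>> >   spark.sql("CREATE TABLE t1 (id INT)")   // provider: hive
>>>>>>>>>>>>>> >
>>>>>>>>>>>>>> >   // Proposed default (false): the same statement picks up
>>>>>>>>>>>>>> >   // spark.sql.sources.default (parquet unless overridden)
>>>>>>>>>>>>>> >   spark.conf.set("spark.sql.legacy.createHiveTableByDefault", false)
>>>>>>>>>>>>>> >   spark.sql("CREATE TABLE t2 (id INT)")   // provider: parquet
>>>>>>>>>>>>>> >
>>>>>>>>>>>>>> >   // Explicit syntax is unaffected either way:
>>>>>>>>>>>>>> >   spark.sql("CREATE TABLE t3 (id INT) USING parquet")
>>>>>>>>>>>>>> >   spark.sql("CREATE TABLE t4 (id INT) STORED AS ORC")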
>>>>>>>>>>>>>> >
>>>>>>>>>>>>>> > Historically, this behavior change was merged once during
>>>>>>>>>>>>>> > the Apache Spark 3.0.0 preparation via SPARK-30098, but it
>>>>>>>>>>>>>> > was reverted during the 3.0.0 RC period.
>>>>>>>>>>>>>> >
>>>>>>>>>>>>>> > 2019-12-06: SPARK-30098 Use default datasource as provider
>>>>>>>>>>>>>> for CREATE TABLE
>>>>>>>>>>>>>> > 2020-05-16: SPARK-31707 Revert SPARK-30098 Use default
>>>>>>>>>>>>>> datasource as
>>>>>>>>>>>>>> >             provider for CREATE TABLE command
>>>>>>>>>>>>>> >
>>>>>>>>>>>>>> > At Apache Spark 3.1.0, we had another discussion about this
>>>>>>>>>>>>>> > and defined it as a legacy behavior behind this
>>>>>>>>>>>>>> > configuration, reusing the ID SPARK-30098.
>>>>>>>>>>>>>> >
>>>>>>>>>>>>>> > 2020-12-01:
>>>>>>>>>>>>>> https://lists.apache.org/thread/8c8k1jk61pzlcosz3mxo4rkj5l23r204
>>>>>>>>>>>>>> > 2020-12-03: SPARK-30098 Add a configuration to use default
>>>>>>>>>>>>>> datasource as
>>>>>>>>>>>>>> >             provider for CREATE TABLE command
>>>>>>>>>>>>>> >
>>>>>>>>>>>>>> > Last year, we received two additional requests to switch
>>>>>>>>>>>>>> > this, because Apache Spark 4.0.0 is a good time to make a
>>>>>>>>>>>>>> > decision for the future direction.
>>>>>>>>>>>>>> >
>>>>>>>>>>>>>> > 2023-02-27: SPARK-42603 as an independent idea.
>>>>>>>>>>>>>> > 2023-11-27: SPARK-46122 as a part of Apache Spark 4.0.0 idea
>>>>>>>>>>>>>> >
>>>>>>>>>>>>>> >
>>>>>>>>>>>>>> > WDYT? The technical scope is defined in the following PR,
>>>>>>>>>>>>>> > which is one line of main code, one line of migration
>>>>>>>>>>>>>> > guide, and a few lines of test code.
>>>>>>>>>>>>>> >
>>>>>>>>>>>>>> > - https://github.com/apache/spark/pull/46207
>>>>>>>>>>>>>> >
>>>>>>>>>>>>>> > Dongjoon.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> ---------------------------------------------------------------------
>>>>>>>>>>>>>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
