Mich, it is a legacy config that we should get rid of in the end, and it has
been tested in production for a very long time. Spark should create a Spark
table by default.

On Tue, Apr 30, 2024 at 5:38 AM Mich Talebzadeh <mich.talebza...@gmail.com>
wrote:

> Your point
>
> ".. t's a surprise to me to see that someone has different positions in a
> very short period of time in the community...."
>
> Well, I have been with Spark since 2015, and this is an article I
> published on February 7, 2016 regarding both Hive and Spark, which I
> also presented at a Hortonworks meet-up:
>
> Hive on Spark Engine Versus Spark Using Hive Metastore
> <https://www.linkedin.com/pulse/hive-spark-engine-versus-using-metastore-mich-talebzadeh-ph-d-/>
>
> With regard to why I cast a +1 vote for one and a -1 for the other, I
> think it is my prerogative how I vote, and we can leave it at that.
>
> Mich Talebzadeh,
> Technologist | Architect | Data Engineer  | Generative AI | FinCrime
> London
> United Kingdom
>
>
>    view my Linkedin profile
> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>
>
>  https://en.everybodywiki.com/Mich_Talebzadeh
>
>
>
> *Disclaimer:* The information provided is correct to the best of my
> knowledge but of course cannot be guaranteed. It is essential to note
> that, as with any advice, "one test result is worth one-thousand
> expert opinions" (Wernher von Braun
> <https://en.wikipedia.org/wiki/Wernher_von_Braun>).
>
>
> On Mon, 29 Apr 2024 at 17:32, Dongjoon Hyun <dongjoon.h...@gmail.com>
> wrote:
>
>> It's a surprise to me to see that someone has different positions
>> in a very short period of time in the community.
>>
>> Mich cast +1 for SPARK-44444 and -1 for SPARK-46122.
>> - https://lists.apache.org/thread/4cbkpvc3vr3b6k0wp6lgsw37spdpnqrc
>> - https://lists.apache.org/thread/x09gynt90v3hh5sql1gt9dlcn6m6699p
>>
>> To Mich, what I'm specifically interested in is the following:
>> > 2. Compatibility: Changing the default behavior could potentially
>> >    break existing workflows or pipelines that rely on the current
>> >    behavior.
>>
>> May I ask you the following questions?
>> A. What is the purpose of the migration guide in the ASF projects?
>>
>> B. Do you claim that there is an incompatibility even when you set
>>      spark.sql.legacy.createHiveTableByDefault=true, as described
>>      in the migration guide?
>>
>> C. Do you know that ANSI SQL mode introduces new RUNTIME exceptions,
>>      which are a harder behavior change than SPARK-46122?
>>
>> D. Or, did you cast +1 for SPARK-44444 because
>>      you think there is no breaking change by default?
>>
>> I guess there is some misunderstanding about the proposal.
>>
>> Thanks,
>> Dongjoon.
>>
>>
>> On Fri, Apr 26, 2024 at 12:05 PM Mich Talebzadeh <
>> mich.talebza...@gmail.com> wrote:
>>
>>> Hi,
>>>
>>> I would like to add a side note regarding the discussion process and the
>>> current title of the proposal. The title '[DISCUSS] SPARK-46122: Set
>>> spark.sql.legacy.createHiveTableByDefault to false' focuses on a specific
>>> configuration parameter, which might lead some participants to overlook
>>> its broader implications (as raised by myself and others). I believe that
>>> a more descriptive title, encompassing the broader discussion on default
>>> behaviours for creating Hive tables in Spark SQL, could enable greater
>>> engagement within the community. This is an important topic that deserves
>>> thorough consideration.
>>>
>>> HTH
>>>
>>> On Fri, 26 Apr 2024 at 07:13, L. C. Hsieh <vii...@gmail.com> wrote:
>>>
>>>> +1
>>>>
>>>> On Thu, Apr 25, 2024 at 8:16 PM Yuming Wang <yumw...@apache.org> wrote:
>>>>
>>>>> +1
>>>>>
>>>>> On Fri, Apr 26, 2024 at 8:25 AM Nimrod Ofek <ofek.nim...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Of course; I can't think of a scenario with thousands of tables on a
>>>>>> single in-memory Spark cluster with an in-memory catalog.
>>>>>> Thanks for the help!
>>>>>>
>>>>>> On Thu, 25 Apr 2024 at 23:56, Mich Talebzadeh <
>>>>>> mich.talebza...@gmail.com> wrote:
>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> Agreed. In scenarios where most of the interactions with the catalog
>>>>>>> relate to query planning, saving, and metadata management, the choice
>>>>>>> of catalog implementation may have less impact on query runtime
>>>>>>> performance, because the time spent on metadata operations is
>>>>>>> generally minimal compared to the time spent on actual data fetching,
>>>>>>> processing, and computation.
>>>>>>> However, scalability and reliability become real concerns as the size
>>>>>>> and complexity of the data and query workload grow. While an in-memory
>>>>>>> catalog may offer excellent performance for smaller workloads, it will
>>>>>>> face limitations in handling larger-scale deployments with thousands
>>>>>>> of tables, partitions, and users. Durability and persistence are also
>>>>>>> crucial considerations, particularly in production environments where
>>>>>>> data integrity and availability matter. In-memory catalog
>>>>>>> implementations may lack durability, meaning that metadata changes
>>>>>>> could be lost in the event of a system failure or restart. Therefore,
>>>>>>> while an in-memory catalog can provide speed and efficiency for
>>>>>>> certain use cases, we ought to weigh the requirements for scalability,
>>>>>>> reliability, and data durability when choosing a catalog solution for
>>>>>>> production deployments. In many cases, a combination of in-memory and
>>>>>>> disk-based catalog solutions may offer the best balance of performance
>>>>>>> and resilience for demanding large-scale workloads.
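>>>>>>>
>>>>>>> For illustration, a minimal sketch of how the two catalog modes are
>>>>>>> selected when building a session. This assumes that
>>>>>>> spark.sql.catalogImplementation accepts "in-memory" or "hive"; the
>>>>>>> app names are arbitrary, and only one of the two sessions would be
>>>>>>> created per JVM, since this is a static setting:
>>>>>>>
>>>>>>>   import org.apache.spark.sql.SparkSession
>>>>>>>
>>>>>>>   // Ephemeral catalog: metadata lives in memory and is lost on restart.
>>>>>>>   val inMemorySession = SparkSession.builder()
>>>>>>>     .appName("in-memory-catalog-demo")
>>>>>>>     .config("spark.sql.catalogImplementation", "in-memory")
>>>>>>>     .getOrCreate()
>>>>>>>
>>>>>>>   // Durable catalog: metadata persists in a Hive metastore across
>>>>>>>   // restarts (this is what enableHiveSupport switches on).
>>>>>>>   val hiveSession = SparkSession.builder()
>>>>>>>     .appName("hive-catalog-demo")
>>>>>>>     .enableHiveSupport()
>>>>>>>     .getOrCreate()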
>>>>>>>
>>>>>>>
>>>>>>> HTH
>>>>>>>
>>>>>>> On Thu, 25 Apr 2024 at 16:32, Nimrod Ofek <ofek.nim...@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Of course, but it's in memory and not persisted, which is much
>>>>>>>> faster. And as I said, I believe most of the interaction with the
>>>>>>>> catalog happens during planning and save operations rather than the
>>>>>>>> actual query run, and those interactions are short and minimal
>>>>>>>> compared to data fetching and manipulation, so I don't believe it
>>>>>>>> will have a big impact on query runtime...
>>>>>>>>
>>>>>>>> On Thu, 25 Apr 2024 at 17:52, Mich Talebzadeh <
>>>>>>>> mich.talebza...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> Well, I would be surprised, because the embedded Derby database
>>>>>>>>> allows only a single connection and won't be of much use here.
>>>>>>>>>
>>>>>>>>> Most Hive metastores in the commercial world use PostgreSQL or
>>>>>>>>> Oracle as the backing database, since those are battle-proven,
>>>>>>>>> replicated, and backed up.
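>>>>>>>>>
>>>>>>>>> As an illustrative sketch, this is roughly how Spark is pointed at
>>>>>>>>> a remote production metastore instead of the embedded Derby one.
>>>>>>>>> The thrift URI and warehouse path are placeholders;
>>>>>>>>> hive.metastore.uris is the standard Hive property for a remote
>>>>>>>>> metastore service:
>>>>>>>>>
>>>>>>>>>   import org.apache.spark.sql.SparkSession
>>>>>>>>>
>>>>>>>>>   val spark = SparkSession.builder()
>>>>>>>>>     .appName("remote-metastore-demo")
>>>>>>>>>     // Talk to a metastore service backed by PostgreSQL/Oracle
>>>>>>>>>     // rather than the embedded, single-connection Derby database.
>>>>>>>>>     .config("hive.metastore.uris", "thrift://metastore-host:9083")
>>>>>>>>>     .config("spark.sql.warehouse.dir", "/path/to/warehouse")
>>>>>>>>>     .enableHiveSupport()
>>>>>>>>>     .getOrCreate()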
>>>>>>>>>
>>>>>>>>> On Thu, 25 Apr 2024 at 15:39, Nimrod Ofek <ofek.nim...@gmail.com>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> Yes, the in-memory Hive catalog backed by a local Derby DB.
>>>>>>>>>> And again, I presume that most metadata-related work happens
>>>>>>>>>> during planning rather than the actual run, so I don't see why it
>>>>>>>>>> should strongly affect query performance.
>>>>>>>>>>
>>>>>>>>>> Thanks,
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Thu, 25 Apr 2024 at 17:29, Mich Talebzadeh <
>>>>>>>>>> mich.talebza...@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> With regard to your point below:
>>>>>>>>>>>
>>>>>>>>>>> "The thing I'm missing is this: let's say that the output format
>>>>>>>>>>> I choose is Delta Lake or Iceberg or whatever format that uses
>>>>>>>>>>> Parquet. Where does the catalog implementation (which holds
>>>>>>>>>>> metadata, afaik the same metadata that Iceberg and Delta Lake
>>>>>>>>>>> save for their tables about their columns) come into play, and
>>>>>>>>>>> why should it affect performance?"
>>>>>>>>>>>
>>>>>>>>>>> The catalog implementation comes into play regardless of the
>>>>>>>>>>> output format chosen (Delta Lake, Iceberg, Parquet, etc.) because
>>>>>>>>>>> it is responsible for managing metadata about the datasets,
>>>>>>>>>>> tables, schemas, and other objects stored in the aforementioned
>>>>>>>>>>> formats. Even though Delta Lake and Iceberg have their own
>>>>>>>>>>> internal metadata management mechanisms, they still rely on the
>>>>>>>>>>> catalog to provide a unified interface for accessing and
>>>>>>>>>>> manipulating metadata across different storage formats.
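>>>>>>>>>>>
>>>>>>>>>>> To make this concrete, here is a hedged sketch of how a table
>>>>>>>>>>> format still plugs into Spark through a catalog. The catalog name
>>>>>>>>>>> "my_catalog" and the metastore URI are placeholders, the class and
>>>>>>>>>>> property names follow the Iceberg documentation, and this assumes
>>>>>>>>>>> the Iceberg runtime jar is on the classpath:
>>>>>>>>>>>
>>>>>>>>>>>   import org.apache.spark.sql.SparkSession
>>>>>>>>>>>
>>>>>>>>>>>   val spark = SparkSession.builder()
>>>>>>>>>>>     .appName("iceberg-catalog-demo")
>>>>>>>>>>>     // Register an Iceberg catalog under the name "my_catalog",
>>>>>>>>>>>     // backed by a Hive metastore for table-level metadata.
>>>>>>>>>>>     .config("spark.sql.catalog.my_catalog",
>>>>>>>>>>>             "org.apache.iceberg.spark.SparkCatalog")
>>>>>>>>>>>     .config("spark.sql.catalog.my_catalog.type", "hive")
>>>>>>>>>>>     .config("spark.sql.catalog.my_catalog.uri",
>>>>>>>>>>>             "thrift://metastore-host:9083")
>>>>>>>>>>>     .getOrCreate()
>>>>>>>>>>>
>>>>>>>>>>>   // Iceberg manages the table's file-level metadata itself, but
>>>>>>>>>>>   // Spark still resolves the table name through the catalog.
>>>>>>>>>>>   spark.sql(
>>>>>>>>>>>     "CREATE TABLE my_catalog.db.events (id BIGINT, ts TIMESTAMP) " +
>>>>>>>>>>>     "USING iceberg")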
>>>>>>>>>>>
>>>>>>>>>>> "Another thing is that if I understand correctly, and I might be
>>>>>>>>>>> totally wrong here, the internal spark catalog is a local 
>>>>>>>>>>> installation of
>>>>>>>>>>> hive metastore anyway, so I'm not sure what the catalog has to do 
>>>>>>>>>>> with
>>>>>>>>>>> anything"
>>>>>>>>>>>
>>>>>>>>>>> .I don't understand this. Do you mean a Derby database?
>>>>>>>>>>>
>>>>>>>>>>> HTH
>>>>>>>>>>>
>>>>>>>>>>> On Thu, 25 Apr 2024 at 14:38, Nimrod Ofek <ofek.nim...@gmail.com>
>>>>>>>>>>> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Thanks for the detailed answer.
>>>>>>>>>>>> The thing I'm missing is this: let's say that the output format
>>>>>>>>>>>> I choose is Delta Lake or Iceberg or whatever format that uses
>>>>>>>>>>>> Parquet. Where does the catalog implementation (which holds
>>>>>>>>>>>> metadata, afaik the same metadata that Iceberg and Delta Lake
>>>>>>>>>>>> save for their tables about their columns) come into play, and
>>>>>>>>>>>> why should it affect performance?
>>>>>>>>>>>> Another thing is that, if I understand correctly (and I might be
>>>>>>>>>>>> totally wrong here), the internal Spark catalog is a local
>>>>>>>>>>>> installation of the Hive metastore anyway, so I'm not sure what
>>>>>>>>>>>> the catalog has to do with anything.
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks!
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On Thu, 25 Apr 2024 at 16:14, Mich Talebzadeh <
>>>>>>>>>>>> mich.talebza...@gmail.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> My take regarding your question is that your mileage varies, so
>>>>>>>>>>>>> to speak.
>>>>>>>>>>>>>
>>>>>>>>>>>>> 1) Hive provides a more mature and widely adopted catalog
>>>>>>>>>>>>> solution that integrates well with other components in the
>>>>>>>>>>>>> Hadoop ecosystem, such as HDFS, HBase, and YARN. If you are
>>>>>>>>>>>>> Hadoop-centric (say, on-premise), using Hive may offer better
>>>>>>>>>>>>> compatibility and interoperability.
>>>>>>>>>>>>> 2) Hive provides a SQL-like interface that is familiar to users
>>>>>>>>>>>>> accustomed to traditional RDBMSs. If your use case involves
>>>>>>>>>>>>> complex SQL queries or existing SQL-based workflows, using Hive
>>>>>>>>>>>>> may be advantageous.
>>>>>>>>>>>>> 3) If you are looking for performance, Spark's native catalog
>>>>>>>>>>>>> tends to offer better performance for certain workloads,
>>>>>>>>>>>>> particularly those that involve iterative processing or complex
>>>>>>>>>>>>> data transformations (my understanding). Spark's in-memory
>>>>>>>>>>>>> processing capabilities and optimizations make it well suited
>>>>>>>>>>>>> for interactive analytics and machine learning tasks (my
>>>>>>>>>>>>> favourite).
>>>>>>>>>>>>> 4) Integration with Spark workflows: if you primarily use Spark
>>>>>>>>>>>>> for data processing and analytics, using Spark's native catalog
>>>>>>>>>>>>> may simplify workflow management and reduce overhead. Spark's
>>>>>>>>>>>>> tight integration with its catalog allows for seamless
>>>>>>>>>>>>> interaction with Spark applications and libraries; see the
>>>>>>>>>>>>> sketch after this list.
>>>>>>>>>>>>> 5) There seems to be some similarity between the Spark catalog
>>>>>>>>>>>>> and the Databricks Unity Catalog, so that may favour the choice.
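>>>>>>>>>>>>>
>>>>>>>>>>>>> As a minimal illustrative sketch of that integration, using the
>>>>>>>>>>>>> public spark.catalog API (the database and table names here are
>>>>>>>>>>>>> made up for the example):
>>>>>>>>>>>>>
>>>>>>>>>>>>>   // Inspect metadata directly from a Spark application.
>>>>>>>>>>>>>   spark.catalog.listDatabases().show()
>>>>>>>>>>>>>   spark.catalog.listTables("default").show()
>>>>>>>>>>>>>
>>>>>>>>>>>>>   // Check whether a table exists before acting on it.
>>>>>>>>>>>>>   if (spark.catalog.tableExists("default", "events")) {
>>>>>>>>>>>>>     spark.catalog.listColumns("default", "events").show()
>>>>>>>>>>>>>   }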
>>>>>>>>>>>>>
>>>>>>>>>>>>> HTH
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Thu, 25 Apr 2024 at 12:30, Nimrod Ofek <
>>>>>>>>>>>>> ofek.nim...@gmail.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> I would also appreciate some material that describes the
>>>>>>>>>>>>>> differences between Spark native tables and Hive tables, and
>>>>>>>>>>>>>> when each should be used...
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thanks
>>>>>>>>>>>>>> Nimrod
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Thu, 25 Apr 2024 at 14:27, Mich Talebzadeh <
>>>>>>>>>>>>>> mich.talebza...@gmail.com> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I see a statement made as below, and I quote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> "The proposal of SPARK-46122 is to switch the default value
>>>>>>>>>>>>>>> of this configuration from `true` to `false` to use Spark
>>>>>>>>>>>>>>> native tables because we support them better."
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Can you please elaborate on the above, specifically on the
>>>>>>>>>>>>>>> phrase "... because we support them better"?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Are you referring to the performance of the Spark catalog (I
>>>>>>>>>>>>>>> believe it is internal) or to its integration with Spark?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> HTH
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Thu, 25 Apr 2024 at 11:17, Wenchen Fan <
>>>>>>>>>>>>>>> cloud0...@gmail.com> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> +1
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Thu, Apr 25, 2024 at 2:46 PM Kent Yao <y...@apache.org>
>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> +1
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Nit: the umbrella ticket is SPARK-44111, not SPARK-44444.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>> Kent Yao
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On Thu, 25 Apr 2024 at 14:39, Dongjoon Hyun <
>>>>>>>>>>>>>>>>> dongjoon.h...@gmail.com> wrote:
>>>>>>>>>>>>>>>>> >
>>>>>>>>>>>>>>>>> > Hi, All.
>>>>>>>>>>>>>>>>> >
>>>>>>>>>>>>>>>>> > It's great to see community activities to polish 4.0.0
>>>>>>>>>>>>>>>>> > more and more. Thank you all.
>>>>>>>>>>>>>>>>> >
>>>>>>>>>>>>>>>>> > I'd like to bring SPARK-46122 (another SQL topic) to you
>>>>>>>>>>>>>>>>> > from the subtasks of SPARK-44444 (Prepare Apache Spark
>>>>>>>>>>>>>>>>> > 4.0.0):
>>>>>>>>>>>>>>>>> >
>>>>>>>>>>>>>>>>> > - https://issues.apache.org/jira/browse/SPARK-46122
>>>>>>>>>>>>>>>>> >   Set `spark.sql.legacy.createHiveTableByDefault` to
>>>>>>>>>>>>>>>>> >   `false` by default
>>>>>>>>>>>>>>>>> >
>>>>>>>>>>>>>>>>> > This legacy configuration is about the `CREATE TABLE` SQL
>>>>>>>>>>>>>>>>> > syntax without `USING` and `STORED AS`, which is currently
>>>>>>>>>>>>>>>>> > mapped to a `Hive` table.
>>>>>>>>>>>>>>>>> > The proposal of SPARK-46122 is to switch the default value
>>>>>>>>>>>>>>>>> > of this configuration from `true` to `false` to use Spark
>>>>>>>>>>>>>>>>> > native tables, because we support them better.
>>>>>>>>>>>>>>>>> >
>>>>>>>>>>>>>>>>> > In other words, Spark will use the value of
>>>>>>>>>>>>>>>>> > `spark.sql.sources.default` as the table provider instead
>>>>>>>>>>>>>>>>> > of `Hive`, like the other Spark APIs. Of course, users can
>>>>>>>>>>>>>>>>> > get all of the legacy behavior back by setting it to
>>>>>>>>>>>>>>>>> > `true`.
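>>>>>>>>>>>>>>>>> >
>>>>>>>>>>>>>>>>> > For illustration, a minimal sketch of the behavior change,
>>>>>>>>>>>>>>>>> > assuming the config is runtime-settable and that
>>>>>>>>>>>>>>>>> > `spark.sql.sources.default` is left at its default of
>>>>>>>>>>>>>>>>> > `parquet` (the table names are arbitrary):
>>>>>>>>>>>>>>>>> >
>>>>>>>>>>>>>>>>> >   // Legacy behavior: a plain CREATE TABLE becomes a Hive table.
>>>>>>>>>>>>>>>>> >   spark.conf.set("spark.sql.legacy.createHiveTableByDefault", "true")
>>>>>>>>>>>>>>>>> >   spark.sql("CREATE TABLE t_hive (id INT)")
>>>>>>>>>>>>>>>>> >
>>>>>>>>>>>>>>>>> >   // Proposed default: the same statement uses the native provider.
>>>>>>>>>>>>>>>>> >   spark.conf.set("spark.sql.legacy.createHiveTableByDefault", "false")
>>>>>>>>>>>>>>>>> >   spark.sql("CREATE TABLE t_native (id INT)")  // parquet table
>>>>>>>>>>>>>>>>> >
>>>>>>>>>>>>>>>>> >   // Either way, an explicit USING clause is unaffected.
>>>>>>>>>>>>>>>>> >   spark.sql("CREATE TABLE t_explicit (id INT) USING parquet")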
>>>>>>>>>>>>>>>>> >
>>>>>>>>>>>>>>>>> > Historically, this behavior change was merged once already
>>>>>>>>>>>>>>>>> > during Apache Spark 3.0.0 preparation via SPARK-30098, but
>>>>>>>>>>>>>>>>> > it was reverted during the 3.0.0 RC period.
>>>>>>>>>>>>>>>>> >
>>>>>>>>>>>>>>>>> > 2019-12-06: SPARK-30098 Use default datasource as provider
>>>>>>>>>>>>>>>>> >             for CREATE TABLE
>>>>>>>>>>>>>>>>> > 2020-05-16: SPARK-31707 Revert SPARK-30098 Use default
>>>>>>>>>>>>>>>>> >             datasource as provider for CREATE TABLE command
>>>>>>>>>>>>>>>>> >
>>>>>>>>>>>>>>>>> > At Apache Spark 3.1.0, we had another discussion about this
>>>>>>>>>>>>>>>>> > and defined it as one of the legacy behaviors behind this
>>>>>>>>>>>>>>>>> > configuration, via the reused ID SPARK-30098.
>>>>>>>>>>>>>>>>> >
>>>>>>>>>>>>>>>>> > 2020-12-01: https://lists.apache.org/thread/8c8k1jk61pzlcosz3mxo4rkj5l23r204
>>>>>>>>>>>>>>>>> > 2020-12-03: SPARK-30098 Add a configuration to use default
>>>>>>>>>>>>>>>>> >             datasource as provider for CREATE TABLE command
>>>>>>>>>>>>>>>>> >
>>>>>>>>>>>>>>>>> > Last year, we received two additional requests to switch
>>>>>>>>>>>>>>>>> > this, because Apache Spark 4.0.0 is a good time to make a
>>>>>>>>>>>>>>>>> > decision on the future direction.
>>>>>>>>>>>>>>>>> >
>>>>>>>>>>>>>>>>> > 2023-02-27: SPARK-42603 as an independent idea.
>>>>>>>>>>>>>>>>> > 2023-11-27: SPARK-46122 as a part of the Apache Spark 4.0.0
>>>>>>>>>>>>>>>>> >             plan
>>>>>>>>>>>>>>>>> >
>>>>>>>>>>>>>>>>> > WDYT? The technical scope is defined in the following PR,
>>>>>>>>>>>>>>>>> > which is one line of main code, one line of migration
>>>>>>>>>>>>>>>>> > guide, and a few lines of test code.
>>>>>>>>>>>>>>>>> >
>>>>>>>>>>>>>>>>> > - https://github.com/apache/spark/pull/46207
>>>>>>>>>>>>>>>>> >
>>>>>>>>>>>>>>>>> > Dongjoon.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> ---------------------------------------------------------------------
>>>>>>>>>>>>>>>>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
