It's a surprise to me to see someone take different positions within such a short period of time in the community.
Mich cast +1 for SPARK-44444 and -1 for SPARK-46122.

- https://lists.apache.org/thread/4cbkpvc3vr3b6k0wp6lgsw37spdpnqrc
- https://lists.apache.org/thread/x09gynt90v3hh5sql1gt9dlcn6m6699p

To Mich, what I'm interested in specifically is the following.

> 2. Compatibility: Changing the default behavior could potentially
> break existing workflows or pipelines that rely on the current behavior.

May I ask you the following questions?

A. What is the purpose of the migration guide in ASF projects?
B. Do you claim there is an incompatibility even when spark.sql.legacy.createHiveTableByDefault=true is set, as described in the migration guide?
C. Do you know that ANSI SQL introduces new RUNTIME exceptions which are harder than SPARK-46122?
D. Or did you cast +1 for SPARK-44444 because you think there is no breaking change by default?

I guess there is some misunderstanding of the proposal.

Thanks,
Dongjoon.

On Fri, Apr 26, 2024 at 12:05 PM Mich Talebzadeh <mich.talebza...@gmail.com> wrote:

> Hi,
>
> I would like to add a side note regarding the discussion process and the
> current title of the proposal. The title '[DISCUSS] SPARK-46122: Set
> spark.sql.legacy.createHiveTableByDefault to false' focuses on a specific
> configuration parameter, which might lead some participants to overlook
> its broader implications (as was raised by myself and others). I believe
> a more descriptive title, encompassing the broader discussion on default
> behaviours for creating Hive tables in Spark SQL, could enable greater
> engagement within the community. This is an important topic that deserves
> thorough consideration.
>
> HTH
>
> Mich Talebzadeh,
> Technologist | Architect | Data Engineer | Generative AI | FinCrime
> London, United Kingdom
>
> view my LinkedIn profile
> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>
> https://en.everybodywiki.com/Mich_Talebzadeh
>
> *Disclaimer:* The information provided is correct to the best of my
> knowledge but of course cannot be guaranteed. It is essential to note
> that, as with any advice, "one test result is worth one-thousand expert
> opinions" (Wernher von Braun
> <https://en.wikipedia.org/wiki/Wernher_von_Braun>).
>
> On Fri, 26 Apr 2024 at 07:13, L. C. Hsieh <vii...@gmail.com> wrote:
>
>> +1
>>
>> On Thu, Apr 25, 2024 at 8:16 PM Yuming Wang <yumw...@apache.org> wrote:
>>
>>> +1
>>>
>>> On Fri, Apr 26, 2024 at 8:25 AM Nimrod Ofek <ofek.nim...@gmail.com> wrote:
>>>
>>>> Of course, I can't think of a scenario with thousands of tables on a
>>>> single in-memory Spark cluster with an in-memory catalog.
>>>> Thanks for the help!
>>>>
>>>> On Thu, 25 Apr 2024 at 23:56, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>>>>
>>>>> Agreed. In scenarios where most of the interactions with the catalog
>>>>> relate to query planning, saving, and metadata management, the choice
>>>>> of catalog implementation may have less impact on query runtime
>>>>> performance. This is because the time spent on metadata operations is
>>>>> generally minimal compared to the time spent on actual data fetching,
>>>>> processing, and computation.
>>>>> However, scalability and reliability become real concerns as the size
>>>>> and complexity of the data and query workload grow. While an in-memory
>>>>> catalog may offer excellent performance for smaller workloads, it will
>>>>> face limitations in handling larger-scale deployments with thousands
>>>>> of tables, partitions, and users.
>>>>> Additionally, durability and persistence are crucial considerations,
>>>>> particularly in production environments where data integrity and
>>>>> availability are essential. In-memory catalog implementations may lack
>>>>> durability, meaning that metadata changes could be lost in the event
>>>>> of a system failure or restart. Therefore, while in-memory catalog
>>>>> implementations can provide speed and efficiency for certain use
>>>>> cases, we ought to consider the requirements for scalability,
>>>>> reliability, and data durability when choosing a catalog solution for
>>>>> production deployments. In many cases, a combination of in-memory and
>>>>> disk-based catalog solutions may offer the best balance of performance
>>>>> and resilience for demanding large-scale workloads.
>>>>>
>>>>> HTH
>>>>>
>>>>> Mich Talebzadeh
>>>>>
>>>>> On Thu, 25 Apr 2024 at 16:32, Nimrod Ofek <ofek.nim...@gmail.com> wrote:
>>>>>
>>>>>> Of course, but it's in memory and not persisted, which is much
>>>>>> faster. And as I said, I believe most of the interaction with it
>>>>>> happens during planning and save rather than during the actual query
>>>>>> run, and those operations are short and minimal compared to data
>>>>>> fetching and manipulation, so I don't believe it will have a big
>>>>>> impact on query runtime...
>>>>>>
>>>>>> On Thu, 25 Apr 2024 at 17:52, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>>>>>>
>>>>>>> Well, I will be surprised, because the Derby database is single
>>>>>>> threaded and won't be of much use here.
>>>>>>>
>>>>>>> Most Hive metastores in the commercial world use PostgreSQL or
>>>>>>> Oracle as the metastore, which are battle proven, replicated, and
>>>>>>> backed up.
>>>>>>>
>>>>>>> Mich Talebzadeh
>>>>>>>
>>>>>>> On Thu, 25 Apr 2024 at 15:39, Nimrod Ofek <ofek.nim...@gmail.com> wrote:
>>>>>>>
>>>>>>>> Yes, an in-memory Hive catalog backed by a local Derby DB.
>>>>>>>> And again, I presume that most metadata-related work happens during
>>>>>>>> planning and not during the actual run, so I don't see why it
>>>>>>>> should strongly affect query performance.
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>>
>>>>>>>> On Thu, 25 Apr 2024 at 17:29, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> With regard to your point below:
>>>>>>>>>
>>>>>>>>> "The thing I'm missing is this: let's say that the output format I
>>>>>>>>> choose is Delta Lake or Iceberg or whatever format that uses
>>>>>>>>> parquet. Where does the catalog implementation (which holds
>>>>>>>>> metadata afaik, the same metadata that Iceberg and Delta Lake save
>>>>>>>>> for their tables about their columns) come into play, and why
>>>>>>>>> should it affect performance?"
>>>>>>>>>
>>>>>>>>> The catalog implementation comes into play regardless of the
>>>>>>>>> output format chosen (Delta Lake, Iceberg, Parquet, etc.) because
>>>>>>>>> it is responsible for managing metadata about the datasets,
>>>>>>>>> tables, schemas, and other objects stored in the aforementioned
>>>>>>>>> formats. Even though Delta Lake and Iceberg have their own internal
>>>>>>>>> metadata management mechanisms, they still rely on the catalog to
>>>>>>>>> provide a unified interface for accessing and manipulating
>>>>>>>>> metadata across different storage formats.
>>>>>>>>>
>>>>>>>>> "Another thing is that if I understand correctly, and I might be
>>>>>>>>> totally wrong here, the internal Spark catalog is a local
>>>>>>>>> installation of the Hive metastore anyway, so I'm not sure what
>>>>>>>>> the catalog has to do with anything."
>>>>>>>>>
>>>>>>>>> I don't understand this. Do you mean a Derby database?
>>>>>>>>>
>>>>>>>>> HTH
>>>>>>>>>
>>>>>>>>> Mich Talebzadeh
>>>>>>>>>
>>>>>>>>> On Thu, 25 Apr 2024 at 14:38, Nimrod Ofek <ofek.nim...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> Thanks for the detailed answer.
>>>>>>>>>> The thing I'm missing is this: let's say that the output format I
>>>>>>>>>> choose is Delta Lake or Iceberg or whatever format that uses
>>>>>>>>>> parquet. Where does the catalog implementation (which holds
>>>>>>>>>> metadata afaik, the same metadata that Iceberg and Delta Lake
>>>>>>>>>> save for their tables about their columns) come into play, and
>>>>>>>>>> why should it affect performance?
>>>>>>>>>> Another thing is that if I understand correctly, and I might be
>>>>>>>>>> totally wrong here, the internal Spark catalog is a local
>>>>>>>>>> installation of the Hive metastore anyway, so I'm not sure what
>>>>>>>>>> the catalog has to do with anything.
>>>>>>>>>>
>>>>>>>>>> Thanks!
>>>>>>>>>>
>>>>>>>>>> On Thu, 25 Apr 2024 at 16:14, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> My take regarding your question is that your mileage varies, so
>>>>>>>>>>> to speak.
>>>>>>>>>>>
>>>>>>>>>>> 1) Hive provides a more mature and widely adopted catalog
>>>>>>>>>>> solution that integrates well with other components in the
>>>>>>>>>>> Hadoop ecosystem, such as HDFS, HBase, and YARN. If you are
>>>>>>>>>>> Hadoop-centric (say, on-premise), using Hive may offer better
>>>>>>>>>>> compatibility and interoperability.
>>>>>>>>>>> 2) Hive provides a SQL-like interface that is familiar to users
>>>>>>>>>>> accustomed to traditional RDBMSs. If your use case involves
>>>>>>>>>>> complex SQL queries or existing SQL-based workflows, using Hive
>>>>>>>>>>> may be advantageous.
>>>>>>>>>>> 3) If you are looking for performance, Spark's native catalog
>>>>>>>>>>> tends to offer better performance for certain workloads,
>>>>>>>>>>> particularly those that involve iterative processing or complex
>>>>>>>>>>> data transformations (my understanding). Spark's in-memory
>>>>>>>>>>> processing capabilities and optimizations make it well suited
>>>>>>>>>>> for interactive analytics and machine learning tasks (my
>>>>>>>>>>> favourite).
>>>>>>>>>>> 4) Integration with Spark workflows: if you primarily use Spark
>>>>>>>>>>> for data processing and analytics, using Spark's native catalog
>>>>>>>>>>> may simplify workflow management and reduce overhead. Spark's
>>>>>>>>>>> tight integration with its catalog allows for seamless
>>>>>>>>>>> interaction with Spark applications and libraries.
>>>>>>>>>>> 5) There seems to be some similarity between the Spark catalog
>>>>>>>>>>> and the Databricks Unity Catalog, so that may favour the choice.
>>>>>>>>>>>
>>>>>>>>>>> HTH
>>>>>>>>>>>
>>>>>>>>>>> Mich Talebzadeh
>>>>>>>>>>>
>>>>>>>>>>> On Thu, 25 Apr 2024 at 12:30, Nimrod Ofek <ofek.nim...@gmail.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> I will also appreciate some material that describes the
>>>>>>>>>>>> differences between Spark native tables and Hive tables and why
>>>>>>>>>>>> each should be used...
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks,
>>>>>>>>>>>> Nimrod
>>>>>>>>>>>>
>>>>>>>>>>>> On Thu, 25 Apr 2024 at 14:27, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> I see a statement made as below, and I quote:
>>>>>>>>>>>>>
>>>>>>>>>>>>> "The proposal of SPARK-46122 is to switch the default value of
>>>>>>>>>>>>> this configuration from `true` to `false` to use Spark native
>>>>>>>>>>>>> tables because we support better."
>>>>>>>>>>>>>
>>>>>>>>>>>>> Can you please elaborate on the above, specifically with
>>>>>>>>>>>>> regard to the phrase "... because we support better."
>>>>>>>>>>>>>
>>>>>>>>>>>>> Are you referring to the performance of the Spark catalog (I
>>>>>>>>>>>>> believe it is internal) or to integration with Spark?
>>>>>>>>>>>>>
>>>>>>>>>>>>> HTH
>>>>>>>>>>>>>
>>>>>>>>>>>>> Mich Talebzadeh
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Thu, 25 Apr 2024 at 11:17, Wenchen Fan <cloud0...@gmail.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> +1
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Thu, Apr 25, 2024 at 2:46 PM Kent Yao <y...@apache.org> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> +1
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Nit: the umbrella ticket is SPARK-44111, not SPARK-44444.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>> Kent Yao
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Thu, Apr 25, 2024 at 14:39, Dongjoon Hyun <dongjoon.h...@gmail.com> wrote:
>>>>>>>>>>>>>>> >
>>>>>>>>>>>>>>> > Hi, All.
>>>>>>>>>>>>>>> >
>>>>>>>>>>>>>>> > It's great to see community activities to polish 4.0.0
>>>>>>>>>>>>>>> > more and more. Thank you all.
>>>>>>>>>>>>>>> >
>>>>>>>>>>>>>>> > I'd like to bring SPARK-46122 (another SQL topic) to you
>>>>>>>>>>>>>>> > from the subtasks of SPARK-44444 (Prepare Apache Spark 4.0.0).
>>>>>>>>>>>>>>> >
>>>>>>>>>>>>>>> > - https://issues.apache.org/jira/browse/SPARK-46122
>>>>>>>>>>>>>>> >   Set `spark.sql.legacy.createHiveTableByDefault` to `false` by default
>>>>>>>>>>>>>>> >
>>>>>>>>>>>>>>> > This legacy configuration is about the `CREATE TABLE` SQL
>>>>>>>>>>>>>>> > syntax without `USING` and `STORED AS`, which is currently
>>>>>>>>>>>>>>> > mapped to a `Hive` table. The proposal of SPARK-46122 is
>>>>>>>>>>>>>>> > to switch the default value of this configuration from
>>>>>>>>>>>>>>> > `true` to `false` to use Spark native tables because we
>>>>>>>>>>>>>>> > support better.
>>>>>>>>>>>>>>> >
>>>>>>>>>>>>>>> > In other words, Spark will use the value of
>>>>>>>>>>>>>>> > `spark.sql.sources.default` as the table provider instead
>>>>>>>>>>>>>>> > of `Hive`, like the other Spark APIs. Of course, users can
>>>>>>>>>>>>>>> > get all the legacy behavior back by setting it to `true`.
>>>>>>>>>>>>>>> >
>>>>>>>>>>>>>>> > Historically, this behavior change was merged once during
>>>>>>>>>>>>>>> > the Apache Spark 3.0.0 preparation via SPARK-30098, but
>>>>>>>>>>>>>>> > reverted during the 3.0.0 RC period.
>>>>>>>>>>>>>>> >
>>>>>>>>>>>>>>> > 2019-12-06: SPARK-30098 Use default datasource as provider for CREATE TABLE
>>>>>>>>>>>>>>> > 2020-05-16: SPARK-31707 Revert SPARK-30098 Use default datasource as provider for CREATE TABLE command
>>>>>>>>>>>>>>> >
>>>>>>>>>>>>>>> > At Apache Spark 3.1.0, we had another discussion about
>>>>>>>>>>>>>>> > this and defined it as one of the legacy behaviors via
>>>>>>>>>>>>>>> > this configuration, under the reused ID SPARK-30098.
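To make the scope of the change concrete, here is a small illustrative sketch in plain Python (not Spark's actual parser; the function name is made up for illustration). Per the description above, only `CREATE TABLE` statements that specify neither a `USING` clause nor a `STORED AS` clause are affected by the configuration:

```python
# Illustrative sketch, NOT Spark internals: a rough check for whether a
# CREATE TABLE statement falls under spark.sql.legacy.createHiveTableByDefault.
# Statements with an explicit USING or STORED AS keep their chosen provider.
def is_affected_by_spark_46122(ddl: str) -> bool:
    s = " ".join(ddl.upper().split())  # normalize case and whitespace
    if not s.startswith("CREATE TABLE"):
        return False
    return "USING " not in s and "STORED AS " not in s

# Bare CREATE TABLE: provider depends on the legacy configuration.
assert is_affected_by_spark_46122("CREATE TABLE t (c INT)")
# Explicit provider: unaffected by the configuration either way.
assert not is_affected_by_spark_46122("CREATE TABLE t (c INT) USING parquet")
assert not is_affected_by_spark_46122("CREATE TABLE t (c INT) STORED AS orc")
```

This mirrors the "one line of main code" scope described in the proposal: only the default provider for the bare form changes.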
>>>>>>>>>>>>>>> >
>>>>>>>>>>>>>>> > 2020-12-01: https://lists.apache.org/thread/8c8k1jk61pzlcosz3mxo4rkj5l23r204
>>>>>>>>>>>>>>> > 2020-12-03: SPARK-30098 Add a configuration to use default datasource as provider for CREATE TABLE command
>>>>>>>>>>>>>>> >
>>>>>>>>>>>>>>> > Last year, we received two additional requests to switch
>>>>>>>>>>>>>>> > this, because Apache Spark 4.0.0 is a good time to make a
>>>>>>>>>>>>>>> > decision for the future direction.
>>>>>>>>>>>>>>> >
>>>>>>>>>>>>>>> > 2023-02-27: SPARK-42603 as an independent idea.
>>>>>>>>>>>>>>> > 2023-11-27: SPARK-46122 as a part of the Apache Spark 4.0.0 idea.
>>>>>>>>>>>>>>> >
>>>>>>>>>>>>>>> > WDYT? The technical scope is defined in the following PR,
>>>>>>>>>>>>>>> > which is one line of main code, one line of migration
>>>>>>>>>>>>>>> > guide, and a few lines of test code.
>>>>>>>>>>>>>>> >
>>>>>>>>>>>>>>> > - https://github.com/apache/spark/pull/46207
>>>>>>>>>>>>>>> >
>>>>>>>>>>>>>>> > Dongjoon.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> ---------------------------------------------------------------------
>>>>>>>>>>>>>>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
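The default flip debated in this thread can be summarized with a short sketch. This is a hedged simulation in plain Python of the resolution rule described in the proposal (the function name and the `"parquet"` default are illustrative assumptions, not Spark's internal code): for a bare `CREATE TABLE`, the legacy flag picks `hive`, otherwise the value of `spark.sql.sources.default` is used.

```python
# Illustrative sketch of the SPARK-46122 proposal, NOT Spark internals:
# how the table provider would be resolved for a CREATE TABLE statement
# that has neither USING nor STORED AS.
def resolve_table_provider(create_hive_table_by_default: bool,
                           default_source: str = "parquet") -> str:
    if create_hive_table_by_default:
        # Legacy behavior: spark.sql.legacy.createHiveTableByDefault=true
        return "hive"
    # Proposed 4.0.0 default: fall back to spark.sql.sources.default
    return default_source

# Spark 3.x default (legacy flag true): a bare CREATE TABLE is a Hive table.
print(resolve_table_provider(True))    # hive
# Proposed Spark 4.0.0 default (legacy flag false): a Spark native table
# using spark.sql.sources.default.
print(resolve_table_provider(False))   # parquet
```

As the thread notes, users who need the old behavior can opt back in by setting the legacy flag to `true`, which is the compatibility path described in the migration guide.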