Re: [DISCUSS] SPARK-46122: Set spark.sql.legacy.createHiveTableByDefault to false

2024-04-29 Thread Hyukjin Kwon
Mich, it is a legacy config we should get rid of in the end, and it has
been tested in production for a very long time. Spark should create a Spark
table by default.
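
For readers skimming the archive, a minimal Spark SQL sketch of the behavior
under discussion (the table name t is hypothetical; spark.sql.sources.default
is assumed to be left at its default of parquet):

    -- CREATE TABLE without USING or STORED AS is what this config governs
    CREATE TABLE t (c INT);
    -- legacy config = true  -> Hive serde table
    -- legacy config = false -> Spark datasource table (parquet)
    DESCRIBE TABLE EXTENDED t;  -- the Provider row shows which kind was created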

On Tue, Apr 30, 2024 at 5:38 AM Mich Talebzadeh 
wrote:

> Your point
>
> ".. t's a surprise to me to see that someone has different positions in a
> very short period of time in the community"
>
> Well, I have been with Spark since 2015, and this is an article on Medium,
> dated February 7, 2016, regarding both Hive and Spark, which I also
> presented at a Hortonworks meet-up.
>
> Hive on Spark Engine Versus Spark Using Hive Metastore
> 
>
> With regard to why I cast a +1 vote for one and -1 for the other, I
> think it is my prerogative how I vote, and we should leave it at that.
>
> Mich Talebzadeh,
> Technologist | Architect | Data Engineer  | Generative AI | FinCrime
> London
> United Kingdom
>
>
>view my Linkedin profile
> 
>
>
>  https://en.everybodywiki.com/Mich_Talebzadeh
>
>
>
> *Disclaimer:* The information provided is correct to the best of my
> knowledge but of course cannot be guaranteed. It is essential to note
> that, as with any advice, quote "one test result is worth one-thousand
> expert opinions" (Werner Von Braun).
>
>
> On Mon, 29 Apr 2024 at 17:32, Dongjoon Hyun 
> wrote:
>
>> It's a surprise to me to see that someone has different positions
>> in a very short period of time in the community.
>>
>> Mitch cast +1 for SPARK-4 and -1 for SPARK-46122.
>> - https://lists.apache.org/thread/4cbkpvc3vr3b6k0wp6lgsw37spdpnqrc
>> - https://lists.apache.org/thread/x09gynt90v3hh5sql1gt9dlcn6m6699p
>>
>> To Mitch, what I'm interested in is the following specifically.
>> > 2. Compatibility: Changing the default behavior could potentially
>> >  break existing workflows or pipelines that rely on the current
>> behavior.
>>
>> May I ask you the following questions?
>> A. What is the purpose of the migration guide in the ASF projects?
>>
>> B. Do you claim that there is incompatibility when you have
>>  spark.sql.legacy.createHiveTableByDefault=true which is described
>>  in the migration guide?
>>
>> C. Do you know that ANSI SQL has new RUNTIME exceptions
>>  which are harder than SPARK-46122?
>>
>> D. Or, did you cast +1 for SPARK-4 because
>>  you think there is no breaking change by default?
>>
>> I guess there is some misunderstanding on the proposal.
>>
>> Thanks,
>> Dongjoon.
>>
>>
>> On Fri, Apr 26, 2024 at 12:05 PM Mich Talebzadeh <
>> mich.talebza...@gmail.com> wrote:
>>
>>> Hi,
>>>
>>> I would like to add a side note regarding the discussion process and the
>>> current title of the proposal. The title '[DISCUSS] SPARK-46122: Set
>>> spark.sql.legacy.createHiveTableByDefault to false' focuses on a specific
>>> configuration parameter, which might lead some participants to overlook its
>>> broader implications (as was raised by myself and others). I believe that a
>>> more descriptive title, encompassing the broader discussion on default
>>> behaviours for creating Hive tables in Spark SQL, could enable greater
>>> engagement within the community. This is an important topic that deserves
>>> thorough consideration.
>>>
>>> HTH
>>>
>>> Mich Talebzadeh,
>>> Technologist | Architect | Data Engineer  | Generative AI | FinCrime
>>> London
>>> United Kingdom
>>>
>>>
>>>view my Linkedin profile
>>> 
>>>
>>>
>>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>>
>>>
>>>
>>> *Disclaimer:* The information provided is correct to the best of my
>>> knowledge but of course cannot be guaranteed. It is essential to note
>>> that, as with any advice, quote "one test result is worth one-thousand
>>> expert opinions" (Werner Von Braun).
>>>
>>>
>>> On Fri, 26 Apr 2024 at 07:13, L. C. Hsieh  wrote:
>>>
 +1

 On Thu, Apr 25, 2024 at 8:16 PM Yuming Wang  wrote:

> +1
>
> On Fri, Apr 26, 2024 at 8:25 AM Nimrod Ofek 
> wrote:
>
>> Of course, I can't think of a scenario with thousands of tables on a
>> single in-memory Spark cluster with an in-memory catalog.
>> Thanks for the help!
>>
>> On Thu, 25 Apr 2024 at 23:56, Mich Talebzadeh <
>> mich.talebza...@gmail.com>:
>>
>>>
>>>
>>> Agreed. In scenarios where most of the interactions with the catalog
>>> are related to query planning, saving and metadata management, the choice
>>> of catalog implementation may have less impact on query runtime performance.
>>> This is because the time spent on metadata operations is generally
>>> minimal 

Re: [DISCUSS] SPARK-46122: Set spark.sql.legacy.createHiveTableByDefault to false

2024-04-29 Thread Dongjoon Hyun
? I'm not sure why you think in that direction.

What I wrote was the following.

- You voted +1 for SPARK-4 on April 14th
  (https://lists.apache.org/thread/tp92yzf8y4yjfk6r3dkqjtlb060g82sy) 
- You voted -1 for SPARK-46122 on April 26th.
  (https://lists.apache.org/thread/2ybq1jb19j0c52rgo43zfd9br1yhtfj8)

You showed a double standard on the same kind of SQL votes within two weeks.

We always count all votes from all contributors
in order to keep a comprehensive record of all feedback.

Dongjoon.

On 2024/04/29 17:49:36 Mich Talebzadeh wrote:
> Your point
> 
> ".. t's a surprise to me to see that someone has different positions in a
> very short period of time in the community"
> 
> Well, I have been with Spark since 2015, and this is an article on Medium,
> dated February 7, 2016, regarding both Hive and Spark, which I also
> presented at a Hortonworks meet-up.
> 
> Hive on Spark Engine Versus Spark Using Hive Metastore
> 
> 
> With regard to why I cast a +1 vote for one and -1 for the other, I think
> it is my prerogative how I vote, and we should leave it at that.
> 
> Mich Talebzadeh,
> Technologist | Architect | Data Engineer  | Generative AI | FinCrime
> London
> United Kingdom
> 
> 
>view my Linkedin profile
> 
> 
> 
>  https://en.everybodywiki.com/Mich_Talebzadeh
> 
> 
> 
> *Disclaimer:* The information provided is correct to the best of my
> knowledge but of course cannot be guaranteed. It is essential to note
> that, as with any advice, quote "one test result is worth one-thousand
> expert opinions" (Werner Von Braun).
> 
> 
> On Mon, 29 Apr 2024 at 17:32, Dongjoon Hyun  wrote:
> 
> > It's a surprise to me to see that someone has different positions
> > in a very short period of time in the community.
> >
> > Mitch cast +1 for SPARK-4 and -1 for SPARK-46122.
> > - https://lists.apache.org/thread/4cbkpvc3vr3b6k0wp6lgsw37spdpnqrc
> > - https://lists.apache.org/thread/x09gynt90v3hh5sql1gt9dlcn6m6699p
> >
> > To Mitch, what I'm interested in is the following specifically.
> > > 2. Compatibility: Changing the default behavior could potentially
> > >  break existing workflows or pipelines that rely on the current behavior.
> >
> > May I ask you the following questions?
> > A. What is the purpose of the migration guide in the ASF projects?
> >
> > B. Do you claim that there is incompatibility when you have
> >  spark.sql.legacy.createHiveTableByDefault=true which is described
> >  in the migration guide?
> >
> > C. Do you know that ANSI SQL has new RUNTIME exceptions
> >  which are harder than SPARK-46122?
> >
> > D. Or, did you cast +1 for SPARK-4 because
> >  you think there is no breaking change by default?
> >
> > I guess there is some misunderstanding on the proposal.
> >
> > Thanks,
> > Dongjoon.
> >
> >
> > On Fri, Apr 26, 2024 at 12:05 PM Mich Talebzadeh <
> > mich.talebza...@gmail.com> wrote:
> >
> >> Hi,
> >>
> >> I would like to add a side note regarding the discussion process and the
> >> current title of the proposal. The title '[DISCUSS] SPARK-46122: Set
> >> spark.sql.legacy.createHiveTableByDefault to false' focuses on a specific
> >> configuration parameter, which might lead some participants to overlook its
> >> broader implications (as was raised by myself and others). I believe that a
> >> more descriptive title, encompassing the broader discussion on default
> >> behaviours for creating Hive tables in Spark SQL, could enable greater
> >> engagement within the community. This is an important topic that deserves
> >> thorough consideration.
> >>
> >> HTH
> >>
> >> Mich Talebzadeh,
> >> Technologist | Architect | Data Engineer  | Generative AI | FinCrime
> >> London
> >> United Kingdom
> >>
> >>
> >>view my Linkedin profile
> >> 
> >>
> >>
> >>  https://en.everybodywiki.com/Mich_Talebzadeh
> >>
> >>
> >>
> >> *Disclaimer:* The information provided is correct to the best of my
> >> knowledge but of course cannot be guaranteed. It is essential to note
> >> that, as with any advice, quote "one test result is worth one-thousand
> >> expert opinions" (Werner Von Braun).
> >>
> >>
> >> On Fri, 26 Apr 2024 at 07:13, L. C. Hsieh  wrote:
> >>
> >>> +1
> >>>
> >>> On Thu, Apr 25, 2024 at 8:16 PM Yuming Wang  wrote:
> >>>
>  +1
> 
>  On Fri, Apr 26, 2024 at 8:25 AM Nimrod Ofek 
>  wrote:
> 
> > Of course, I can't think of a scenario with thousands of tables on a
> > single in-memory Spark cluster with an in-memory catalog.
> > Thanks for the help!
> >
> > On Thu, 25 Apr 2024 at 23:56, Mich 

Re: [DISCUSS] SPARK-46122: Set spark.sql.legacy.createHiveTableByDefault to false

2024-04-29 Thread Mich Talebzadeh
Your point

".. t's a surprise to me to see that someone has different positions in a
very short period of time in the community"

Well, I have been with Spark since 2015, and this is an article on Medium,
dated February 7, 2016, regarding both Hive and Spark, which I also
presented at a Hortonworks meet-up.

Hive on Spark Engine Versus Spark Using Hive Metastore


With regard to why I cast a +1 vote for one and -1 for the other, I think
it is my prerogative how I vote, and we should leave it at that.

Mich Talebzadeh,
Technologist | Architect | Data Engineer  | Generative AI | FinCrime
London
United Kingdom


   view my Linkedin profile



 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* The information provided is correct to the best of my
knowledge but of course cannot be guaranteed. It is essential to note
that, as with any advice, quote "one test result is worth one-thousand
expert opinions" (Werner Von Braun).


On Mon, 29 Apr 2024 at 17:32, Dongjoon Hyun  wrote:

> It's a surprise to me to see that someone has different positions
> in a very short period of time in the community.
>
> Mitch cast +1 for SPARK-4 and -1 for SPARK-46122.
> - https://lists.apache.org/thread/4cbkpvc3vr3b6k0wp6lgsw37spdpnqrc
> - https://lists.apache.org/thread/x09gynt90v3hh5sql1gt9dlcn6m6699p
>
> To Mitch, what I'm interested in is the following specifically.
> > 2. Compatibility: Changing the default behavior could potentially
> >  break existing workflows or pipelines that rely on the current behavior.
>
> May I ask you the following questions?
> A. What is the purpose of the migration guide in the ASF projects?
>
> B. Do you claim that there is incompatibility when you have
>  spark.sql.legacy.createHiveTableByDefault=true which is described
>  in the migration guide?
>
> C. Do you know that ANSI SQL has new RUNTIME exceptions
>  which are harder than SPARK-46122?
>
> D. Or, did you cast +1 for SPARK-4 because
>  you think there is no breaking change by default?
>
> I guess there is some misunderstanding on the proposal.
>
> Thanks,
> Dongjoon.
>
>
> On Fri, Apr 26, 2024 at 12:05 PM Mich Talebzadeh <
> mich.talebza...@gmail.com> wrote:
>
>> Hi,
>>
>> I would like to add a side note regarding the discussion process and the
>> current title of the proposal. The title '[DISCUSS] SPARK-46122: Set
>> spark.sql.legacy.createHiveTableByDefault to false' focuses on a specific
>> configuration parameter, which might lead some participants to overlook its
>> broader implications (as was raised by myself and others). I believe that a
>> more descriptive title, encompassing the broader discussion on default
>> behaviours for creating Hive tables in Spark SQL, could enable greater
>> engagement within the community. This is an important topic that deserves
>> thorough consideration.
>>
>> HTH
>>
>> Mich Talebzadeh,
>> Technologist | Architect | Data Engineer  | Generative AI | FinCrime
>> London
>> United Kingdom
>>
>>
>>view my Linkedin profile
>> 
>>
>>
>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>
>>
>>
>> *Disclaimer:* The information provided is correct to the best of my
>> knowledge but of course cannot be guaranteed. It is essential to note
>> that, as with any advice, quote "one test result is worth one-thousand
>> expert opinions" (Werner Von Braun).
>>
>>
>> On Fri, 26 Apr 2024 at 07:13, L. C. Hsieh  wrote:
>>
>>> +1
>>>
>>> On Thu, Apr 25, 2024 at 8:16 PM Yuming Wang  wrote:
>>>
 +1

 On Fri, Apr 26, 2024 at 8:25 AM Nimrod Ofek 
 wrote:

> Of course, I can't think of a scenario with thousands of tables on a
> single in-memory Spark cluster with an in-memory catalog.
> Thanks for the help!
>
> On Thu, 25 Apr 2024 at 23:56, Mich Talebzadeh <
> mich.talebza...@gmail.com>:
>
>>
>>
>> Agreed. In scenarios where most of the interactions with the catalog
>> are related to query planning, saving and metadata management, the choice
>> of catalog implementation may have less impact on query runtime performance.
>> This is because the time spent on metadata operations is generally
>> minimal compared to the time spent on actual data fetching, processing, and
>> computation.
>> However, we should also consider scalability and reliability concerns,
>> especially as the size and complexity of the data and query workload grow.
>> While an in-memory catalog may offer excellent performance for smaller
>> workloads,
>> it will face limitations in handling larger-scale 

Re: [DISCUSS] SPARK-46122: Set spark.sql.legacy.createHiveTableByDefault to false

2024-04-29 Thread Dongjoon Hyun
It's a surprise to me to see that someone has different positions
in a very short period of time in the community.

Mitch cast +1 for SPARK-4 and -1 for SPARK-46122.
- https://lists.apache.org/thread/4cbkpvc3vr3b6k0wp6lgsw37spdpnqrc
- https://lists.apache.org/thread/x09gynt90v3hh5sql1gt9dlcn6m6699p

To Mitch, what I'm interested in is the following specifically.
> 2. Compatibility: Changing the default behavior could potentially
>  break existing workflows or pipelines that rely on the current behavior.

May I ask you the following questions?
A. What is the purpose of the migration guide in the ASF projects?

B. Do you claim that there is incompatibility when you have
 spark.sql.legacy.createHiveTableByDefault=true which is described
 in the migration guide?

C. Do you know that ANSI SQL has new RUNTIME exceptions
 which are harder than SPARK-46122?

D. Or, did you cast +1 for SPARK-4 because
 you think there is no breaking change by default?
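
As a concrete footnote to questions B and C above, both behaviors come down to
one line of SQL each; a minimal sketch (spark.sql.ansi.enabled is, to my
understanding, the config behind the ANSI default change referenced here):

    -- B: the migration-guide escape hatch that restores the legacy mapping
    SET spark.sql.legacy.createHiveTableByDefault=true;

    -- C: ANSI mode turns some silent results into runtime errors
    SET spark.sql.ansi.enabled=true;
    SELECT 1/0;  -- fails at runtime with a DIVIDE_BY_ZERO error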

I guess there is some misunderstanding on the proposal.

Thanks,
Dongjoon.


On Fri, Apr 26, 2024 at 12:05 PM Mich Talebzadeh 
wrote:

> Hi,
>
> I would like to add a side note regarding the discussion process and the
> current title of the proposal. The title '[DISCUSS] SPARK-46122: Set
> spark.sql.legacy.createHiveTableByDefault to false' focuses on a specific
> configuration parameter, which might lead some participants to overlook its
> broader implications (as was raised by myself and others). I believe that a
> more descriptive title, encompassing the broader discussion on default
> behaviours for creating Hive tables in Spark SQL, could enable greater
> engagement within the community. This is an important topic that deserves
> thorough consideration.
>
> HTH
>
> Mich Talebzadeh,
> Technologist | Architect | Data Engineer  | Generative AI | FinCrime
> London
> United Kingdom
>
>
>view my Linkedin profile
> 
>
>
>  https://en.everybodywiki.com/Mich_Talebzadeh
>
>
>
> *Disclaimer:* The information provided is correct to the best of my
> knowledge but of course cannot be guaranteed. It is essential to note
> that, as with any advice, quote "one test result is worth one-thousand
> expert opinions" (Werner Von Braun).
>
>
> On Fri, 26 Apr 2024 at 07:13, L. C. Hsieh  wrote:
>
>> +1
>>
>> On Thu, Apr 25, 2024 at 8:16 PM Yuming Wang  wrote:
>>
>>> +1
>>>
>>> On Fri, Apr 26, 2024 at 8:25 AM Nimrod Ofek 
>>> wrote:
>>>
 Of course, I can't think of a scenario with thousands of tables on a
 single in-memory Spark cluster with an in-memory catalog.
 Thanks for the help!

 On Thu, 25 Apr 2024 at 23:56, Mich Talebzadeh <
 mich.talebza...@gmail.com>:

>
>
> Agreed. In scenarios where most of the interactions with the catalog
> are related to query planning, saving and metadata management, the choice
> of catalog implementation may have less impact on query runtime performance.
> This is because the time spent on metadata operations is generally
> minimal compared to the time spent on actual data fetching, processing, and
> computation.
> However, we should also consider scalability and reliability concerns,
> especially as the size and complexity of the data and query workload grow.
> While an in-memory catalog may offer excellent performance for smaller
> workloads,
> it will face limitations in handling larger-scale deployments with
> thousands of tables, partitions, and users. Additionally, durability and
> persistence are crucial considerations, particularly in production
> environments where data integrity
> and availability are crucial. In-memory catalog implementations may
> lack durability, meaning that metadata changes could be lost in the event
> of a system failure or restart. Therefore, while in-memory catalog
> implementations can provide speed and efficiency for certain use cases, we
> ought to consider the requirements for scalability, reliability, and data
> durability when choosing a catalog solution for production deployments. In
> many cases, a combination of in-memory and disk-based catalog solutions may
> offer the best balance of performance and resilience for demanding large
> scale workloads.
>
>
> HTH
>
>
> Mich Talebzadeh,
> Technologist | Architect | Data Engineer  | Generative AI | FinCrime
> London
> United Kingdom
>
>
>view my Linkedin profile
> 
>
>
>  https://en.everybodywiki.com/Mich_Talebzadeh
>
>
>
> *Disclaimer:* The information provided is correct to the best of my
> knowledge but of course cannot be guaranteed. It is essential to note
> that, as with any advice, 

Re: [DISCUSS] SPARK-46122: Set spark.sql.legacy.createHiveTableByDefault to false

2024-04-26 Thread Dongjoon Hyun
Thank you, Kent, Wenchen, Mich, Nimrod, Yuming, LiangChi. I'll start a vote.

To Mich, for your question: Apache Spark has a long history of converting
Hive-provider tables into Spark's datasource tables, to handle them better in
a Spark-native way.

> Can you please elaborate on the above specifically with regard to the phrase
> ".. because we support better."

Here are the subset of configurations you can take a look at.
- spark.sql.hive.convertMetastoreParquet (`true` since Spark 1.3.0)
- spark.sql.hive.convertMetastoreOrc (`true` since Spark 2.4.0)
- spark.sql.hive.convertInsertingPartitionedTable (`true` since Spark 3.0.0)
- spark.sql.hive.convertMetastoreInsertDir (`true` since Spark 3.3.0)
- spark.sql.hive.convertInsertingUnpartitionedTable (`true` since Spark 4.0.0)
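
Each of these can be inspected from a running session; a minimal sketch (in
Spark SQL, SET with only a key prints that key's current value):

    SET spark.sql.hive.convertMetastoreParquet;  -- `true` since Spark 1.3.0
    SET spark.sql.hive.convertMetastoreOrc;      -- `true` since Spark 2.4.0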

Dongjoon.


On Fri, Apr 26, 2024 at 12:24 AM L. C. Hsieh  wrote:

> +1
>
> On Thu, Apr 25, 2024 at 8:16 PM Yuming Wang  wrote:
>
>> +1
>>
>> On Fri, Apr 26, 2024 at 8:25 AM Nimrod Ofek 
>> wrote:
>>
>>> Of course, I can't think of a scenario with thousands of tables on a
>>> single in-memory Spark cluster with an in-memory catalog.
>>> Thanks for the help!
>>>
>>> On Thu, 25 Apr 2024 at 23:56, Mich Talebzadeh <
>>> mich.talebza...@gmail.com>:
>>>


 Agreed. In scenarios where most of the interactions with the catalog
 are related to query planning, saving and metadata management, the choice
 of catalog implementation may have less impact on query runtime 
 performance.
 This is because the time spent on metadata operations is generally
 minimal compared to the time spent on actual data fetching, processing, and
 computation.
 However, we should also consider scalability and reliability concerns,
 especially as the size and complexity of the data and query workload grow.
 While an in-memory catalog may offer excellent performance for smaller
 workloads,
 it will face limitations in handling larger-scale deployments with
 thousands of tables, partitions, and users. Additionally, durability and
 persistence are crucial considerations, particularly in production
 environments where data integrity
 and availability are crucial. In-memory catalog implementations may
 lack durability, meaning that metadata changes could be lost in the event
 of a system failure or restart. Therefore, while in-memory catalog
 implementations can provide speed and efficiency for certain use cases, we
 ought to consider the requirements for scalability, reliability, and data
 durability when choosing a catalog solution for production deployments. In
 many cases, a combination of in-memory and disk-based catalog solutions may
 offer the best balance of performance and resilience for demanding large
 scale workloads.


 HTH


 Mich Talebzadeh,
 Technologist | Architect | Data Engineer  | Generative AI | FinCrime
 London
 United Kingdom


view my Linkedin profile
 


  https://en.everybodywiki.com/Mich_Talebzadeh



 *Disclaimer:* The information provided is correct to the best of my
 knowledge but of course cannot be guaranteed. It is essential to note
 that, as with any advice, quote "one test result is worth one-thousand
 expert opinions" (Werner Von Braun).


 On Thu, 25 Apr 2024 at 16:32, Nimrod Ofek 
 wrote:

> Of course, but it's in memory and not persisted, which is much faster.
> And as I said, I believe most of the interaction with it happens during
> the planning and save phases, not the actual query run; those operations are short
> and minimal compared to data fetching and manipulation, so I don't believe
> it will have a big impact on query runtime...
>
> On Thu, 25 Apr 2024 at 17:52, Mich Talebzadeh <
> mich.talebza...@gmail.com>:
>
>> Well, I would be surprised, because the Derby database is single-threaded
>> and won't be of much use here.
>>
>> Most Hive metastores in the commercial world use PostgreSQL or
>> Oracle as the metastore database; those are battle-proven, replicated, and backed up.
>>
>> Mich Talebzadeh,
>> Technologist | Architect | Data Engineer  | Generative AI | FinCrime
>> London
>> United Kingdom
>>
>>
>>view my Linkedin profile
>> 
>>
>>
>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>
>>
>>
>> *Disclaimer:* The information provided is correct to the best of my
>> knowledge but of course cannot be guaranteed. It is essential to note
>> that, as with any advice, quote "one test result is worth one-thousand
>> expert opinions (Werner
>> 

Re: [DISCUSS] SPARK-46122: Set spark.sql.legacy.createHiveTableByDefault to false

2024-04-26 Thread Mich Talebzadeh
Hi,

I would like to add a side note regarding the discussion process and the
current title of the proposal. The title '[DISCUSS] SPARK-46122: Set
spark.sql.legacy.createHiveTableByDefault to false' focuses on a specific
configuration parameter, which might lead some participants to overlook its
broader implications (as was raised by myself and others). I believe that a
more descriptive title, encompassing the broader discussion on default
behaviours for creating Hive tables in Spark SQL, could enable greater
engagement within the community. This is an important topic that deserves
thorough consideration.

HTH

Mich Talebzadeh,
Technologist | Architect | Data Engineer  | Generative AI | FinCrime
London
United Kingdom


   view my Linkedin profile



 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* The information provided is correct to the best of my
knowledge but of course cannot be guaranteed. It is essential to note
that, as with any advice, quote "one test result is worth one-thousand
expert opinions" (Werner Von Braun).


On Fri, 26 Apr 2024 at 07:13, L. C. Hsieh  wrote:

> +1
>
> On Thu, Apr 25, 2024 at 8:16 PM Yuming Wang  wrote:
>
>> +1
>>
>> On Fri, Apr 26, 2024 at 8:25 AM Nimrod Ofek 
>> wrote:
>>
>>> Of course, I can't think of a scenario with thousands of tables on a
>>> single in-memory Spark cluster with an in-memory catalog.
>>> Thanks for the help!
>>>
>>> On Thu, 25 Apr 2024 at 23:56, Mich Talebzadeh <
>>> mich.talebza...@gmail.com>:
>>>


 Agreed. In scenarios where most of the interactions with the catalog
 are related to query planning, saving and metadata management, the choice
 of catalog implementation may have less impact on query runtime 
 performance.
 This is because the time spent on metadata operations is generally
 minimal compared to the time spent on actual data fetching, processing, and
 computation.
 However, we should also consider scalability and reliability concerns,
 especially as the size and complexity of the data and query workload grow.
 While an in-memory catalog may offer excellent performance for smaller
 workloads,
 it will face limitations in handling larger-scale deployments with
 thousands of tables, partitions, and users. Additionally, durability and
 persistence are crucial considerations, particularly in production
 environments where data integrity
 and availability are crucial. In-memory catalog implementations may
 lack durability, meaning that metadata changes could be lost in the event
 of a system failure or restart. Therefore, while in-memory catalog
 implementations can provide speed and efficiency for certain use cases, we
 ought to consider the requirements for scalability, reliability, and data
 durability when choosing a catalog solution for production deployments. In
 many cases, a combination of in-memory and disk-based catalog solutions may
 offer the best balance of performance and resilience for demanding large
 scale workloads.


 HTH


 Mich Talebzadeh,
 Technologist | Architect | Data Engineer  | Generative AI | FinCrime
 London
 United Kingdom


view my Linkedin profile
 


  https://en.everybodywiki.com/Mich_Talebzadeh



 *Disclaimer:* The information provided is correct to the best of my
 knowledge but of course cannot be guaranteed. It is essential to note
 that, as with any advice, quote "one test result is worth one-thousand
 expert opinions" (Werner Von Braun).


 On Thu, 25 Apr 2024 at 16:32, Nimrod Ofek 
 wrote:

> Of course, but it's in memory and not persisted, which is much faster.
> And as I said, I believe most of the interaction with it happens during
> the planning and save phases, not the actual query run; those operations are short
> and minimal compared to data fetching and manipulation, so I don't believe
> it will have a big impact on query runtime...
>
> On Thu, 25 Apr 2024 at 17:52, Mich Talebzadeh <
> mich.talebza...@gmail.com>:
>
>> Well, I would be surprised, because the Derby database is single-threaded
>> and won't be of much use here.
>>
>> Most Hive metastores in the commercial world use PostgreSQL or
>> Oracle as the metastore database; those are battle-proven, replicated, and backed up.
>>
>> Mich Talebzadeh,
>> Technologist | Architect | Data Engineer  | Generative AI | FinCrime
>> London
>> United Kingdom
>>
>>
>>view my Linkedin profile
>> 

Re: [DISCUSS] SPARK-46122: Set spark.sql.legacy.createHiveTableByDefault to false

2024-04-25 Thread L. C. Hsieh
+1

On Thu, Apr 25, 2024 at 8:16 PM Yuming Wang  wrote:

> +1
>
> On Fri, Apr 26, 2024 at 8:25 AM Nimrod Ofek  wrote:
>
>> Of course, I can't think of a scenario with thousands of tables on a single
>> in-memory Spark cluster with an in-memory catalog.
>> Thanks for the help!
>>
>> On Thu, 25 Apr 2024 at 23:56, Mich Talebzadeh <
>> mich.talebza...@gmail.com>:
>>
>>>
>>>
>>> Agreed. In scenarios where most of the interactions with the catalog are
>>> related to query planning, saving and metadata management, the choice of
>>> catalog implementation may have less impact on query runtime performance.
>>> This is because the time spent on metadata operations is generally
>>> minimal compared to the time spent on actual data fetching, processing, and
>>> computation.
>>> However, we should also consider scalability and reliability concerns, especially
>>> as the size and complexity of the data and query workload grow. While an
>>> in-memory catalog may offer excellent performance for smaller workloads,
>>> it will face limitations in handling larger-scale deployments with
>>> thousands of tables, partitions, and users. Additionally, durability and
>>> persistence are crucial considerations, particularly in production
>>> environments where data integrity
>>> and availability are crucial. In-memory catalog implementations may lack
>>> durability, meaning that metadata changes could be lost in the event of a
>>> system failure or restart. Therefore, while in-memory catalog
>>> implementations can provide speed and efficiency for certain use cases, we
>>> ought to consider the requirements for scalability, reliability, and data
>>> durability when choosing a catalog solution for production deployments. In
>>> many cases, a combination of in-memory and disk-based catalog solutions may
>>> offer the best balance of performance and resilience for demanding large
>>> scale workloads.
>>>
>>>
>>> HTH
>>>
>>>
>>> Mich Talebzadeh,
>>> Technologist | Architect | Data Engineer  | Generative AI | FinCrime
>>> London
>>> United Kingdom
>>>
>>>
>>>view my Linkedin profile
>>> 
>>>
>>>
>>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>>
>>>
>>>
>>> *Disclaimer:* The information provided is correct to the best of my
>>> knowledge but of course cannot be guaranteed. It is essential to note
>>> that, as with any advice, quote "one test result is worth one-thousand
>>> expert opinions" (Werner Von Braun).
>>>
>>>
>>> On Thu, 25 Apr 2024 at 16:32, Nimrod Ofek  wrote:
>>>
 Of course, but it's in memory and not persisted, which is much faster.
 And as I said, I believe most of the interaction with it happens during the
 planning and save phases, not the actual query run; those operations are short
 and minimal compared to data fetching and manipulation, so I don't believe
 it will have a big impact on query runtime...

 On Thu, 25 Apr 2024 at 17:52, Mich Talebzadeh <
 mich.talebza...@gmail.com>:

> Well, I would be surprised, because the Derby database is single-threaded
> and won't be of much use here.
>
> Most Hive metastores in the commercial world use PostgreSQL or Oracle
> as the metastore database; those are battle-proven, replicated, and backed up.
>
> Mich Talebzadeh,
> Technologist | Architect | Data Engineer  | Generative AI | FinCrime
> London
> United Kingdom
>
>
>view my Linkedin profile
> 
>
>
>  https://en.everybodywiki.com/Mich_Talebzadeh
>
>
>
> *Disclaimer:* The information provided is correct to the best of my
> knowledge but of course cannot be guaranteed. It is essential to note
> that, as with any advice, quote "one test result is worth one-thousand
> expert opinions" (Werner Von Braun).
>
>
> On Thu, 25 Apr 2024 at 15:39, Nimrod Ofek 
> wrote:
>
>> Yes, an in-memory Hive catalog backed by a local Derby DB.
>> And again, I presume that most metadata-related work happens during
>> planning and not the actual run, so I don't see why it should strongly affect
>> query performance.
>>
>> Thanks,
>>
>>
>> On Thu, 25 Apr 2024 at 17:29, Mich Talebzadeh <
>> mich.talebza...@gmail.com>:
>>
>>> With regard to your point below
>>>
>>> "The thing I'm missing is this: let's say that the output format I
>>> choose is delta lake or iceberg or whatever format that uses parquet. Where
>>> does the catalog implementation (which holds metadata afaik, same metadata
>>> that iceberg and delta lake save for their tables about their columns)
>>> comes into play and why should it 

Re: [DISCUSS] SPARK-46122: Set spark.sql.legacy.createHiveTableByDefault to false

2024-04-25 Thread Yuming Wang
+1

On Fri, Apr 26, 2024 at 8:25 AM Nimrod Ofek  wrote:

> Of course, I can't think of a scenario with thousands of tables on a single
> in-memory Spark cluster with an in-memory catalog.
> Thanks for the help!
>
> On Thu, 25 Apr 2024 at 23:56, Mich Talebzadeh <
> mich.talebza...@gmail.com>:
>
>>
>>
>> Agreed. In scenarios where most of the interactions with the catalog are
>> related to query planning, saving and metadata management, the choice of
>> catalog implementation may have less impact on query runtime performance.
>> This is because the time spent on metadata operations is generally
>> minimal compared to the time spent on actual data fetching, processing, and
>> computation.
>> However, we should also consider scalability and reliability concerns, especially
>> as the size and complexity of the data and query workload grow. While an
>> in-memory catalog may offer excellent performance for smaller workloads,
>> it will face limitations in handling larger-scale deployments with
>> thousands of tables, partitions, and users. Additionally, durability and
>> persistence are crucial considerations, particularly in production
>> environments where data integrity
>> and availability are crucial. In-memory catalog implementations may lack
>> durability, meaning that metadata changes could be lost in the event of a
>> system failure or restart. Therefore, while in-memory catalog
>> implementations can provide speed and efficiency for certain use cases, we
>> ought to consider the requirements for scalability, reliability, and data
>> durability when choosing a catalog solution for production deployments. In
>> many cases, a combination of in-memory and disk-based catalog solutions may
>> offer the best balance of performance and resilience for demanding large
>> scale workloads.
>>
>>
>> HTH
>>
>>
>> Mich Talebzadeh,
>> Technologist | Architect | Data Engineer  | Generative AI | FinCrime
>> London
>> United Kingdom
>>
>>
>>view my Linkedin profile
>> 
>>
>>
>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>
>>
>>
>> *Disclaimer:* The information provided is correct to the best of my
>> knowledge but of course cannot be guaranteed. It is essential to note
>> that, as with any advice, quote "one test result is worth one-thousand
>> expert opinions" (Werner Von Braun).
>>
>>
>> On Thu, 25 Apr 2024 at 16:32, Nimrod Ofek  wrote:
>>
>>> Of course, but it's in memory and not persisted, which is much faster.
>>> And as I said, I believe most of the interaction with it happens during the
>>> planning and save phases, not the actual query run; those operations are short
>>> and minimal compared to data fetching and manipulation, so I don't believe
>>> it will have a big impact on query runtime...
>>>
>>> On Thu, 25 Apr 2024 at 17:52, Mich Talebzadeh <
>>> mich.talebza...@gmail.com>:
>>>
 Well, I would be surprised, because the Derby database is single-threaded and
 won't be of much use here.

 Most Hive metastores in the commercial world use PostgreSQL or Oracle
 as the metastore database; those are battle-proven, replicated, and backed up.

 Mich Talebzadeh,
 Technologist | Architect | Data Engineer  | Generative AI | FinCrime
 London
 United Kingdom


view my Linkedin profile
 


  https://en.everybodywiki.com/Mich_Talebzadeh



 *Disclaimer:* The information provided is correct to the best of my
 knowledge but of course cannot be guaranteed. It is essential to note
 that, as with any advice, quote "one test result is worth one-thousand
 expert opinions" (Werner Von Braun).


 On Thu, 25 Apr 2024 at 15:39, Nimrod Ofek 
 wrote:

> Yes, an in-memory Hive catalog backed by a local Derby DB.
> And again, I presume that most metadata-related work happens during
> planning and not the actual run, so I don't see why it should strongly affect
> query performance.
>
> Thanks,
>
>
> On Thu, 25 Apr 2024 at 17:29, Mich Talebzadeh <
> mich.talebza...@gmail.com>:
>
>> With regard to your point below
>>
>> "The thing I'm missing is this: let's say that the output format I
>> choose is delta lake or iceberg or whatever format that uses parquet. Where
>> does the catalog implementation (which holds metadata afaik, same metadata
>> that iceberg and delta lake save for their tables about their columns)
>> comes into play and why should it affect performance? "
>>
>> The catalog implementation comes into play regardless of the output
>> format chosen (Delta Lake, Iceberg, Parquet, etc.) because it is
>> 

Re: [DISCUSS] SPARK-46122: Set spark.sql.legacy.createHiveTableByDefault to false

2024-04-25 Thread Nimrod Ofek
Of course, I can't think of a scenario with thousands of tables on a single
in-memory Spark cluster with an in-memory catalog.
Thanks for the help!

On Thu, 25 Apr 2024 at 23:56, Mich Talebzadeh <
mich.talebza...@gmail.com>:

>
>
> Agreed. In scenarios where most of the interactions with the catalog are
> related to query planning, saving and metadata management, the choice of
> catalog implementation may have less impact on query runtime performance.
> This is because the time spent on metadata operations is generally minimal
> compared to the time spent on actual data fetching, processing, and
> computation.
> However, we should also consider scalability and reliability concerns, especially
> as the size and complexity of the data and query workload grow. While an
> in-memory catalog may offer excellent performance for smaller workloads,
> it will face limitations in handling larger-scale deployments with
> thousands of tables, partitions, and users. Additionally, durability and
> persistence are crucial considerations, particularly in production
> environments where data integrity
> and availability are crucial. In-memory catalog implementations may lack
> durability, meaning that metadata changes could be lost in the event of a
> system failure or restart. Therefore, while in-memory catalog
> implementations can provide speed and efficiency for certain use cases, we
> ought to consider the requirements for scalability, reliability, and data
> durability when choosing a catalog solution for production deployments. In
> many cases, a combination of in-memory and disk-based catalog solutions may
> offer the best balance of performance and resilience for demanding large
> scale workloads.
>
>
> HTH
>
>
> Mich Talebzadeh,
> Technologist | Architect | Data Engineer  | Generative AI | FinCrime
> London
> United Kingdom
>
>
>view my Linkedin profile
> 
>
>
>  https://en.everybodywiki.com/Mich_Talebzadeh
>
>
>
> *Disclaimer:* The information provided is correct to the best of my
> knowledge but of course cannot be guaranteed. It is essential to note
> that, as with any advice, quote "one test result is worth one-thousand
> expert opinions" (Werner Von Braun).
>
>
> On Thu, 25 Apr 2024 at 16:32, Nimrod Ofek  wrote:
>
>> Of course, but it's in memory and not persisted, which is much faster.
>> And as I said, I believe most of the interaction with it happens during the
>> planning and save phases, not the actual query run; those operations are short
>> and minimal compared to data fetching and manipulation, so I don't believe
>> it will have a big impact on query runtime...
>>
>> On Thu, 25 Apr 2024 at 17:52, Mich Talebzadeh <
>> mich.talebza...@gmail.com>:
>>
>>> Well, I would be surprised, because the Derby database is single-threaded and
>>> won't be of much use here.
>>>
>>> Most Hive metastores in the commercial world use PostgreSQL or Oracle
>>> as the metastore database; those are battle-proven, replicated, and backed up.
>>>
>>> Mich Talebzadeh,
>>> Technologist | Architect | Data Engineer  | Generative AI | FinCrime
>>> London
>>> United Kingdom
>>>
>>>
>>>view my Linkedin profile
>>> 
>>>
>>>
>>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>>
>>>
>>>
>>> *Disclaimer:* The information provided is correct to the best of my
>>> knowledge but of course cannot be guaranteed. It is essential to note
>>> that, as with any advice, quote "one test result is worth one-thousand
>>> expert opinions" (Werner Von Braun).
>>>
>>>
>>> On Thu, 25 Apr 2024 at 15:39, Nimrod Ofek  wrote:
>>>
 Yes, an in-memory Hive catalog backed by a local Derby DB.
 And again, I presume that most metadata-related work happens during
 planning and not the actual run, so I don't see why it should strongly affect
 query performance.

 Thanks,


 On Thu, 25 Apr 2024 at 17:29, Mich Talebzadeh <
 mich.talebza...@gmail.com>:

> With regard to your point below
>
> "The thing I'm missing is this: let's say that the output format I
> choose is delta lake or iceberg or whatever format that uses parquet. Where
> does the catalog implementation (which holds metadata afaik, same metadata
> that iceberg and delta lake save for their tables about their columns)
> comes into play and why should it affect performance? "
>
> The catalog implementation comes into play regardless of the output
> format chosen (Delta Lake, Iceberg, Parquet, etc.) because it is
> responsible for managing metadata about the datasets, tables, schemas, and
> other objects stored in the aforementioned formats. Even though Delta Lake and
> Iceberg have their metadata management 

Re: [DISCUSS] SPARK-46122: Set spark.sql.legacy.createHiveTableByDefault to false

2024-04-25 Thread Mich Talebzadeh
ok thanks got it

Mich Talebzadeh,
Technologist | Architect | Data Engineer  | Generative AI | FinCrime
London
United Kingdom


   view my Linkedin profile



 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* The information provided is correct to the best of my
knowledge but of course cannot be guaranteed. It is essential to note
that, as with any advice, quote "one test result is worth one-thousand
expert opinions" (Werner Von Braun).


On Thu, 25 Apr 2024 at 15:07, Wenchen Fan  wrote:

> It's for the data source. For example, Spark's built-in Parquet
> reader/writer is faster than the Hive serde Parquet reader/writer.
>
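
A minimal sketch of the two paths being contrasted (hypothetical table names;
note that with spark.sql.hive.convertMetastoreParquet=true, reads of the serde
table are converted to the native reader anyway):

    CREATE TABLE native_t (id INT) USING parquet;      -- Spark's native Parquet reader/writer
    CREATE TABLE serde_t  (id INT) STORED AS PARQUET;  -- Hive serde Parquet path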
> On Thu, Apr 25, 2024 at 9:55 PM Mich Talebzadeh 
> wrote:
>
>> I see a statement made below, and I quote:
>>
>> "The proposal of SPARK-46122 is to switch the default value of this
>> configuration from `true` to `false` to use Spark native tables because
>> we support better."
>>
>> Can you please elaborate on the above specifically with regard to the
>> phrase ".. because
>> we support better."
>>
>> Are you referring to the performance of the Spark catalog (I believe it is
>> internal) or its integration with Spark?
>>
>> HTH
>>
>> Mich Talebzadeh,
>> Technologist | Architect | Data Engineer  | Generative AI | FinCrime
>> London
>> United Kingdom
>>
>>
>>view my Linkedin profile
>> 
>>
>>
>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>
>>
>>
>> *Disclaimer:* The information provided is correct to the best of my
>> knowledge but of course cannot be guaranteed. It is essential to note
>> that, as with any advice, quote "one test result is worth one-thousand
>> expert opinions" (Werner Von Braun).
>>
>>
>> On Thu, 25 Apr 2024 at 11:17, Wenchen Fan  wrote:
>>
>>> +1
>>>
>>> On Thu, Apr 25, 2024 at 2:46 PM Kent Yao  wrote:
>>>
 +1

 Nit: the umbrella ticket is SPARK-44111, not SPARK-4.

 Thanks,
 Kent Yao

 Dongjoon Hyun wrote on Thu, 25 Apr 2024 at 14:39:
 >
 > Hi, All.
 >
 > It's great to see community activities to polish 4.0.0 more and more.
 > Thank you all.
 >
 > I'd like to bring SPARK-46122 (another SQL topic) to you from the subtasks
 > of SPARK-4 (Prepare Apache Spark 4.0.0),
 >
 > - https://issues.apache.org/jira/browse/SPARK-46122
 >    Set `spark.sql.legacy.createHiveTableByDefault` to `false` by default
 >
 > This legacy configuration is about `CREATE TABLE` SQL syntax without
 > `USING` and `STORED AS`, which is currently mapped to a `Hive` table.
 > The proposal of SPARK-46122 is to switch the default value of this
 > configuration from `true` to `false` to use Spark native tables because
 > we support better.
 >
 > In other words, Spark will use the value of `spark.sql.sources.default`
 > as the table provider instead of `Hive`, like the other Spark APIs. Of course,
 > the users can get all the legacy behavior by setting it back to `true`.
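
Worth noting: explicit syntax is unaffected by whichever default wins; only the
bare form depends on the config. A minimal sketch (hypothetical table names):

    CREATE TABLE ds_t    (id INT) USING parquet;      -- always a Spark datasource table
    CREATE TABLE hive_t  (id INT) STORED AS PARQUET;  -- always a Hive table
    CREATE TABLE plain_t (id INT);                    -- this form follows the config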
 >
 > Historically, this behavior change was merged once at Apache Spark 3.0.0
 > preparation via SPARK-30098 already, but reverted during the 3.0.0 RC period.
 >
 > 2019-12-06: SPARK-30098 Use default datasource as provider for CREATE TABLE
 > 2020-05-16: SPARK-31707 Revert SPARK-30098 Use default datasource as
 > provider for CREATE TABLE command
 >
 > At Apache Spark 3.1.0, we had another discussion about this and defined it
 > as one of the legacy behaviors via this configuration, under the reused ID SPARK-30098.
 >
 > 2020-12-01: https://lists.apache.org/thread/8c8k1jk61pzlcosz3mxo4rkj5l23r204
 > 2020-12-03: SPARK-30098 Add a configuration to use default datasource as
 > provider for CREATE TABLE command
 >
 > Last year, we received two additional requests to switch this, because
 > Apache Spark 4.0.0 is a good time to make a decision for the future direction.
 >
 > 2023-02-27: SPARK-42603 as an independent idea.
 > 2023-11-27: SPARK-46122 as a part of Apache Spark 4.0.0 idea
 >
 >
 > WDYT? The technical scope is defined in the following PR, which is one line of main
 > code, one line of migration guide, and a few lines of test code.
 >
 > - https://github.com/apache/spark/pull/46207
 >
 > Dongjoon.





Re: [DISCUSS] SPARK-46122: Set spark.sql.legacy.createHiveTableByDefault to false

2024-04-25 Thread Mich Talebzadeh
Agreed. In scenarios where most of the interactions with the catalog are
related to query planning, saving and metadata management, the choice of
catalog implementation may have less impact on query runtime performance.
This is because the time spent on metadata operations is generally minimal
compared to the time spent on actual data fetching, processing, and
computation.
However, we should also consider scalability and reliability concerns, especially as
the size and complexity of the data and query workload grow. While an in-memory
catalog may offer excellent performance for smaller workloads,
it will face limitations in handling larger-scale deployments with
thousands of tables, partitions, and users. Additionally, durability and
persistence are crucial considerations, particularly in production
environments where data integrity
and availability are paramount. In-memory catalog implementations may lack
durability, meaning that metadata changes could be lost in the event of a
system failure or restart. Therefore, while in-memory catalog
implementations can provide speed and efficiency for certain use cases, we
ought to consider the requirements for scalability, reliability, and data
durability when choosing a catalog solution for production deployments. In
many cases, a combination of in-memory and disk-based catalog solutions may
offer the best balance of performance and resilience for demanding large
scale workloads.
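
The knob Spark exposes for this trade-off is the static configuration
spark.sql.catalogImplementation, which must be chosen at session startup; a
minimal sketch (command lines, assuming a standard Spark distribution):

    spark-shell --conf spark.sql.catalogImplementation=in-memory  # ephemeral, per-session catalog
    spark-shell --conf spark.sql.catalogImplementation=hive       # persistent, metastore-backed catalog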


HTH


Mich Talebzadeh,
Technologist | Architect | Data Engineer  | Generative AI | FinCrime
London
United Kingdom


   view my Linkedin profile



 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* The information provided is correct to the best of my
knowledge but of course cannot be guaranteed. It is essential to note
that, as with any advice, quote "one test result is worth one-thousand
expert opinions" (Werner Von Braun).


On Thu, 25 Apr 2024 at 16:32, Nimrod Ofek  wrote:

> Of course, but it's in memory and not persisted, which is much faster.
> And as I said, I believe most of the interaction with it happens during the
> planning and save phases, not the actual query run; those operations are short
> and minimal compared to data fetching and manipulation, so I don't believe
> it will have a big impact on query runtime...
>
> On Thu, 25 Apr 2024 at 17:52, Mich Talebzadeh <
> mich.talebza...@gmail.com>:
>
>> Well, I would be surprised, because the Derby database is single-threaded and
>> won't be of much use here.
>>
>> Most Hive metastores in the commercial world use PostgreSQL or Oracle
>> as the metastore database; those are battle-proven, replicated, and backed up.
>>
>> Mich Talebzadeh,
>> Technologist | Architect | Data Engineer  | Generative AI | FinCrime
>> London
>> United Kingdom
>>
>>
>>view my Linkedin profile
>> 
>>
>>
>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>
>>
>>
>> *Disclaimer:* The information provided is correct to the best of my
>> knowledge but of course cannot be guaranteed. It is essential to note
>> that, as with any advice, quote "one test result is worth one-thousand
>> expert opinions" (Werner Von Braun).
>>
>>
>> On Thu, 25 Apr 2024 at 15:39, Nimrod Ofek  wrote:
>>
>>> Yes, an in-memory Hive catalog backed by a local Derby DB.
>>> And again, I presume that most metadata-related work happens during
>>> planning and not the actual run, so I don't see why it should strongly affect
>>> query performance.
>>>
>>> Thanks,
>>>
>>>
>>> On Thu, 25 Apr 2024 at 17:29, Mich Talebzadeh <
>>> mich.talebza...@gmail.com>:
>>>
 With regard to your point below

 "The thing I'm missing is this: let's say that the output format I
 choose is delta lake or iceberg or whatever format that uses parquet. Where
 does the catalog implementation (which holds metadata afaik, same metadata
 that iceberg and delta lake save for their tables about their columns)
 comes into play and why should it affect performance? "

 The catalog implementation comes into play regardless of the output
 format chosen (Delta Lake, Iceberg, Parquet, etc.) because it is
 responsible for managing metadata about the datasets, tables, schemas, and
 other objects stored in the aforementioned formats. Even though Delta Lake and
 Iceberg have their metadata management mechanisms internally, they still
 rely on the catalog for providing a unified interface for accessing and
 manipulating metadata across different storage formats.

 "Another thing is that if I understand correctly, and I might be
 totally wrong here, the internal spark catalog is a local installation of
 hive metastore anyway, so I'm not sure what the catalog has 

Re: [DISCUSS] SPARK-46122: Set spark.sql.legacy.createHiveTableByDefault to false

2024-04-25 Thread Nimrod Ofek
Of course, but it's in memory and not persisted, which is much faster.
And as I said, I believe most of the interaction with it happens during the
planning and save phases, not the actual query run; those operations are short
and minimal compared to data fetching and manipulation, so I don't believe
it will have a big impact on query runtime...
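
One way to see that table metadata is resolved at planning time is to ask only
for a plan; a minimal sketch (hypothetical table name t):

    EXPLAIN EXTENDED SELECT * FROM t;  -- resolves t through the catalog without running the query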

On Thu, 25 Apr 2024 at 17:52, Mich Talebzadeh <
mich.talebza...@gmail.com>:

> Well, I would be surprised, because the Derby database is single-threaded and
> won't be of much use here.
>
> Most Hive metastores in the commercial world use PostgreSQL or Oracle as
> the metastore database; those are battle-proven, replicated, and backed up.
>
> Mich Talebzadeh,
> Technologist | Architect | Data Engineer  | Generative AI | FinCrime
> London
> United Kingdom
>
>
>view my Linkedin profile
> 
>
>
>  https://en.everybodywiki.com/Mich_Talebzadeh
>
>
>
> *Disclaimer:* The information provided is correct to the best of my
> knowledge but of course cannot be guaranteed. It is essential to note
> that, as with any advice, quote "one test result is worth one-thousand
> expert opinions" (Werner Von Braun).
>
>
> On Thu, 25 Apr 2024 at 15:39, Nimrod Ofek  wrote:
>
>> Yes, an in-memory Hive catalog backed by a local Derby DB.
>> And again, I presume that most metadata-related work happens during planning
>> and not the actual run, so I don't see why it should strongly affect query
>> performance.
>>
>> Thanks,
>>
>>
>> On Thu, 25 Apr 2024 at 17:29, Mich Talebzadeh <
>> mich.talebza...@gmail.com>:
>>
>>> With regard to your point below
>>>
>>> "The thing I'm missing is this: let's say that the output format I
>>> choose is delta lake or iceberg or whatever format that uses parquet. Where
>>> does the catalog implementation (which holds metadata afaik, same metadata
>>> that iceberg and delta lake save for their tables about their columns)
>>> comes into play and why should it affect performance? "
>>>
>>> The catalog implementation comes into play regardless of the output
>>> format chosen (Delta Lake, Iceberg, Parquet, etc.) because it is
>>> responsible for managing metadata about the datasets, tables, schemas, and
>>> other objects stored in the aforementioned formats. Even though Delta Lake and
>>> Iceberg have their metadata management mechanisms internally, they still
>>> rely on the catalog for providing a unified interface for accessing and
>>> manipulating metadata across different storage formats.
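
That unified interface is visible directly in SQL; a minimal sketch (these
statements go through the session catalog regardless of the table's storage
format; t is a hypothetical table):

    SHOW TABLES;                -- enumerate tables known to the catalog
    DESCRIBE TABLE EXTENDED t;  -- catalog-held schema, provider, and location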
>>>
>>> "Another thing is that if I understand correctly, and I might be totally
>>> wrong here, the internal spark catalog is a local installation of hive
>>> metastore anyway, so I'm not sure what the catalog has to do with anything"
>>>
>>> I don't understand this. Do you mean a Derby database?
>>>
>>> HTH
>>>
>>> Mich Talebzadeh,
>>> Technologist | Architect | Data Engineer  | Generative AI | FinCrime
>>> London
>>> United Kingdom
>>>
>>>
>>>view my Linkedin profile
>>> 
>>>
>>>
>>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>>
>>>
>>>
>>> *Disclaimer:* The information provided is correct to the best of my
>>> knowledge but of course cannot be guaranteed. It is essential to note
>>> that, as with any advice, quote "one test result is worth one-thousand
>>> expert opinions" (Werner Von Braun).
>>>
>>>
>>> On Thu, 25 Apr 2024 at 14:38, Nimrod Ofek  wrote:
>>>
 Thanks for the detailed answer.
 The thing I'm missing is this: let's say that the output format I
 choose is delta lake or iceberg or whatever format that uses parquet. Where
 does the catalog implementation (which holds metadata afaik, same metadata
 that iceberg and delta lake save for their tables about their columns)
 comes into play and why should it affect performance?
 Another thing is that if I understand correctly, and I might be totally
 wrong here, the internal spark catalog is a local installation of hive
 metastore anyway, so I'm not sure what the catalog has to do with anything.

 Thanks!


 On Thu, 25 Apr 2024 at 16:14, Mich Talebzadeh <
 mich.talebza...@gmail.com>:

> My take on your question is that your mileage may vary, so to
> speak.
>
> 1) Hive provides a more mature and widely adopted catalog solution
> that integrates well with other components in the Hadoop ecosystem, such 
> as
> HDFS, HBase, and YARN. IIf you are Hadoop centric S(say on-premise), using
> Hive may offer better compatibility and interoperability.
> 2) Hive provides a SQL-like interface that is familiar to users who
> are accustomed to traditional RDBMs. If your use case involves complex SQL
> 

Re: [DISCUSS] SPARK-46122: Set spark.sql.legacy.createHiveTableByDefault to false

2024-04-25 Thread Mich Talebzadeh
Well, I would be surprised, because the Derby database is single-threaded and
won't be of much use here.

Most Hive metastores in the commercial world use PostgreSQL or Oracle as the
metastore database; those are battle-proven, replicated, and backed up.
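
For reference, pointing the metastore at such a database is a hive-site.xml
change; a minimal sketch assuming a hypothetical PostgreSQL host (the property
names are the standard metastore JDBC settings):

    <property>
      <name>javax.jdo.option.ConnectionURL</name>
      <value>jdbc:postgresql://metastore-host:5432/metastore</value>
    </property>
    <property>
      <name>javax.jdo.option.ConnectionDriverName</name>
      <value>org.postgresql.Driver</value>
    </property>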

Mich Talebzadeh,
Technologist | Architect | Data Engineer  | Generative AI | FinCrime
London
United Kingdom


   view my Linkedin profile



 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* The information provided is correct to the best of my
knowledge but of course cannot be guaranteed. It is essential to note
that, as with any advice, quote "one test result is worth one-thousand
expert opinions" (Werner Von Braun).


On Thu, 25 Apr 2024 at 15:39, Nimrod Ofek  wrote:

> Yes, an in-memory Hive catalog backed by a local Derby DB.
> And again, I presume that most metadata-related work happens during planning
> and not the actual run, so I don't see why it should strongly affect query
> performance.
>
> Thanks,
>
>
> On Thu, 25 Apr 2024 at 17:29, Mich Talebzadeh <
> mich.talebza...@gmail.com>:
>
>> With regard to your point below
>>
>> "The thing I'm missing is this: let's say that the output format I choose
>> is delta lake or iceberg or whatever format that uses parquet. Where does
>> the catalog implementation (which holds metadata afaik, same metadata that
>> iceberg and delta lake save for their tables about their columns) comes
>> into play and why should it affect performance? "
>>
>> The catalog implementation comes into play regardless of the output
>> format chosen (Delta Lake, Iceberg, Parquet, etc.) because it is
>> responsible for managing metadata about the datasets, tables, schemas, and
>> other objects stored in the aforementioned formats. Even though Delta Lake and
>> Iceberg have their metadata management mechanisms internally, they still
>> rely on the catalog for providing a unified interface for accessing and
>> manipulating metadata across different storage formats.
>>
>> "Another thing is that if I understand correctly, and I might be totally
>> wrong here, the internal spark catalog is a local installation of hive
>> metastore anyway, so I'm not sure what the catalog has to do with anything"
>>
>> .I don't understand this. Do you mean a Derby database?
>>
>> HTH
>>
>> Mich Talebzadeh,
>> Technologist | Architect | Data Engineer  | Generative AI | FinCrime
>> London
>> United Kingdom
>>
>>
>>view my Linkedin profile
>> 
>>
>>
>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>
>>
>>
>> *Disclaimer:* The information provided is correct to the best of my
>> knowledge but of course cannot be guaranteed . It is essential to note
>> that, as with any advice, quote "one test result is worth one-thousand
>> expert opinions (Werner
>> Von Braun
>> )".
>>
>>
>> On Thu, 25 Apr 2024 at 14:38, Nimrod Ofek  wrote:
>>
>>> Thanks for the detailed answer.
>>> The thing I'm missing is this: let's say that the output format I choose
>>> is delta lake or iceberg or whatever format that uses parquet. Where does
>>> the catalog implementation (which holds metadata afaik, same metadata that
>>> iceberg and delta lake save for their tables about their columns) comes
>>> into play and why should it affect performance?
>>> Another thing is that if I understand correctly, and I might be totally
>>> wrong here, the internal spark catalog is a local installation of hive
>>> metastore anyway, so I'm not sure what the catalog has to do with anything.
>>>
>>> Thanks!
>>>
>>>
>>> בתאריך יום ה׳, 25 באפר׳ 2024, 16:14, מאת Mich Talebzadeh ‏<
>>> mich.talebza...@gmail.com>:
>>>
 My take regarding your question is that your mileage varies so to
 speak.

 1) Hive provides a more mature and widely adopted catalog solution that
 integrates well with other components in the Hadoop ecosystem, such as
 HDFS, HBase, and YARN. IIf you are Hadoop centric S(say on-premise), using
 Hive may offer better compatibility and interoperability.
 2) Hive provides a SQL-like interface that is familiar to users who are
 accustomed to traditional RDBMs. If your use case involves complex SQL
 queries or existing SQL-based workflows, using Hive may be advantageous.
 3) If you are looking for performance, spark's native catalog tends to
 offer better performance for certain workloads, particularly those that
 involve iterative processing or complex data transformations.(my
 understanding). Spark's in-memory processing capabilities and optimizations
 make it well-suited for interactive analytics and machine learning
 tasks.(my favourite)
 4) Integration with Spark Workflows: If you primarily use Spark for
 data 

Re: [DISCUSS] SPARK-46122: Set spark.sql.legacy.createHiveTableByDefault to false

2024-04-25 Thread Nimrod Ofek
Yes, an in-memory Hive catalog backed by a local Derby DB.
And again, I presume that most metadata-related work happens during
planning and not during the actual run, so I don't see why it should
strongly affect query performance.
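
A rough way to check this, as a sketch (the table name default.events is a
placeholder for any existing table):

// Sketch: catalog/metastore lookups happen while the plan is analysed and
// optimised, not per-row at execution time. spark.time prints the
// wall-clock time of the wrapped expression.
val df = spark.table("default.events")
spark.time(df.queryExecution.optimizedPlan) // analysis + optimisation: hits the catalog
spark.time(df.count())                      // execution: scans the data files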

Thanks,


Re: [DISCUSS] SPARK-46122: Set spark.sql.legacy.createHiveTableByDefault to false

2024-04-25 Thread Mich Talebzadeh
With regard to your point below:

"The thing I'm missing is this: let's say that the output format I choose
is Delta Lake or Iceberg or whatever format that uses Parquet. Where does
the catalog implementation (which holds metadata, afaik the same metadata
that Iceberg and Delta Lake save for their tables about their columns) come
into play, and why should it affect performance?"

The catalog implementation comes into play regardless of the output format
chosen (Delta Lake, Iceberg, Parquet, etc.) because it is responsible for
managing metadata about the datasets, tables, schemas, and other objects
stored in the aforementioned formats. Even though Delta Lake and Iceberg
have their own internal metadata management mechanisms, they still rely on
the catalog to provide a unified interface for accessing and manipulating
metadata across different storage formats.
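
As a minimal sketch of that unified interface (assuming the Iceberg Spark
runtime jar is on the classpath; the catalog name demo and the warehouse
path are placeholders):

import org.apache.spark.sql.SparkSession

// Sketch: plug an Iceberg catalog into Spark's catalog plugin API so that
// Iceberg tables are addressed through the same catalog.namespace.table
// interface as everything else.
val spark = SparkSession.builder()
  .appName("catalog-plugin-sketch")
  .config("spark.sql.catalog.demo", "org.apache.iceberg.spark.SparkCatalog")
  .config("spark.sql.catalog.demo.type", "hadoop")
  .config("spark.sql.catalog.demo.warehouse", "/tmp/iceberg-warehouse")
  .getOrCreate()

// Metadata operations (create, describe, column resolution) go through the
// catalog, even though Iceberg keeps its own table-level metadata files.
spark.sql("CREATE NAMESPACE IF NOT EXISTS demo.db")
spark.sql("CREATE TABLE demo.db.events (id BIGINT, ts TIMESTAMP) USING iceberg")
spark.sql("SHOW TABLES IN demo.db").show()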

"Another thing is that if I understand correctly, and I might be totally
wrong here, the internal spark catalog is a local installation of hive
metastore anyway, so I'm not sure what the catalog has to do with anything"

.I don't understand this. Do you mean a Derby database?

HTH

Mich Talebzadeh,
Technologist | Architect | Data Engineer  | Generative AI | FinCrime
London
United Kingdom


   view my Linkedin profile



 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* The information provided is correct to the best of my
knowledge but of course cannot be guaranteed. It is essential to note
that, as with any advice, quote "one test result is worth one-thousand
expert opinions (Werner von Braun)".


Re: [DISCUSS] SPARK-46122: Set spark.sql.legacy.createHiveTableByDefault to false

2024-04-25 Thread Wenchen Fan
It's for the data source. For example, Spark's built-in Parquet
reader/writer is faster than the Hive serde Parquet reader/writer.
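
To illustrate with a minimal sketch (assuming a Hive-enabled session; the
table names are placeholders):

// Sketch: the two flavours of Parquet table under discussion. The USING
// variant is a Spark native data source table; STORED AS creates a Hive
// serde table.
spark.sql("CREATE TABLE t_native (id INT, name STRING) USING parquet")
spark.sql("CREATE TABLE t_hive (id INT, name STRING) STORED AS PARQUET")

// By default spark.sql.hive.convertMetastoreParquet=true, so Spark swaps in
// its native Parquet reader even when scanning the Hive serde table; set it
// to false to force the Hive serde read path and compare.
spark.conf.set("spark.sql.hive.convertMetastoreParquet", "false")
spark.sql("SELECT COUNT(*) FROM t_hive").show()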


Re: [DISCUSS] SPARK-46122: Set spark.sql.legacy.createHiveTableByDefault to false

2024-04-25 Thread Nimrod Ofek
Thanks for the detailed answer.
The thing I'm missing is this: let's say that the output format I choose
is Delta Lake or Iceberg or whatever format that uses Parquet. Where does
the catalog implementation (which holds metadata, afaik the same metadata
that Iceberg and Delta Lake save for their tables about their columns) come
into play, and why should it affect performance?
Another thing is that, if I understand correctly (and I might be totally
wrong here), the internal Spark catalog is a local installation of the Hive
metastore anyway, so I'm not sure what the catalog has to do with anything.

Thanks!


Re: [DISCUSS] SPARK-46122: Set spark.sql.legacy.createHiveTableByDefault to false

2024-04-25 Thread Mich Talebzadeh
My take regarding your question is that your mileage varies, so to speak.

1) Hive provides a more mature and widely adopted catalog solution that
integrates well with other components in the Hadoop ecosystem, such as
HDFS, HBase, and YARN. If you are Hadoop-centric (say, on-premise), using
Hive may offer better compatibility and interoperability.
2) Hive provides a SQL-like interface that is familiar to users who are
accustomed to traditional RDBMSs. If your use case involves complex SQL
queries or existing SQL-based workflows, using Hive may be advantageous.
3) If you are looking for performance, Spark's native catalog tends to
offer better performance for certain workloads, particularly those that
involve iterative processing or complex data transformations (my
understanding). Spark's in-memory processing capabilities and
optimizations make it well suited for interactive analytics and machine
learning tasks (my favourite).
4) Integration with Spark workflows: if you primarily use Spark for data
processing and analytics, using Spark's native catalog may simplify
workflow management and reduce overhead. Spark's tight integration with
its catalog allows for seamless interaction with Spark applications and
libraries; a short sketch of that catalog API follows below.
5) There seems to be some similarity between the Spark catalog and
Databricks Unity Catalog, so that may favour the choice.
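
As a minimal sketch of the catalog API mentioned in point 4 (assuming an
existing table named events in the default database, a placeholder):

// Sketch: list tables programmatically and check which provider backs a
// given table. The Provider row reads `hive` for Hive serde tables and a
// data source name such as `parquet` for Spark native tables.
spark.catalog.listTables("default").show(truncate = false)
spark.sql("DESCRIBE TABLE EXTENDED events")
  .filter("col_name = 'Provider'")
  .show(truncate = false)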

HTH

Mich Talebzadeh,
Technologist | Architect | Data Engineer  | Generative AI | FinCrime
London
United Kingdom


   view my Linkedin profile



 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* The information provided is correct to the best of my
knowledge but of course cannot be guaranteed. It is essential to note
that, as with any advice, quote "one test result is worth one-thousand
expert opinions (Werner von Braun)".


Re: [DISCUSS] SPARK-46122: Set spark.sql.legacy.createHiveTableByDefault to false

2024-04-25 Thread Nimrod Ofek
I would also appreciate some material that describes the differences
between Spark native tables and Hive tables, and when each should be used...

Thanks
Nimrod


Re: [DISCUSS] SPARK-46122: Set spark.sql.legacy.createHiveTableByDefault to false

2024-04-25 Thread Mich Talebzadeh
I see a statement made as below, and I quote:

"The proposal of SPARK-46122 is to switch the default value of this
configuration from `true` to `false` to use Spark native tables because
we support better."

Can you please elaborate on the above, specifically with regard to the
phrase ".. because we support better"?

Are you referring to the performance of the Spark catalog (I believe it is
internal) or to integration with Spark?

HTH

Mich Talebzadeh,
Technologist | Architect | Data Engineer  | Generative AI | FinCrime
London
United Kingdom


   view my Linkedin profile



 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* The information provided is correct to the best of my
knowledge but of course cannot be guaranteed. It is essential to note
that, as with any advice, quote "one test result is worth one-thousand
expert opinions (Werner von Braun)".


Re: [DISCUSS] SPARK-46122: Set spark.sql.legacy.createHiveTableByDefault to false

2024-04-25 Thread Wenchen Fan
+1


Re: [DISCUSS] SPARK-46122: Set spark.sql.legacy.createHiveTableByDefault to false

2024-04-25 Thread Kent Yao
+1

Nit: the umbrella ticket is SPARK-44111, not SPARK-4.

Thanks,
Kent Yao

Dongjoon Hyun wrote on Thu, 25 Apr 2024 at 14:39:
>
> Hi, All.
>
> It's great to see community activities to polish 4.0.0 more and more.
> Thank you all.
>
> I'd like to bring SPARK-46122 (another SQL topic) to you from the subtasks
> of SPARK-4 (Prepare Apache Spark 4.0.0),
>
> - https://issues.apache.org/jira/browse/SPARK-46122
>Set `spark.sql.legacy.createHiveTableByDefault` to `false` by default
>
> This legacy configuration is about `CREATE TABLE` SQL syntax without
> `USING` and `STORED AS`, which is currently mapped to `Hive` table.
> The proposal of SPARK-46122 is to switch the default value of this
> configuration from `true` to `false` to use Spark native tables because
> we support better.
>
> In other words, Spark will use the value of `spark.sql.sources.default`
> as the table provider instead of `Hive` like the other Spark APIs. Of course,
> the users can get all the legacy behavior by setting back to `true`.
>
> Historically, this behavior change was merged once at Apache Spark 3.0.0
> preparation via SPARK-30098 already, but reverted during the 3.0.0 RC period.
>
> 2019-12-06: SPARK-30098 Use default datasource as provider for CREATE TABLE
> 2020-05-16: SPARK-31707 Revert SPARK-30098 Use default datasource as
> provider for CREATE TABLE command
>
> At Apache Spark 3.1.0, we had another discussion about this and defined it
> as one of the legacy behaviors behind this configuration, under the reused
> ID SPARK-30098.
>
> 2020-12-01: https://lists.apache.org/thread/8c8k1jk61pzlcosz3mxo4rkj5l23r204
> 2020-12-03: SPARK-30098 Add a configuration to use default datasource as
> provider for CREATE TABLE command
>
> Last year, we received two additional requests to switch this, because
> Apache Spark 4.0.0 is a good time to make a decision for the future direction.
>
> 2023-02-27: SPARK-42603 as an independent idea.
> 2023-11-27: SPARK-46122 as a part of Apache Spark 4.0.0 idea
>
>
> WDYT? The technical scope is defined in the following PR, which is one line
> of main code, one line of migration guide, and a few lines of test code.
>
> - https://github.com/apache/spark/pull/46207
>
> Dongjoon.

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
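
For reference, a minimal sketch of the behaviour change described in the
proposal above (assuming the shipped default spark.sql.sources.default=parquet,
a Hive-enabled session, and that the legacy flag can be set per session; the
table names are placeholders):

// Sketch: with the legacy flag false, a bare CREATE TABLE yields a Spark
// native table backed by spark.sql.sources.default; with true, a Hive serde
// table, as described in the proposal.
spark.conf.set("spark.sql.legacy.createHiveTableByDefault", "false")
spark.sql("CREATE TABLE t_new (id INT)")    // provider: parquet

spark.conf.set("spark.sql.legacy.createHiveTableByDefault", "true")
spark.sql("CREATE TABLE t_legacy (id INT)") // provider: hive (legacy)

// The Provider row confirms which path each table took.
Seq("t_new", "t_legacy").foreach { t =>
  spark.sql(s"DESCRIBE TABLE EXTENDED $t")
    .filter("col_name = 'Provider'")
    .show(truncate = false)
}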