Re: Introducing Comet, a plugin to accelerate Spark execution via DataFusion and Arrow

2024-02-19 Thread Mich Talebzadeh
Ok thanks for your clarifications

Mich Talebzadeh,
Dad | Technologist | Solutions Architect | Engineer
London
United Kingdom



 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* The information provided is correct to the best of my
knowledge but of course cannot be guaranteed. It is essential to note
that, as with any advice, quote "one test result is worth one-thousand
expert opinions" (Wernher von Braun).


On Mon, 19 Feb 2024 at 17:24, Chao Sun wrote:

> Hi Mich,
>
> > Also have you got some benchmark results from your tests that you can
> > possibly share?
>
> We only have some partial benchmark results internally so far. Once
> shuffle and better memory management have been introduced, we plan to
> publish the benchmark results (at least TPC-H) in the repo.
>
> > Compared to standard Spark, what kind of performance gains can be
> > expected with Comet?
>
> Currently, users could benefit from Comet in a few areas:
> - Parquet read: a few improvements have been made against reading from S3
> in particular, so users can expect better scan performance in this scenario
> - Hash aggregation
> - Columnar shuffle
> - Decimals (Java's BigDecimal is pretty slow)
>
> > Can one use Comet on k8s in conjunction with something like a Volcano
> > addon?
>
> I think so. Comet is mostly orthogonal to the Spark scheduler framework.
>
> Chao
>
>
>
>
>
>
> On Fri, Feb 16, 2024 at 5:39 AM Mich Talebzadeh 
> wrote:
>
>> Hi Chao,
>>
>> As a cool feature
>>
>>
>>    - Compared to standard Spark, what kind of performance gains can be
>>    expected with Comet?
>>    - Can one use Comet on k8s in conjunction with something like a
>>    Volcano addon?
>>
>>
>> HTH
>>
>> Mich Talebzadeh,
>> Dad | Technologist | Solutions Architect | Engineer
>> London
>> United Kingdom
>>
>>
>>
>>
>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>
>>
>>
>> *Disclaimer:* The information provided is correct to the best of my
>> knowledge, sourced from both personal expertise and other resources but of
>> course cannot be guaranteed . It is essential to note that, as with any
>> advice, one verified and tested result holds more weight than a thousand
>> expert opinions.
>>
>>
>> On Tue, 13 Feb 2024 at 20:42, Chao Sun  wrote:
>>
>>> Hi all,
>>>
>>> We are very happy to announce that Project Comet, a plugin to
>>> accelerate Spark query execution via leveraging DataFusion and Arrow,
>>> has now been open sourced under the Apache Arrow umbrella. Please
>>> check the project repo
>>> https://github.com/apache/arrow-datafusion-comet for more details if
>>> you are interested. We'd love to collaborate with people from the open
>>> source community who share similar goals.
>>>
>>> Thanks,
>>> Chao
>>>
>>> -
>>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>>>
>>>


Re: Introducing Comet, a plugin to accelerate Spark execution via DataFusion and Arrow

2024-02-19 Thread Chao Sun
Hi Mich,

> Also have you got some benchmark results from your tests that you can
> possibly share?

We only have some partial benchmark results internally so far. Once shuffle
and better memory management have been introduced, we plan to publish the
benchmark results (at least TPC-H) in the repo.

> Compared to standard Spark, what kind of performance gains can be
> expected with Comet?

Currently, users can benefit from Comet in a few areas:
- Parquet read: a few improvements have been made to reading from S3
  in particular, so users can expect better scan performance in this scenario
- Hash aggregation
- Columnar shuffle
- Decimals (Java's BigDecimal is pretty slow)
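
To illustrate the decimals point, here is a minimal sketch of the general
idea only; it is not Comet's implementation. Arrow's decimal types store each
value as an unscaled integer with the scale kept once in the column type, so
summing a column becomes integer addition instead of arithmetic on one
object per value. The helper names and the sample prices are made up for
illustration, and Python's Decimal merely stands in for BigDecimal.

```python
from decimal import Decimal

SCALE = 2  # fractional digits shared by every value in the column


def to_unscaled(value: str, scale: int = SCALE) -> int:
    """Convert a decimal string to its unscaled integer form."""
    return int(Decimal(value).scaleb(scale))


def from_unscaled(unscaled: int, scale: int = SCALE) -> Decimal:
    """Re-apply the shared scale to an unscaled integer."""
    return Decimal(unscaled).scaleb(-scale)


prices = ["19.99", "5.01", "100.00"]  # hypothetical sample column

# Object-per-value arithmetic, analogous to adding BigDecimal instances.
object_sum = sum((Decimal(p) for p in prices), Decimal(0))

# Columnar style: plain integer addition, with the scale applied once.
unscaled_sum = sum(to_unscaled(p) for p in prices)
columnar_sum = from_unscaled(unscaled_sum)

# Both routes give the same answer.
assert columnar_sum == object_sum
```

The sketch says nothing about actual speedups; those depend on the engine
and the workload.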

> Can one use Comet on k8s in conjunction with something like a Volcano
> addon?

I think so. Comet is mostly orthogonal to the Spark scheduler framework.

Chao






On Fri, Feb 16, 2024 at 5:39 AM Mich Talebzadeh wrote:

> Hi Chao,
>
> As a cool feature
>
>
>    - Compared to standard Spark, what kind of performance gains can be
>    expected with Comet?
>    - Can one use Comet on k8s in conjunction with something like a
>    Volcano addon?
>
>
> HTH
>
> Mich Talebzadeh,
> Dad | Technologist | Solutions Architect | Engineer
> London
> United Kingdom
>
>
>
>
>  https://en.everybodywiki.com/Mich_Talebzadeh
>
>
>
> *Disclaimer:* The information provided is correct to the best of my
> knowledge, sourced from both personal expertise and other resources but of
> course cannot be guaranteed . It is essential to note that, as with any
> advice, one verified and tested result holds more weight than a thousand
> expert opinions.
>
>
> On Tue, 13 Feb 2024 at 20:42, Chao Sun  wrote:
>
>> Hi all,
>>
>> We are very happy to announce that Project Comet, a plugin to
>> accelerate Spark query execution via leveraging DataFusion and Arrow,
>> has now been open sourced under the Apache Arrow umbrella. Please
>> check the project repo
>> https://github.com/apache/arrow-datafusion-comet for more details if
>> you are interested. We'd love to collaborate with people from the open
>> source community who share similar goals.
>>
>> Thanks,
>> Chao
>>
>> -
>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>>
>>


Re: Introducing Comet, a plugin to accelerate Spark execution via DataFusion and Arrow

2024-02-16 Thread Mich Talebzadeh
Hi Chao,

As a cool feature


   - Compared to standard Spark, what kind of performance gains can be
   expected with Comet?
   - Can one use Comet on k8s in conjunction with something like a Volcano
   addon?


HTH

Mich Talebzadeh,
Dad | Technologist | Solutions Architect | Engineer
London
United Kingdom



 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* The information provided is correct to the best of my
knowledge, sourced from both personal expertise and other resources, but of
course cannot be guaranteed. It is essential to note that, as with any
advice, one verified and tested result holds more weight than a thousand
expert opinions.


On Tue, 13 Feb 2024 at 20:42, Chao Sun wrote:

> Hi all,
>
> We are very happy to announce that Project Comet, a plugin to
> accelerate Spark query execution via leveraging DataFusion and Arrow,
> has now been open sourced under the Apache Arrow umbrella. Please
> check the project repo
> https://github.com/apache/arrow-datafusion-comet for more details if
> you are interested. We'd love to collaborate with people from the open
> source community who share similar goals.
>
> Thanks,
> Chao
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>


Re: Introducing Comet, a plugin to accelerate Spark execution via DataFusion and Arrow

2024-02-15 Thread Mich Talebzadeh
Hi, I gather from the replies that the plugin is not currently available in
the form expected, although I am aware of the shell script.

Also have you got some benchmark results from your tests that you can
possibly share?

Thanks,

Mich Talebzadeh,
Dad | Technologist | Solutions Architect | Engineer
London
United Kingdom



 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* The information provided is correct to the best of my
knowledge, sourced from both personal expertise and other resources, but of
course cannot be guaranteed. It is essential to note that, as with any
advice, one verified and tested result holds more weight than a thousand
expert opinions.


On Thu, 15 Feb 2024 at 01:18, Chao Sun wrote:

> Hi Praveen,
>
> We will add a "Getting Started" section in the README soon, but basically
> comet-spark-shell in the repo should provide a basic tool to build Comet
> and launch a Spark shell with it.
>
> Note that we haven't open sourced several features yet including shuffle
> support, which the aggregate operation depends on. Please stay tuned!
>
> Chao
>
>
> On Wed, Feb 14, 2024 at 2:44 PM praveen sinha 
> wrote:
>
>> Hi Chao,
>>
>> Is there any example app/gist/repo which can help me use this plugin. I
>> wanted to try out some realtime aggregate performance on top of parquet and
>> spark dataframes.
>>
>> Thanks and Regards
>> Praveen
>>
>>
>> On Wed, Feb 14, 2024 at 9:20 AM Chao Sun  wrote:
>>
>>> > Out of interest what are the differences in the approach between this
>>> > and Gluten?
>>>
>>> Overall they are similar, although Gluten supports multiple backends
>>> including Velox and Clickhouse. One major difference is (obviously)
>>> Comet is based on DataFusion and Arrow, and written in Rust, while
>>> Gluten is mostly C++.
>>> I haven't looked very deep into Gluten yet, but there could be other
>>> differences such as how strictly the engine follows Spark's semantics,
>>> table format support (Iceberg, Delta, etc), fallback mechanism
>>> (coarse-grained fallback on stage level or more fine-grained fallback
>>> within stages), UDF support (Comet hasn't started on this yet),
>>> shuffle support, memory management, etc.
>>>
>>> Both engines are backed by very strong and vibrant open source
>>> communities (Velox, Clickhouse, Arrow & DataFusion) so it's very
>>> exciting to see how the projects will grow in future.
>>>
>>> Best,
>>> Chao
>>>
>>> On Tue, Feb 13, 2024 at 10:06 PM John Zhuge  wrote:
>>> >
>>> > Congratulations! Excellent work!
>>> >
>>> > On Tue, Feb 13, 2024 at 8:04 PM Yufei Gu  wrote:
>>> >>
>>> >> Absolutely thrilled to see the project going open-source! Huge
>>> >> congrats to Chao and the entire team on this milestone!
>>> >>
>>> >> Yufei
>>> >>
>>> >>
>>> >> On Tue, Feb 13, 2024 at 12:43 PM Chao Sun  wrote:
>>> >>>
>>> >>> Hi all,
>>> >>>
>>> >>> We are very happy to announce that Project Comet, a plugin to
>>> >>> accelerate Spark query execution via leveraging DataFusion and Arrow,
>>> >>> has now been open sourced under the Apache Arrow umbrella. Please
>>> >>> check the project repo
>>> >>> https://github.com/apache/arrow-datafusion-comet for more details if
>>> >>> you are interested. We'd love to collaborate with people from the
>>> >>> open source community who share similar goals.
>>> >>>
>>> >>> Thanks,
>>> >>> Chao
>>> >>>
>>> >>> -
>>> >>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>> >>>
>>> >
>>> >
>>> > --
>>> > John Zhuge
>>>
>>> -
>>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>>>
>>>


Re: Introducing Comet, a plugin to accelerate Spark execution via DataFusion and Arrow

2024-02-14 Thread Chao Sun
Hi Praveen,

We will add a "Getting Started" section in the README soon, but basically
comet-spark-shell in the repo should provide a basic tool to build Comet
and launch a Spark shell with it.

Note that we haven't open sourced several features yet including shuffle
support, which the aggregate operation depends on. Please stay tuned!
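
As a rough illustration of what launching a Spark shell with Comet amounts
to (a sketch under assumptions, not the contents of comet-spark-shell): the
Comet jar is passed via --jars and Comet is switched on through Spark
configuration. The configuration keys below are recalled from the project
README and may differ between versions, the jar path is a placeholder, and
the helper function is hypothetical, so please check everything against the
repo.

```python
import shlex

# Assumed settings: recalled from the Comet README, not verified here;
# key names may change between Comet versions.
COMET_CONF = {
    "spark.sql.extensions": "org.apache.comet.CometSparkSessionExtensions",
    "spark.comet.enabled": "true",
    "spark.comet.exec.enabled": "true",
}


def spark_shell_command(comet_jar, conf=COMET_CONF):
    """Build the argv list for a spark-shell launch with the given jar."""
    argv = ["spark-shell", "--jars", comet_jar]
    for key, value in conf.items():
        argv += ["--conf", f"{key}={value}"]
    return argv


# Placeholder path: the real jar name depends on how Comet was built.
command = spark_shell_command("/path/to/comet-spark.jar")
print(shlex.join(command))
```

The printed line can be pasted into a terminal once the jar path points at a
locally built Comet jar and spark-shell is on the PATH.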

Chao


On Wed, Feb 14, 2024 at 2:44 PM praveen sinha wrote:

> Hi Chao,
>
> Is there any example app/gist/repo which can help me use this plugin. I
> wanted to try out some realtime aggregate performance on top of parquet and
> spark dataframes.
>
> Thanks and Regards
> Praveen
>
>
> On Wed, Feb 14, 2024 at 9:20 AM Chao Sun  wrote:
>
>> > Out of interest what are the differences in the approach between this
>> > and Gluten?
>>
>> Overall they are similar, although Gluten supports multiple backends
>> including Velox and Clickhouse. One major difference is (obviously)
>> Comet is based on DataFusion and Arrow, and written in Rust, while
>> Gluten is mostly C++.
>> I haven't looked very deep into Gluten yet, but there could be other
>> differences such as how strictly the engine follows Spark's semantics,
>> table format support (Iceberg, Delta, etc), fallback mechanism
>> (coarse-grained fallback on stage level or more fine-grained fallback
>> within stages), UDF support (Comet hasn't started on this yet),
>> shuffle support, memory management, etc.
>>
>> Both engines are backed by very strong and vibrant open source
>> communities (Velox, Clickhouse, Arrow & DataFusion) so it's very
>> exciting to see how the projects will grow in future.
>>
>> Best,
>> Chao
>>
>> On Tue, Feb 13, 2024 at 10:06 PM John Zhuge  wrote:
>> >
>> > Congratulations! Excellent work!
>> >
>> > On Tue, Feb 13, 2024 at 8:04 PM Yufei Gu  wrote:
>> >>
>> >> Absolutely thrilled to see the project going open-source! Huge
>> >> congrats to Chao and the entire team on this milestone!
>> >>
>> >> Yufei
>> >>
>> >>
>> >> On Tue, Feb 13, 2024 at 12:43 PM Chao Sun  wrote:
>> >>>
>> >>> Hi all,
>> >>>
>> >>> We are very happy to announce that Project Comet, a plugin to
>> >>> accelerate Spark query execution via leveraging DataFusion and Arrow,
>> >>> has now been open sourced under the Apache Arrow umbrella. Please
>> >>> check the project repo
>> >>> https://github.com/apache/arrow-datafusion-comet for more details if
>> >>> you are interested. We'd love to collaborate with people from the open
>> >>> source community who share similar goals.
>> >>>
>> >>> Thanks,
>> >>> Chao
>> >>>
>> >>> -
>> >>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>> >>>
>> >
>> >
>> > --
>> > John Zhuge
>>
>> -
>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>>
>>


Re: Introducing Comet, a plugin to accelerate Spark execution via DataFusion and Arrow

2024-02-14 Thread Liu(Laswift) Cao
This is very cool! Congrats on the amazing work, Chao and the team!
It's exciting to see this native engine trend within the community. Other
than Gluten, I also ran into https://github.com/blaze-init/blaze (but
haven't evaluated it in detail).

On Wed, Feb 14, 2024 at 09:20 Chao Sun wrote:

> > Out of interest what are the differences in the approach between this
> > and Gluten?
>
> Overall they are similar, although Gluten supports multiple backends
> including Velox and Clickhouse. One major difference is (obviously)
> Comet is based on DataFusion and Arrow, and written in Rust, while
> Gluten is mostly C++.
> I haven't looked very deep into Gluten yet, but there could be other
> differences such as how strictly the engine follows Spark's semantics,
> table format support (Iceberg, Delta, etc), fallback mechanism
> (coarse-grained fallback on stage level or more fine-grained fallback
> within stages), UDF support (Comet hasn't started on this yet),
> shuffle support, memory management, etc.
>
> Both engines are backed by very strong and vibrant open source
> communities (Velox, Clickhouse, Arrow & DataFusion) so it's very
> exciting to see how the projects will grow in future.
>
> Best,
> Chao
>
> On Tue, Feb 13, 2024 at 10:06 PM John Zhuge  wrote:
> >
> > Congratulations! Excellent work!
> >
> > On Tue, Feb 13, 2024 at 8:04 PM Yufei Gu  wrote:
> >>
> >> Absolutely thrilled to see the project going open-source! Huge congrats
> >> to Chao and the entire team on this milestone!
> >>
> >> Yufei
> >>
> >>
> >> On Tue, Feb 13, 2024 at 12:43 PM Chao Sun  wrote:
> >>>
> >>> Hi all,
> >>>
> >>> We are very happy to announce that Project Comet, a plugin to
> >>> accelerate Spark query execution via leveraging DataFusion and Arrow,
> >>> has now been open sourced under the Apache Arrow umbrella. Please
> >>> check the project repo
> >>> https://github.com/apache/arrow-datafusion-comet for more details if
> >>> you are interested. We'd love to collaborate with people from the open
> >>> source community who share similar goals.
> >>>
> >>> Thanks,
> >>> Chao
> >>>
> >>> -
> >>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
> >>>
> >
> >
> > --
> > John Zhuge
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
--
Liu Cao


Re: Introducing Comet, a plugin to accelerate Spark execution via DataFusion and Arrow

2024-02-14 Thread Chao Sun
> Out of interest what are the differences in the approach between this and
> Gluten?

Overall they are similar, although Gluten supports multiple backends
including Velox and ClickHouse. One major difference is (obviously) that
Comet is based on DataFusion and Arrow, and written in Rust, while
Gluten is mostly C++.
I haven't looked very deeply into Gluten yet, but there could be other
differences such as how strictly the engine follows Spark's semantics,
table format support (Iceberg, Delta, etc.), fallback mechanism
(coarse-grained fallback at the stage level or more fine-grained fallback
within stages), UDF support (Comet hasn't started on this yet),
shuffle support, memory management, etc.

Both engines are backed by very strong and vibrant open source
communities (Velox, ClickHouse, Arrow & DataFusion), so it's very
exciting to see how the projects will grow in future.

Best,
Chao

On Tue, Feb 13, 2024 at 10:06 PM John Zhuge wrote:
>
> Congratulations! Excellent work!
>
> On Tue, Feb 13, 2024 at 8:04 PM Yufei Gu  wrote:
>>
>> Absolutely thrilled to see the project going open-source! Huge congrats to 
>> Chao and the entire team on this milestone!
>>
>> Yufei
>>
>>
>> On Tue, Feb 13, 2024 at 12:43 PM Chao Sun  wrote:
>>>
>>> Hi all,
>>>
>>> We are very happy to announce that Project Comet, a plugin to
>>> accelerate Spark query execution via leveraging DataFusion and Arrow,
>>> has now been open sourced under the Apache Arrow umbrella. Please
>>> check the project repo
>>> https://github.com/apache/arrow-datafusion-comet for more details if
>>> you are interested. We'd love to collaborate with people from the open
>>> source community who share similar goals.
>>>
>>> Thanks,
>>> Chao
>>>
>>> -
>>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>>
>
>
> --
> John Zhuge

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: Introducing Comet, a plugin to accelerate Spark execution via DataFusion and Arrow

2024-02-13 Thread John Zhuge
Congratulations! Excellent work!

On Tue, Feb 13, 2024 at 8:04 PM Yufei Gu wrote:

> Absolutely thrilled to see the project going open-source! Huge congrats to
> Chao and the entire team on this milestone!
>
> Yufei
>
>
> On Tue, Feb 13, 2024 at 12:43 PM Chao Sun  wrote:
>
>> Hi all,
>>
>> We are very happy to announce that Project Comet, a plugin to
>> accelerate Spark query execution via leveraging DataFusion and Arrow,
>> has now been open sourced under the Apache Arrow umbrella. Please
>> check the project repo
>> https://github.com/apache/arrow-datafusion-comet for more details if
>> you are interested. We'd love to collaborate with people from the open
>> source community who share similar goals.
>>
>> Thanks,
>> Chao
>>
>> -
>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>
>>

-- 
John Zhuge


Re: Introducing Comet, a plugin to accelerate Spark execution via DataFusion and Arrow

2024-02-13 Thread Yufei Gu
Absolutely thrilled to see the project going open-source! Huge congrats to
Chao and the entire team on this milestone!

Yufei


On Tue, Feb 13, 2024 at 12:43 PM Chao Sun wrote:

> Hi all,
>
> We are very happy to announce that Project Comet, a plugin to
> accelerate Spark query execution via leveraging DataFusion and Arrow,
> has now been open sourced under the Apache Arrow umbrella. Please
> check the project repo
> https://github.com/apache/arrow-datafusion-comet for more details if
> you are interested. We'd love to collaborate with people from the open
> source community who share similar goals.
>
> Thanks,
> Chao
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


Re: Introducing Comet, a plugin to accelerate Spark execution via DataFusion and Arrow

2024-02-13 Thread Holden Karau
This looks really cool :) Out of interest, what are the differences in the
approach between this and Gluten?

On Tue, Feb 13, 2024 at 12:42 PM Chao Sun wrote:

> Hi all,
>
> We are very happy to announce that Project Comet, a plugin to
> accelerate Spark query execution via leveraging DataFusion and Arrow,
> has now been open sourced under the Apache Arrow umbrella. Please
> check the project repo
> https://github.com/apache/arrow-datafusion-comet for more details if
> you are interested. We'd love to collaborate with people from the open
> source community who share similar goals.
>
> Thanks,
> Chao
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>