Re: [DISCUSS] Spark 4.0.0 release

2024-04-16 Thread Cheng Pan
Will we have a preview release for 4.0.0, like we did for 2.0.0 and 3.0.0?

Thanks,
Cheng Pan


> On Apr 15, 2024, at 09:58, Jungtaek Lim  wrote:
> 
> W.r.t. the state data source - reader (SPARK-45511), there are several
> follow-up tickets, but we don't plan to address them soon. The current
> implementation is the final shape for Spark 4.0.0, unless there is demand
> for the follow-up tickets.
> 
> We may want to check the plan for transformWithState - my understanding is
> that we want to ship the feature in 4.0.0, but several pieces of work remain.
> While the tentative timeline for the release is June 2024, what would be the
> tentative timeline for the RC cut?
> (cc. Anish to add more context on the plan for transformWithState)
> 
> On Sat, Apr 13, 2024 at 3:15 AM Wenchen Fan  wrote:
> Hi all,
> 
> It's close to the previously proposed 4.0.0 release date (June 2024), and I 
> think it's time to prepare for it and discuss the ongoing projects:
> • ANSI by default
> • Spark Connect GA
> • Structured Logging
> • Streaming state store data source
> • new data type VARIANT
> • STRING collation support
> • Spark k8s operator versioning
> Please add any items that are missing from this list. I would like to
> volunteer as the release manager for Apache Spark 4.0.0 if there are no
> objections. Thank you all for the great work that fills Spark 4.0!
> 
> Wenchen Fan





Re: [VOTE] Release Spark 3.4.3 (RC2)

2024-04-16 Thread huaxin gao
+1

On Tue, Apr 16, 2024 at 6:55 PM Kent Yao  wrote:

> +1(non-binding)
>
> Thanks,
> Kent Yao
>
> bo yang  wrote on Wed, 17 Apr 2024 at 09:49:
> >
> > +1
> >
> > On Tue, Apr 16, 2024 at 1:38 PM Hyukjin Kwon  wrote:
> >>
> >> +1
> >>
> >> On Wed, Apr 17, 2024 at 3:57 AM L. C. Hsieh  wrote:
> >>>
> >>> +1
> >>>
> >>> On Tue, Apr 16, 2024 at 4:08 AM Wenchen Fan  wrote:
> >>> >
> >>> > +1
> >>> >
> >>> > On Mon, Apr 15, 2024 at 12:31 PM Dongjoon Hyun  wrote:
> >>> >>
> >>> >> I'll start with my +1.
> >>> >>
> >>> >> - Checked checksum and signature
> >>> >> - Checked Scala/Java/R/Python/SQL Document's Spark version
> >>> >> - Checked published Maven artifacts
> >>> >> - All CIs passed.
> >>> >>
> >>> >> Thanks,
> >>> >> Dongjoon.
> >>> >>
> >>> >> On 2024/04/15 04:22:26 Dongjoon Hyun wrote:
> >>> >> > Please vote on releasing the following candidate as Apache Spark
> >>> >> > version 3.4.3.
> >>> >> >
> >>> >> > The vote is open until April 18th 1AM (PDT) and passes if a majority
> >>> >> > of +1 PMC votes are cast, with a minimum of 3 +1 votes.
> >>> >> >
> >>> >> > [ ] +1 Release this package as Apache Spark 3.4.3
> >>> >> > [ ] -1 Do not release this package because ...
> >>> >> >
> >>> >> > To learn more about Apache Spark, please see https://spark.apache.org/
> >>> >> >
> >>> >> > The tag to be voted on is v3.4.3-rc2 (commit
> >>> >> > 1eb558c3a6fbdd59e5a305bc3ab12ce748f6511f):
> >>> >> > https://github.com/apache/spark/tree/v3.4.3-rc2
> >>> >> >
> >>> >> > The release files, including signatures, digests, etc., can be found at:
> >>> >> > https://dist.apache.org/repos/dist/dev/spark/v3.4.3-rc2-bin/
> >>> >> >
> >>> >> > Signatures used for Spark RCs can be found in this file:
> >>> >> > https://dist.apache.org/repos/dist/dev/spark/KEYS
> >>> >> >
> >>> >> > The staging repository for this release can be found at:
> >>> >> > https://repository.apache.org/content/repositories/orgapachespark-1453/
> >>> >> >
> >>> >> > The documentation corresponding to this release can be found at:
> >>> >> > https://dist.apache.org/repos/dist/dev/spark/v3.4.3-rc2-docs/
> >>> >> >
> >>> >> > The list of bug fixes going into 3.4.3 can be found at the following URL:
> >>> >> > https://issues.apache.org/jira/projects/SPARK/versions/12353987
> >>> >> >
> >>> >> > This release is using the release script of the tag v3.4.3-rc2.
> >>> >> >
> >>> >> > FAQ
> >>> >> >
> >>> >> > =========================
> >>> >> > How can I help test this release?
> >>> >> > =========================
> >>> >> >
> >>> >> > If you are a Spark user, you can help us test this release by taking
> >>> >> > an existing Spark workload, running it on this release candidate, and
> >>> >> > reporting any regressions.
> >>> >> >
> >>> >> > If you're working in PySpark, you can set up a virtual env, install
> >>> >> > the current RC, and see if anything important breaks; in Java/Scala,
> >>> >> > you can add the staging repository to your project's resolvers and test
> >>> >> > with the RC (make sure to clean up the artifact cache before/after so
> >>> >> > you don't end up building with an out-of-date RC going forward).
> >>> >> >
> >>> >> > ===========================================================
> >>> >> > What should happen to JIRA tickets still targeting 3.4.3?
> >>> >> > ===========================================================
> >>> >> >
> >>> >> > The current list of open tickets targeted at 3.4.3 can be found at
> >>> >> > https://issues.apache.org/jira/projects/SPARK by searching for "Target
> >>> >> > Version/s" = 3.4.3.
> >>> >> >
> >>> >> > Committers should look at those and triage. Extremely important bug
> >>> >> > fixes, documentation, and API tweaks that impact compatibility should
> >>> >> > be worked on immediately. Everything else, please retarget to an
> >>> >> > appropriate release.
> >>> >> >
> >>> >> > ==========================
> >>> >> > But my bug isn't fixed?
> >>> >> > ==========================
> >>> >> >
> >>> >> > In order to make timely releases, we will typically not hold the
> >>> >> > release unless the bug in question is a regression from the previous
> >>> >> > release. That being said, if there is something that is a regression
> >>> >> > and has not been correctly targeted, please ping me or a committer to
> >>> >> > help target the issue.

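For readers following the PySpark testing route described in the vote email
above, a minimal smoke test is sketched below. It assumes the 3.4.3 RC has
already been installed into a fresh virtual env (for example, from the
binaries under the dist.apache.org URL in the vote email); the query and
assertions are illustrative, not part of any official checklist.

# Minimal RC smoke test: start a session, run a small aggregation,
# and verify the reported version.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("rc-smoke-test").getOrCreate()
assert spark.version == "3.4.3", f"unexpected Spark version: {spark.version}"

# 1000 ids split into 7 buckets: buckets 0-5 get 143 rows each, bucket 6 gets 142.
df = spark.range(1000).selectExpr("id", "id % 7 AS bucket")
counts = {r["bucket"]: r["count"] for r in df.groupBy("bucket").count().collect()}
assert sum(counts.values()) == 1000 and counts[6] == 142

print("Spark", spark.version, "basic DataFrame path looks sane")
spark.stop()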

Re: [VOTE] Release Spark 3.4.3 (RC2)

2024-04-16 Thread Kent Yao
+1(non-binding)

Thanks,
Kent Yao

bo yang  wrote on Wed, 17 Apr 2024 at 09:49:
>
> +1
>
> On Tue, Apr 16, 2024 at 1:38 PM Hyukjin Kwon  wrote:
>>
>> +1
>>
>> On Wed, Apr 17, 2024 at 3:57 AM L. C. Hsieh  wrote:
>>>
>>> +1
>>>
>>> On Tue, Apr 16, 2024 at 4:08 AM Wenchen Fan  wrote:
>>> >
>>> > +1
>>> >
>>> > On Mon, Apr 15, 2024 at 12:31 PM Dongjoon Hyun  wrote:
>>> >>
>>> >> I'll start with my +1.
>>> >>
>>> >> - Checked checksum and signature
>>> >> - Checked Scala/Java/R/Python/SQL Document's Spark version
>>> >> - Checked published Maven artifacts
>>> >> - All CIs passed.
>>> >>
>>> >> Thanks,
>>> >> Dongjoon.
>>> >>
>>> >> On 2024/04/15 04:22:26 Dongjoon Hyun wrote:
>>> >> > Please vote on releasing the following candidate as Apache Spark
>>> >> > version 3.4.3. [...]




Re: [VOTE] Release Spark 3.4.3 (RC2)

2024-04-16 Thread DB Tsai
+1

Sent from my iPhone

On Apr 16, 2024, at 3:11 PM, bo yang  wrote:
> +1
>
> On Tue, Apr 16, 2024 at 1:38 PM Hyukjin Kwon  wrote:
>> +1
>>
>> On Wed, Apr 17, 2024 at 3:57 AM L. C. Hsieh  wrote:
>>> +1
>>>
>>> On Tue, Apr 16, 2024 at 4:08 AM Wenchen Fan  wrote:
>>> >
>>> > +1
>>> >
>>> > On Mon, Apr 15, 2024 at 12:31 PM Dongjoon Hyun  wrote:
>>> >>
>>> >> I'll start with my +1.
>>> >>
>>> >> - Checked checksum and signature
>>> >> - Checked Scala/Java/R/Python/SQL Document's Spark version
>>> >> - Checked published Maven artifacts
>>> >> - All CIs passed.
>>> >>
>>> >> Thanks,
>>> >> Dongjoon.
>>> >>
>>> >> On 2024/04/15 04:22:26 Dongjoon Hyun wrote:
>>> >> > Please vote on releasing the following candidate as Apache Spark
>>> >> > version 3.4.3. [...]





Re: [VOTE] Release Spark 3.4.3 (RC2)

2024-04-16 Thread bo yang
+1

On Tue, Apr 16, 2024 at 1:38 PM Hyukjin Kwon  wrote:

> +1
>
> On Wed, Apr 17, 2024 at 3:57 AM L. C. Hsieh  wrote:
>
>> +1
>>
>> On Tue, Apr 16, 2024 at 4:08 AM Wenchen Fan  wrote:
>> >
>> > +1
>> >
>> > On Mon, Apr 15, 2024 at 12:31 PM Dongjoon Hyun  wrote:
>> >>
>> >> I'll start with my +1.
>> >>
>> >> - Checked checksum and signature
>> >> - Checked Scala/Java/R/Python/SQL Document's Spark version
>> >> - Checked published Maven artifacts
>> >> - All CIs passed.
>> >>
>> >> Thanks,
>> >> Dongjoon.
>> >>
>> >> On 2024/04/15 04:22:26 Dongjoon Hyun wrote:
>> >> > Please vote on releasing the following candidate as Apache Spark
>> >> > version 3.4.3. [...]


Configuration to disable file exists in DataSource

2024-04-16 Thread Romain Ardiet
Hi community,

When using DataFrameReader to read Parquet files located on S3, there is no
way to disable the file-existence checks done by the driver.

My use case is that I have a Spark job reading a list of S3 files generated
by an upstream job. This list can contain thousands of files.

The check is multi-threaded thanks to
https://issues.apache.org/jira/browse/SPARK-29089, but it is redundant in my
case, as the upstream job has already verified the files.

Would it make sense to add an option to control it?

Thanks,
Romain Ardiet
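
For concreteness, the pattern in question looks roughly like the sketch
below. The checkFilesExist option shown is hypothetical - no such public
option exists in Spark today; spark.sql.files.ignoreMissingFiles is a real
setting, but it only tolerates files that disappear during execution, after
the driver has already done its planning-time checks.

# Sketch of the use case: reading thousands of pre-verified S3 paths.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("read-verified-paths").getOrCreate()

# Assume the upstream job produced this manifest of files it already verified.
with open("upstream_manifest.txt") as f:
    paths = [line.strip() for line in f if line.strip()]

# Today: the driver still checks every path during planning
# (parallelized since SPARK-29089, but not skippable).
df = spark.read.parquet(*paths)

# Hypothetical shape of the proposed opt-out -- NOT an existing Spark option:
# df = spark.read.option("checkFilesExist", "false").parquet(*paths)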


Re: [VOTE] Release Spark 3.4.3 (RC2)

2024-04-16 Thread Hyukjin Kwon
+1

On Wed, Apr 17, 2024 at 3:57 AM L. C. Hsieh  wrote:

> +1
>
> On Tue, Apr 16, 2024 at 4:08 AM Wenchen Fan  wrote:
> >
> > +1
> >
> >> On Mon, Apr 15, 2024 at 12:31 PM Dongjoon Hyun  wrote:
> >>
> >> I'll start with my +1.
> >>
> >> - Checked checksum and signature
> >> - Checked Scala/Java/R/Python/SQL Document's Spark version
> >> - Checked published Maven artifacts
> >> - All CIs passed.
> >>
> >> Thanks,
> >> Dongjoon.
> >>
> >> On 2024/04/15 04:22:26 Dongjoon Hyun wrote:
> >> > Please vote on releasing the following candidate as Apache Spark
> >> > version 3.4.3. [...]


Re: Which version of spark version supports parquet version 2 ?

2024-04-16 Thread Mich Talebzadeh
Hi Prem,

Regrettably, this is not my area of speciality. I trust another colleague
will have a more informed idea. Alternatively, you may raise an SPIP for it:

Spark Project Improvement Proposals (SPIP) | Apache Spark
https://spark.apache.org/improvement-proposals.html

HTH

Mich Talebzadeh,
Technologist | Solutions Architect | Data Engineer | Generative AI
London
United Kingdom


On Tue, 16 Apr 2024 at 18:17, Prem Sahoo  wrote:

> Hello Mich,
> Thanks for the example.
> I have the same parquet-mr version, which creates Parquet version 1. We
> need to create V2 as it is more optimized. We have Dremio, where Parquet V2
> is 75% better than Parquet V1 for reads and 25% better for writes, so we
> are inclined to go this way. Please let us know why Spark is not moving
> towards Parquet V2?
> Sent from my iPhone
>
> On Apr 16, 2024, at 1:04 PM, Mich Talebzadeh  wrote:
>
> Well let us do a test in PySpark.
> [...]

Re: [VOTE] Release Spark 3.4.3 (RC2)

2024-04-16 Thread Gengliang Wang
+1

On Tue, Apr 16, 2024 at 11:57 AM L. C. Hsieh  wrote:

> +1
>
> On Tue, Apr 16, 2024 at 4:08 AM Wenchen Fan  wrote:
> >
> > +1
> >
> > On Mon, Apr 15, 2024 at 12:31 PM Dongjoon Hyun  wrote:
> >>
> >> I'll start with my +1.
> >>
> >> - Checked checksum and signature
> >> - Checked Scala/Java/R/Python/SQL Document's Spark version
> >> - Checked published Maven artifacts
> >> - All CIs passed.
> >>
> >> Thanks,
> >> Dongjoon.
> >>
> >> On 2024/04/15 04:22:26 Dongjoon Hyun wrote:
> >> > Please vote on releasing the following candidate as Apache Spark
> >> > version 3.4.3. [...]
>


Re: [VOTE] Release Spark 3.4.3 (RC2)

2024-04-16 Thread L. C. Hsieh
+1

On Tue, Apr 16, 2024 at 4:08 AM Wenchen Fan  wrote:
>
> +1
>
> On Mon, Apr 15, 2024 at 12:31 PM Dongjoon Hyun  wrote:
>>
>> I'll start with my +1.
>>
>> - Checked checksum and signature
>> - Checked Scala/Java/R/Python/SQL Document's Spark version
>> - Checked published Maven artifacts
>> - All CIs passed.
>>
>> Thanks,
>> Dongjoon.
>>
>> On 2024/04/15 04:22:26 Dongjoon Hyun wrote:
>> > Please vote on releasing the following candidate as Apache Spark
>> > version 3.4.3. [...]



Re: Which version of spark version supports parquet version 2 ?

2024-04-16 Thread Prem Sahoo
Hello Mich,
Thanks for the example.
I have the same parquet-mr version, which creates Parquet version 1. We need
to create V2 as it is more optimized. We have Dremio, where Parquet V2 is 75%
better than Parquet V1 for reads and 25% better for writes, so we are
inclined to go this way. Please let us know why Spark is not moving towards
Parquet V2?

Sent from my iPhone

On Apr 16, 2024, at 1:04 PM, Mich Talebzadeh  wrote:
> Well let us do a test in PySpark.
> [...]

Re: Which version of spark version supports parquet version 2 ?

2024-04-16 Thread Mich Talebzadeh
Well, let us do a test in PySpark.

Take this code and create a default parquet file. My Spark is 3.4.

cat parquet_check.py
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("ParquetVersionExample").getOrCreate()

data = [("London", 8974432), ("New York City", 8804348), ("Beijing",
21893000)]
df = spark.createDataFrame(data, ["city", "population"])

df.write.mode("overwrite").parquet("parquet_example")  # writes files into an HDFS directory

Use a tool called parquet-tools (downloadable using pip from
https://pypi.org/project/parquet-tools/)

Get the parquet files from HDFS to the current directory, say:

hdfs dfs -get /user/hduser/parquet_example .
cd ./parquet_example

Do an ls, pick up a part file like the one below, and inspect it:

parquet-tools inspect part-3-c33854c8-a8b6-4315-bf51-20198ce0ba62-c000.snappy.parquet

Now this is the output

 file meta data 
created_by: parquet-mr version 1.12.3 (build
f8dced182c4c1fbdec6ccb3185537b5a01e6ed6b)
num_columns: 2
num_rows: 1
num_row_groups: 1
format_version: 1.0
serialized_size: 563


 Columns 
name
age

 Column(name) 
name: name
path: name
max_definition_level: 1
max_repetition_level: 0
physical_type: BYTE_ARRAY
logical_type: String
converted_type (legacy): UTF8
compression: SNAPPY (space_saved: -5%)

 Column(age) 
name: age
path: age
max_definition_level: 1
max_repetition_level: 0
physical_type: INT64
logical_type: None
converted_type (legacy): NONE
compression: SNAPPY (space_saved: -5%)

File Information:

   - format_version: 1.0: This line explicitly states that the format
   version of the Parquet file is 1.0, which corresponds to Parquet version 1.
   - created_by: parquet-mr version 1.12.3: While this doesn't directly
   specify the format version, it is accepted that older versions of
   parquet-mr like 1.12.3 typically write Parquet version 1 files.

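As an aside, if parquet-tools is not installed, the same footer metadata can
be read programmatically with pyarrow (which parquet-tools itself builds on);
the file name below is just the example part file from above:

# Read the Parquet footer metadata directly with pyarrow.
import pyarrow.parquet as pq

md = pq.ParquetFile(
    "part-3-c33854c8-a8b6-4315-bf51-20198ce0ba62-c000.snappy.parquet").metadata
print("format_version:", md.format_version)  # "1.0" for Parquet version 1
print("created_by:", md.created_by)          # e.g. "parquet-mr version 1.12.3 (...)"
print("columns:", md.num_columns, "rows:", md.num_rows)
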
Since in this case Spark 3.4 is capable of reading both versions (1 and 2),
you don't necessarily need to modify your Spark code to access this file.
However, if you want to create Parquet files in version 2 using Spark, you
might need to consider additional changes, such as excluding parquet-mr or
upgrading the Parquet libraries and doing a custom build of Spark. Given the
law of diminishing returns, however, I would not advise that either. You can
of course use gzip compression if that is more suitable for your needs.

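One caveat to the paragraph above: parquet-mr itself exposes a writer-version
switch, parquet.writer.version, and Spark forwards per-write options into the
underlying Hadoop configuration, so a custom build may not be strictly
required. Whether a given Spark/parquet-mr combination actually honors it,
and whether v2 pages are safe for your downstream readers, should be verified
first (see also Ryan Blue's caution later in this digest); treat the sketch
below as an assumption to test, not a recommendation.

# Hedged sketch: request v2 data pages via parquet-mr's writer-version knob.
# "parquet.writer.version" is a parquet-mr (Hadoop) setting, not a Spark SQL conf.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("ParquetV2Sketch").getOrCreate()
df = spark.createDataFrame(
    [("London", 8974432), ("New York City", 8804348)], ["city", "population"])

(df.write
   .mode("overwrite")
   .option("parquet.writer.version", "v2")  # parquet-mr accepts "v1" / "v2"
   .parquet("parquet_v2_example"))

# Re-run the inspection above and check what format_version now reports.
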
HTH

Mich Talebzadeh,
Technologist | Solutions Architect | Data Engineer | Generative AI
London
United Kingdom


On Tue, 16 Apr 2024 at 15:00, Prem Sahoo  wrote:

> Hello Community,
> Could any of you shed some light on the questions below, please?
> Sent from my iPhone
>
> On Apr 15, 2024, at 9:02 PM, Prem Sahoo  wrote:
>
> 
> Any specific reason Spark does not support, or the community doesn't want
> to go to, Parquet V2, which is more optimized and whose reads and writes
> are much faster (from another component which I am using)?
>
> On Mon, Apr 15, 2024 at 7:55 PM Ryan Blue  wrote:
>
>> Spark will read data written with v2 encodings just fine. You just don't
>> need to worry about making Spark produce v2. And you should probably also
>> not produce v2 encodings from other systems.
>>
>> On Mon, Apr 15, 2024 at 4:37 PM Prem Sahoo  wrote:
>>
>>> Oops, so Spark does not support Parquet V2 at the moment? We have a use
>>> case where we need Parquet V2, as one of our components uses it.
>>>
>>> On Mon, Apr 15, 2024 at 7:09 PM Ryan Blue  wrote:
>>>
 Hi Prem,

 Parquet v1 is the default because v2 has not been finalized and adopted
 by the community. I highly recommend not using v2 encodings at this time.

 Ryan

 On Mon, Apr 15, 2024 at 3:05 PM Prem Sahoo 
 wrote:

> I am using Spark 3.2.0, but my Spark package comes with parquet-mr
> 1.2.1, which writes Parquet version 1, not version 2 :(. So I was
> looking at how to write Parquet version 2.
>
> On Mon, Apr 15, 2024 at 5:05 PM Mich Talebzadeh <
> mich.talebza...@gmail.com> wrote:
>
>> Sorry, you have a point there. It was released in version 3.0.0. What
>> version of Spark are you using?
>>
>> Technologist | Solutions Architect | Data Engineer | Generative AI
>> London
>> United Kingdom

Re: Which version of spark version supports parquet version 2 ?

2024-04-16 Thread Prem Sahoo
Hello Community,
Could any of you shed some light on the questions below, please?

Sent from my iPhone

On Apr 15, 2024, at 9:02 PM, Prem Sahoo  wrote:
> Any specific reason Spark does not support, or the community doesn't want
> to go to, Parquet V2, which is more optimized and whose reads and writes
> are much faster (from another component which I am using)?
>
> On Mon, Apr 15, 2024 at 7:55 PM Ryan Blue  wrote:
>> Spark will read data written with v2 encodings just fine. You just don't
>> need to worry about making Spark produce v2. And you should probably also
>> not produce v2 encodings from other systems.
>>
>> On Mon, Apr 15, 2024 at 4:37 PM Prem Sahoo  wrote:
>>> Oops, so Spark does not support Parquet V2 at the moment? We have a use
>>> case where we need Parquet V2, as one of our components uses it.
>>>
>>> On Mon, Apr 15, 2024 at 7:09 PM Ryan Blue  wrote:
>>>> Hi Prem,
>>>>
>>>> Parquet v1 is the default because v2 has not been finalized and adopted
>>>> by the community. I highly recommend not using v2 encodings at this
>>>> time.
>>>>
>>>> Ryan
>>>>
>>>> On Mon, Apr 15, 2024 at 3:05 PM Prem Sahoo  wrote:
>>>>> I am using Spark 3.2.0, but my Spark package comes with parquet-mr
>>>>> 1.2.1, which writes Parquet version 1, not version 2 :(. So I was
>>>>> looking at how to write Parquet version 2.
>>>>>
>>>>> On Mon, Apr 15, 2024 at 5:05 PM Mich Talebzadeh  wrote:
>>>>>> Sorry, you have a point there. It was released in version 3.0.0.
>>>>>> What version of Spark are you using?
>>>>>>
>>>>>> On Mon, 15 Apr 2024 at 21:33, Prem Sahoo  wrote:
>>>>>>> Thank you so much for the info! But do we have any release notes
>>>>>>> where it says Spark 2.4.0 onwards supports Parquet version 2? I was
>>>>>>> under the impression Spark 3.0 onwards started supporting it.
>>>>>>>
>>>>>>> On Mon, Apr 15, 2024 at 4:28 PM Mich Talebzadeh  wrote:
>>>>>>>> Well, if I am correct, Parquet version 2 support was introduced in
>>>>>>>> Spark version 2.4.0. Therefore, any version of Spark starting from
>>>>>>>> 2.4.0 supports Parquet version 2. Assuming that you are using
>>>>>>>> Spark version 2.4.0 or later, you should be able to take advantage
>>>>>>>> of Parquet version 2 features.
>>>>>>>>
>>>>>>>> HTH
>>>>>>>>
>>>>>>>> On Mon, 15 Apr 2024 at 20:53, Prem Sahoo  wrote:
>>>>>>>>> Thank you for the information! I can use any version of
>>>>>>>>> parquet-mr to produce a parquet file.
>>>>>>>>>
>>>>>>>>> Regarding the 2nd question: which version of Spark supports
>>>>>>>>> parquet version 2? May I get the release notes where parquet
>>>>>>>>> versions are mentioned?
>>>>>>>>>
>>>>>>>>> On Mon, Apr 15, 2024 at 2:34 PM Mich Talebzadeh  wrote:
>>>>>>>>>> Parquet-mr is a Java library that provides functionality for
>>>>>>>>>> working with Parquet files with Hadoop. It is therefore more
>>>>>>>>>> geared towards working with Parquet files within the Hadoop
>>>>>>>>>> ecosystem, particularly using MapReduce jobs. There is no
>>>>>>>>>> definitive way to check exact compatible versions within the
>>>>>>>>>> library itself. However, you can have a look at this:
>>>>>>>>>> https://github.com/apache/parquet-mr/blob/master/CHANGES.md
>>>>>>>>>>
>>>>>>>>>> HTH
>>>>>>>>>>
>>>>>>>>>> On Mon, 15 Apr 2024 at 18:59, Prem Sahoo  wrote:
>>>>>>>>>>> Hello Team,
>>>>>>>>>>> May I know how to check which version of parquet is supported
>>>>>>>>>>> by parquet-mr 1.2.1? Which version of parquet-mr supports
>>>>>>>>>>> parquet version 2 (V2)? Which version of Spark supports parquet
>>>>>>>>>>> version 2? May I get the release notes where parquet versions
>>>>>>>>>>> are mentioned?



Re: [VOTE] Release Spark 3.4.3 (RC2)

2024-04-16 Thread Wenchen Fan
+1

On Mon, Apr 15, 2024 at 12:31 PM Dongjoon Hyun  wrote:

> I'll start with my +1.
>
> - Checked checksum and signature
> - Checked Scala/Java/R/Python/SQL Document's Spark version
> - Checked published Maven artifacts
> - All CIs passed.
>
> Thanks,
> Dongjoon.
>
> On 2024/04/15 04:22:26 Dongjoon Hyun wrote:
> > Please vote on releasing the following candidate as Apache Spark
> > version 3.4.3. [...]
>