Re: Which version of spark version supports parquet version 2 ?

2024-04-18 Thread Prem Sahoo
Thanks for the below information.

Sent from my iPhone

On Apr 18, 2024, at 3:31 AM, Bjørn Jørgensen wrote:

"Release 24.3 of Dremio will continue to write Parquet V1, since an average
performance degradation of 1.5% was observed in writes and 6.5% was observed
in queries when TPC-DS data was written using Parquet V2 instead of Parquet
V1. The aforementioned query performance tests utilized the C3 cache to
store data."
(...)
"Users can enable Parquet V2 on write using the following configuration key.

ALTER SYSTEM SET "store.parquet.writer.version" = 'v2' "
https://www.dremio.com/blog/vectorized-reading-of-parquet-v2-improves-performance-up-to-75/

"Java Vector API support

The feature is experimental and is currently not part of the parquet
distribution. Parquet-MR has supported Java Vector API to speed up reading.
To enable this feature:
- Java 17+, 64-bit
- The CPU must support the instruction sets avx512vbmi and avx512_vbmi2
- To build the jars: mvn clean package -P vector-plugins

For Apache Spark to enable this feature:
- Build parquet and replace parquet-encoding-{VERSION}.jar in the Spark jars folder
- Build parquet-encoding-vector and copy parquet-encoding-vector-{VERSION}.jar to the Spark jars folder
- Edit Spark class VectorizedRleValuesReader, function readNextGroup, referring to parquet class ParquetReadRouter, function readBatchUsing512Vector
- Build Spark with Maven and replace spark-sql_2.12-{VERSION}.jar in the Spark jars folder"
https://github.com/apache/parquet-mr?tab=readme-ov-file#java-vector-api-support

You are using Spark 3.2.0. Spark 3.2.4 was released on April 13, 2023
(https://spark.apache.org/releases/spark-release-3-2-4.html).
You are using a Spark version that is EOL.

On Thu, 18 Apr 2024 at 00:25, Prem Sahoo wrote:

Hello Ryan,
May I know how you can write Parquet V2 encoding from Spark 3.2.0? As far as
I know, Dremio is creating and reading Parquet V2.
"Apache Parquet-MR Writer version PARQUET_2_0, which is widely adopted by
engines that write Parquet data, supports delta encodings. However, these
encodings were not previously supported by Dremio's vectorized Parquet
reader, resulting in decreased speed. Now, in version 24.3 and Dremio Cloud,
when you use the Dremio SQL query engine on Parquet datasets, you'll receive
best-in-class performance."
Could you let me know where the Parquet community is not recommending
Parquet V2?

On Wed, Apr 17, 2024 at 2:44 PM Ryan Blue wrote:

Prem, as I said earlier, v2 is not a finalized spec, so you should not use
it. That's why it is not the default. You can get Spark to write v2 files,
but it isn't recommended by the Parquet community.

On Wed, Apr 17, 2024 at 11:05 AM Prem Sahoo wrote:

Hello Community,
Could anyone shed more light on this (Spark supporting Parquet V2)?

On Tue, Apr 16, 2024 at 3:42 PM Mich Talebzadeh wrote:

Hi Prem,

Regrettably this is not my area of speciality. I trust another colleague
will have a more informed idea. Alternatively, you may raise an SPIP for it:
Spark Project Improvement Proposals (SPIP) | Apache Spark

HTH

Mich Talebzadeh,
Technologist | Solutions Architect | Data Engineer | Generative AI
London
United Kingdom

view my Linkedin profile
https://en.everybodywiki.com/Mich_Talebzadeh

Disclaimer: The information provided is correct to the best of my knowledge
but of course cannot be guaranteed. It is essential to note that, as with
any advice, quote "one test result is worth one-thousand expert opinions"
(Werner von Braun).

On Tue, 16 Apr 2024 at 18:17, Prem Sahoo wrote:

Hello Mich,
Thanks for the example.
I have the same parquet-mr version, which creates Parquet version 1. We need
to create V2 as it is more optimized. We have Dremio, where if we use
Parquet V2 it is 75% better than Parquet V1 for reads and 25% better for
writes, so we are inclined towards this way. Please let us know why Spark is
not going towards Parquet V2?
Sent from my iPhone

On Apr 16, 2024, at 1:04 PM, Mich Talebzadeh wrote:

Well, let us do a test in PySpark. Take this code and create a default
parquet file. My Spark is 3.4.

cat parquet_check.py
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("ParquetVersionExample").getOrCreate()

data = [("London", 8974432), ("New York City", 8804348), ("Beijing", 21893000)]
df = spark.createDataFrame(data, ["city", "population"])
df.write.mode("overwrite").parquet("parquet_example")  # creates files in an HDFS directory

Use a tool called parquet-tools (downloadable using pip from
https://pypi.org/project/parquet-tools/). Get the parquet files from HDFS
into the current directory, say:

hdfs dfs -get /user/hduser/parquet_example .
cd ./parquet_example

Do an ls, pick up a part file like the one below, and inspect it:

parquet-tools inspect part-3-c33854c8-a8b6-4315-bf51-20198ce0ba62-c000.snappy.parquet

Now this is the output file metadata:

created_by: parquet-mr version 1.12.3 (build
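A quicker, scriptable way to read the same footer metadata that
parquet-tools prints is pyarrow. This is a minimal sketch, assuming pyarrow
is installed (pip install pyarrow) and using the part-file name from the
example above:

import pyarrow.parquet as pq

# Read the footer of one part file (file name taken from the example above).
md = pq.ParquetFile(
    "parquet_example/part-3-c33854c8-a8b6-4315-bf51-20198ce0ba62-c000.snappy.parquet"
).metadata

print(md.created_by)  # e.g. "parquet-mr version 1.12.3 (build ...)"
# Column encodings hint at the writer version: V2 writers emit delta
# encodings such as DELTA_BINARY_PACKED, while a default (V1) Spark file
# typically shows PLAIN / RLE / dictionary encodings.
print(md.row_group(0).column(0).encodings)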

[ANNOUNCE] Apache Spark 3.4.3 released

2024-04-18 Thread Dongjoon Hyun
We are happy to announce the availability of Apache Spark 3.4.3!

Spark 3.4.3 is a maintenance release containing many fixes, including in
the security and correctness domains. This release is based on the
branch-3.4 maintenance branch of Spark. We strongly recommend that all
3.4 users upgrade to this stable release.

To download Spark 3.4.3, head over to the download page:
https://spark.apache.org/downloads.html

To view the release notes:
https://spark.apache.org/releases/spark-release-3-4-3.html
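For a quick look without a full distribution download, the release is also
published to PyPI. A minimal smoke test, assuming "pip install
pyspark==3.4.3" has been run first:

from pyspark.sql import SparkSession

# Start a tiny local session and confirm the installed version is 3.4.3.
spark = (SparkSession.builder
         .master("local[1]")
         .appName("verify-3.4.3")
         .getOrCreate())
assert spark.version == "3.4.3", spark.version
print("Spark", spark.version, "is working")
spark.stop()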

We would like to acknowledge all community members for contributing to this
release. This release would not have been possible without you.

Dongjoon Hyun


[VOTE][RESULT] Release Spark 3.4.3 (RC2)

2024-04-18 Thread Dongjoon Hyun
The vote passes with 10 +1s (8 binding +1s).
Thanks to all who helped with the release!

(* = binding)
+1:
- Dongjoon Hyun *
- Mridul Muralidharan *
- Wenchen Fan *
- Liang-Chi Hsieh *
- Gengliang Wang *
- Hyukjin Kwon *
- Bo Yang
- DB Tsai *
- Kent Yao
- Huaxin Gao *

+0: None

-1: None


Re: [VOTE] Release Spark 3.4.3 (RC2)

2024-04-18 Thread Dongjoon Hyun
This vote passed.

I'll conclude this vote.

Dongjoon

On 2024/04/17 03:11:36 huaxin gao wrote:
> +1
> 
> On Tue, Apr 16, 2024 at 6:55 PM Kent Yao  wrote:
> 
> > +1(non-binding)
> >
> > Thanks,
> > Kent Yao
> >
> > On Wed, Apr 17, 2024 at 09:49, bo yang wrote:
> > >
> > > +1
> > >
> > > On Tue, Apr 16, 2024 at 1:38 PM Hyukjin Kwon  wrote:
> > >>
> > >> +1
> > >>
> > >> On Wed, Apr 17, 2024 at 3:57 AM L. C. Hsieh  wrote:
> > >>>
> > >>> +1
> > >>>
> > >>> On Tue, Apr 16, 2024 at 4:08 AM Wenchen Fan  wrote:
> > >>> >
> > >>> > +1
> > >>> >
> > >>> > On Mon, Apr 15, 2024 at 12:31 PM Dongjoon Hyun  wrote:
> > >>> >>
> > >>> >> I'll start with my +1.
> > >>> >>
> > >>> >> - Checked checksum and signature
> > >>> >> - Checked Scala/Java/R/Python/SQL Document's Spark version
> > >>> >> - Checked published Maven artifacts
> > >>> >> - All CIs passed.
> > >>> >>
> > >>> >> Thanks,
> > >>> >> Dongjoon.
> > >>> >>
> > >>> >> On 2024/04/15 04:22:26 Dongjoon Hyun wrote:
> > >>> >> > Please vote on releasing the following candidate as Apache Spark
> > >>> >> > version 3.4.3.
> > >>> >> >
> > >>> >> > The vote is open until April 18th 1AM (PDT) and passes if a
> > >>> >> > majority +1 PMC votes are cast, with a minimum of 3 +1 votes.
> > >>> >> >
> > >>> >> > [ ] +1 Release this package as Apache Spark 3.4.3
> > >>> >> > [ ] -1 Do not release this package because ...
> > >>> >> >
> > >>> >> > To learn more about Apache Spark, please see
> > >>> >> > https://spark.apache.org/
> > >>> >> >
> > >>> >> > The tag to be voted on is v3.4.3-rc2 (commit
> > >>> >> > 1eb558c3a6fbdd59e5a305bc3ab12ce748f6511f)
> > >>> >> > https://github.com/apache/spark/tree/v3.4.3-rc2
> > >>> >> >
> > >>> >> > The release files, including signatures, digests, etc. can be
> > >>> >> > found at:
> > >>> >> > https://dist.apache.org/repos/dist/dev/spark/v3.4.3-rc2-bin/
> > >>> >> >
> > >>> >> > Signatures used for Spark RCs can be found in this file:
> > >>> >> > https://dist.apache.org/repos/dist/dev/spark/KEYS
> > >>> >> >
> > >>> >> > The staging repository for this release can be found at:
> > >>> >> > https://repository.apache.org/content/repositories/orgapachespark-1453/
> > >>> >> >
> > >>> >> > The documentation corresponding to this release can be found at:
> > >>> >> > https://dist.apache.org/repos/dist/dev/spark/v3.4.3-rc2-docs/
> > >>> >> >
> > >>> >> > The list of bug fixes going into 3.4.3 can be found at the
> > >>> >> > following URL:
> > >>> >> > https://issues.apache.org/jira/projects/SPARK/versions/12353987
> > >>> >> >
> > >>> >> > This release is using the release script of the tag v3.4.3-rc2.
> > >>> >> >
> > >>> >> > FAQ
> > >>> >> >
> > >>> >> > =========================
> > >>> >> > How can I help test this release?
> > >>> >> > =========================
> > >>> >> >
> > >>> >> > If you are a Spark user, you can help us test this release by
> > >>> >> > taking an existing Spark workload and running it on this release
> > >>> >> > candidate, then reporting any regressions.
> > >>> >> >
> > >>> >> > If you're working in PySpark, you can set up a virtual env,
> > >>> >> > install the current RC, and see if anything important breaks. In
> > >>> >> > the Java/Scala world, you can add the staging repository to your
> > >>> >> > project's resolvers and test with the RC (make sure to clean up
> > >>> >> > the artifact cache before/after so you don't end up building with
> > >>> >> > an out-of-date RC going forward).
> > >>> >> >
> > >>> >> > ===========================================
> > >>> >> > What should happen to JIRA tickets still targeting 3.4.3?
> > >>> >> > ===========================================
> > >>> >> >
> > >>> >> > The current list of open tickets targeted at 3.4.3 can be found
> > >>> >> > at https://issues.apache.org/jira/projects/SPARK by searching for
> > >>> >> > "Target Version/s" = 3.4.3.
> > >>> >> >
> > >>> >> > Committers should look at those and triage. Extremely important
> > >>> >> > bug fixes, documentation, and API tweaks that impact
> > >>> >> > compatibility should be worked on immediately. Everything else,
> > >>> >> > please retarget to an appropriate release.
> > >>> >> >
> > >>> >> > ==================
> > >>> >> > But my bug isn't fixed?
> > >>> >> > ==================
> > >>> >> >
> > >>> >> > In order to make timely releases, we will typically not hold the
> > >>> >> > release unless the bug in question is a regression from the
> > >>> >> > previous release. That being said, if there is something which is
> > >>> >> > a regression that has not been correctly targeted, please ping me
> > >>> >> > or a committer to help target the issue.
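The PySpark smoke test described in the quoted vote e-mail can be scripted.
A hedged sketch of the "virtual env + current RC" step; the pyspark tarball
name under the RC bin/ directory is an assumption, so check the directory
listing first:

import subprocess
import sys

# Hypothetical artifact name: verify it against
# https://dist.apache.org/repos/dist/dev/spark/v3.4.3-rc2-bin/
RC_URL = ("https://dist.apache.org/repos/dist/dev/spark/"
          "v3.4.3-rc2-bin/pyspark-3.4.3.tar.gz")

# Create an isolated virtual env, install the RC into it, and import it.
subprocess.run([sys.executable, "-m", "venv", "rc-test"], check=True)
subprocess.run(["rc-test/bin/pip", "install", RC_URL], check=True)
subprocess.run(
    ["rc-test/bin/python", "-c", "import pyspark; print(pyspark.__version__)"],
    check=True,
)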

Re: Which version of spark version supports parquet version 2 ?

2024-04-18 Thread Bjørn Jørgensen
" *Release 24.3 of Dremio will continue to write Parquet V1, since an
average performance degradation of 1.5% was observed in writes and 6.5% was
observed in queries when TPC-DS data was written using Parquet V2 instead
of Parquet V1.  The aforementioned query performance tests utilized the C3
cache to store data.*"
(...)
"*Users can enable Parquet V2 on write using the following configuration
key.*

ALTER SYSTEM SET "store.parquet.writer.version" = 'v2' "
https://www.dremio.com/blog/vectorized-reading-of-parquet-v2-improves-performance-up-to-75/

"*Java Vector API support*











*The feature is experimental and is currently not part of the parquet
distribution. Parquet-MR has supported Java Vector API to speed up reading,
to enable this feature:Java 17+, 64-bitRequiring the CPU to support
instruction sets:avx512vbmiavx512_vbmi2To build the jars: mvn clean package
-P vector-pluginsFor Apache Spark to enable this feature:Build parquet and
replace the parquet-encoding-{VERSION}.jar on the spark jars folderBuild
parquet-encoding-vector and copy parquet-encoding-vector-{VERSION}.jar to
the spark jars folderEdit spark class#VectorizedRleValuesReader,
function#readNextGroup refer to parquet class#ParquetReadRouter,
function#readBatchUsing512VectorBuild spark with maven and replace
spark-sql_2.12-{VERSION}.jar on the spark jars folder*"

https://github.com/apache/parquet-mr?tab=readme-ov-file#java-vector-api-support
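Before attempting the vector-plugins build above, it can help to confirm
the CPU actually exposes the required instruction sets. A minimal,
Linux-only sketch (assumes /proc/cpuinfo is available):

# Check for the AVX-512 flags the parquet-mr Java Vector API plugin needs.
required = {"avx512vbmi", "avx512_vbmi2"}

flags = set()
with open("/proc/cpuinfo") as f:
    for line in f:
        if line.startswith("flags"):
            flags.update(line.split(":", 1)[1].split())

missing = required - flags
print("OK to build" if not missing else "missing CPU flags: %s" % sorted(missing))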

You are using Spark 3.2.0. Spark 3.2.4 was released on April 13, 2023
(https://spark.apache.org/releases/spark-release-3-2-4.html).
You are using a Spark version that is EOL.
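For context on the question itself: Spark's bundled parquet-mr writer reads
the Hadoop configuration key parquet.writer.version, so a sketch like the
one below is the usual way people get Spark to emit V2 files. This is a
hedged example, not a recommendation (see Ryan's note below that v2 is not
a finalized spec), and the accepted value ("v2" vs "PARQUET_2_0") can
depend on the parquet-mr version in your build:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("ParquetV2WriteSketch").getOrCreate()

# parquet-mr reads this Hadoop conf key; "v2" selects the PARQUET_2_0
# writer. _jsc is an internal handle, so treat this as a sketch rather
# than a stable API.
spark.sparkContext._jsc.hadoopConfiguration().set("parquet.writer.version", "v2")

df = spark.createDataFrame([("London", 8974432)], ["city", "population"])
df.write.mode("overwrite").parquet("parquet_v2_example")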

On Thu, 18 Apr 2024 at 00:25, Prem Sahoo  wrote:

> Hello Ryan,
> May I know how you can write Parquet V2 encoding from Spark 3.2.0? As far
> as I know, Dremio is creating and reading Parquet V2.
> "Apache Parquet-MR Writer version PARQUET_2_0, which is widely adopted by
> engines that write Parquet data, supports delta encodings. However, these
> encodings were not previously supported by Dremio's vectorized Parquet
> reader, resulting in decreased speed. Now, in version 24.3 and Dremio
> Cloud, when you use the Dremio SQL query engine on Parquet datasets, you’ll
> receive best-in-class performance."
>
> Could you let me know where the Parquet community is not recommending
> Parquet V2?
>
>
>
> On Wed, Apr 17, 2024 at 2:44 PM Ryan Blue  wrote:
>
>> Prem, as I said earlier, v2 is not a finalized spec so you should not use
>> it. That's why it is not the default. You can get Spark to write v2 files,
>> but it isn't recommended by the Parquet community.
>>
>> On Wed, Apr 17, 2024 at 11:05 AM Prem Sahoo  wrote:
>>
>>> Hello Community,
>>> Could anyone shed more light on this (Spark Supporting Parquet V2)?
>>>
>>> On Tue, Apr 16, 2024 at 3:42 PM Mich Talebzadeh <
>>> mich.talebza...@gmail.com> wrote:
>>>
 Hi Prem,

 Regrettably this is not my area of speciality. I trust another colleague
 will have a more informed idea. Alternatively, you may raise an SPIP for
 it: Spark Project Improvement Proposals (SPIP) | Apache Spark

 HTH

 Mich Talebzadeh,
 Technologist | Solutions Architect | Data Engineer | Generative AI
 London
 United Kingdom

 view my Linkedin profile
 https://en.everybodywiki.com/Mich_Talebzadeh

 Disclaimer: The information provided is correct to the best of my
 knowledge but of course cannot be guaranteed. It is essential to note
 that, as with any advice, quote "one test result is worth one-thousand
 expert opinions" (Werner von Braun).


 On Tue, 16 Apr 2024 at 18:17, Prem Sahoo  wrote:

> Hello Mich,
> Thanks for the example.
> I have the same parquet-mr version, which creates Parquet version 1. We
> need to create V2 as it is more optimized. We have Dremio, where if we use
> Parquet V2 it is 75% better than Parquet V1 for reads and 25% better for
> writes, so we are inclined towards this way. Please let us know why Spark
> is not going towards Parquet V2?
> Sent from my iPhone
>
> On Apr 16, 2024, at 1:04 PM, Mich Talebzadeh <
> mich.talebza...@gmail.com> wrote:
>
> 
> Well, let us do a test in PySpark.
>
> Take this code and create a default parquet file. My Spark is 3.4.
>
> cat parquet_check.py
> from pyspark.sql import SparkSession
>
> spark =
> SparkSession.builder.appName("ParquetVersionExample").getOrCreate()
>
> data = [("London", 8974432), ("New York City", 8804348), ("Beijing",
> 21893000)]
> df = spark.createDataFrame(data, ["city", "population"])
>
>