Hi Prem,

Regarding your question about writing Parquet v2 with Spark 3.2.0:

Spark 3.2.0 Limitations: Spark 3.2.0 doesn't have a built-in way to
explicitly force Parquet v2 encoding. As we saw previously, even Spark 3.4
created a file whose metadata shows parquet-mr as the writer and
format_version 1.0, indicating v1 encoding.

Dremio v2 Support: As I understand it, Dremio versions 24.3 and later can
read Parquet v2 files with delta encodings.

Parquet v2 Status in Spark: As Ryan alluded to, Spark currently does not
support writing Parquet v2.

In the meantime, you can try excluding parquet-mr from your dependencies
and upgrading the Parquet library (if possible) to see whether that
indirectly enables v2 writing with Spark 3.2.0.

HTH

Mich Talebzadeh,
Technologist | Solutions Architect | Data Engineer  | Generative AI
London
United Kingdom


View my LinkedIn profile:
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>

https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* The information provided is correct to the best of my
knowledge but of course cannot be guaranteed. It is essential to note
that, as with any advice, "one test result is worth one-thousand expert
opinions" (Wernher von Braun
<https://en.wikipedia.org/wiki/Wernher_von_Braun>).


On Wed, 17 Apr 2024 at 20:20, Prem Sahoo <prem.re...@gmail.com> wrote:

> Hello Ryan,
> May I know how you can write Parquet V2 encoding from Spark 3.2.0? As far
> as I know, Dremio creates and reads Parquet V2.
> "Apache Parquet-MR Writer version PARQUET_2_0, which is widely adopted by
> engines that write Parquet data, supports delta encodings. However, these
> encodings were not previously supported by Dremio's vectorized Parquet
> reader, resulting in decreased speed. Now, in version 24.3 and Dremio
> Cloud, when you use the Dremio SQL query engine on Parquet datasets, you’ll
> receive best-in-class performance."
>
> Could you let me know where the Parquet community does not recommend
> Parquet V2?
>
>
>
> On Wed, Apr 17, 2024 at 2:44 PM Ryan Blue <b...@tabular.io> wrote:
>
>> Prem, as I said earlier, v2 is not a finalized spec so you should not use
>> it. That's why it is not the default. You can get Spark to write v2 files,
>> but it isn't recommended by the Parquet community.
>>
>> On Wed, Apr 17, 2024 at 11:05 AM Prem Sahoo <prem.re...@gmail.com> wrote:
>>
>>> Hello Community,
>>> Could anyone shed more light on this (Spark Supporting Parquet V2)?
>>>
>>> On Tue, Apr 16, 2024 at 3:42 PM Mich Talebzadeh <
>>> mich.talebza...@gmail.com> wrote:
>>>
>>>> Hi Prem,
>>>>
>>>> Regrettably this is not my area of speciality. I trust
>>>> another colleague will have a more informed idea. Alternatively you may
>>>> raise an SPIP for it.
>>>>
>>>> Spark Project Improvement Proposals (SPIP) | Apache Spark
>>>> <https://spark.apache.org/improvement-proposals.html>
>>>>
>>>> HTH
>>>>
>>>> Mich Talebzadeh
>>>>
>>>>
>>>> On Tue, 16 Apr 2024 at 18:17, Prem Sahoo <prem.re...@gmail.com> wrote:
>>>>
>>>>> Hello Mich,
>>>>> Thanks for the example.
>>>>> I have the same parquet-mr version, which creates Parquet version 1. We
>>>>> need to create V2 as it is more optimized. In Dremio, Parquet V2 is 75%
>>>>> better than Parquet V1 for reads and 25% better for writes, so we are
>>>>> inclined to go this way. Please let us know why Spark is not moving
>>>>> towards Parquet V2?
>>>>> Sent from my iPhone
>>>>>
>>>>> On Apr 16, 2024, at 1:04 PM, Mich Talebzadeh <
>>>>> mich.talebza...@gmail.com> wrote:
>>>>>
>>>>>
>>>>> Well let us do a test in PySpark.
>>>>>
>>>>> Take this code and create a default parquet file. My spark is 3.4
>>>>>
>>>>> cat parquet_check.py
>>>>> from pyspark.sql import SparkSession
>>>>>
>>>>> spark =
>>>>> SparkSession.builder.appName("ParquetVersionExample").getOrCreate()
>>>>>
>>>>> data = [("London", 8974432), ("New York City", 8804348), ("Beijing",
>>>>> 21893000)]
>>>>> df = spark.createDataFrame(data, ["city", "population"])
>>>>>
>>>>> df.write.mode("overwrite").parquet("parquet_example")  # creates
>>>>> files in an HDFS directory
>>>>>
>>>>> Use a tool called parquet-tools (downloadable using pip from
>>>>> https://pypi.org/project/parquet-tools/)
>>>>>
>>>>> Get the parquet files from hdfs to the current directory say
>>>>>
>>>>> hdfs dfs -get /user/hduser/parquet_example .
>>>>> cd ./parquet_example
>>>>> do an ls and pick one of the part files to inspect, for example:
>>>>> parquet-tools inspect
>>>>> part-00003-c33854c8-a8b6-4315-bf51-20198ce0ba62-c000.snappy.parquet
>>>>>
>>>>> Now this is the output
>>>>>
>>>>> ############ file meta data ############
>>>>> created_by: parquet-mr version 1.12.3 (build
>>>>> f8dced182c4c1fbdec6ccb3185537b5a01e6ed6b)
>>>>> num_columns: 2
>>>>> num_rows: 1
>>>>> num_row_groups: 1
>>>>> format_version: 1.0
>>>>> serialized_size: 563
>>>>>
>>>>>
>>>>> ############ Columns ############
>>>>> name
>>>>> age
>>>>>
>>>>> ############ Column(name) ############
>>>>> name: name
>>>>> path: name
>>>>> max_definition_level: 1
>>>>> max_repetition_level: 0
>>>>> physical_type: BYTE_ARRAY
>>>>> logical_type: String
>>>>> converted_type (legacy): UTF8
>>>>> compression: SNAPPY (space_saved: -5%)
>>>>>
>>>>> ############ Column(age) ############
>>>>> name: age
>>>>> path: age
>>>>> max_definition_level: 1
>>>>> max_repetition_level: 0
>>>>> physical_type: INT64
>>>>> logical_type: None
>>>>> converted_type (legacy): NONE
>>>>> compression: SNAPPY (space_saved: -5%)
>>>>>
>>>>> File Information:
>>>>>
>>>>>    - format_version: 1.0: This line explicitly states that the
>>>>>    format version of the Parquet file is 1.0, which corresponds to
>>>>>    Parquet version 1.
>>>>>    - created_by: parquet-mr version 1.12.3: While this doesn't
>>>>>    directly specify the format version, it is accepted that parquet-mr
>>>>>    versions like 1.12.3 write Parquet version 1 files by default.
>>>>>
>>>>> Since in this case Spark 3.4 is capable of reading both versions (1
>>>>> and 2), you don't necessarily need to modify your Spark code to access
>>>>> this file. However, if you want to create Parquet files in version 2
>>>>> using Spark, you might need to consider additional changes like
>>>>> excluding parquet-mr or upgrading the Parquet libraries and doing a
>>>>> custom build of Spark. Given the law of diminishing returns, I would
>>>>> not advise that either. You can of course use gzip compression if that
>>>>> is more suitable for your needs.
>>>>>
>>>>> HTH
>>>>>
>>>>> Mich Talebzadeh
>>>>>
>>>>>
>>>>> On Tue, 16 Apr 2024 at 15:00, Prem Sahoo <prem.re...@gmail.com> wrote:
>>>>>
>>>>>> Hello Community,
>>>>>> Could any of you shed some light on below questions please ?
>>>>>> Sent from my iPhone
>>>>>>
>>>>>> On Apr 15, 2024, at 9:02 PM, Prem Sahoo <prem.re...@gmail.com> wrote:
>>>>>>
>>>>>>
>>>>>> Is there a specific reason Spark does not support, or the community
>>>>>> doesn't want to move to, Parquet V2, which is more optimized and much
>>>>>> faster to read and write (from the other component I am using)?
>>>>>>
>>>>>> On Mon, Apr 15, 2024 at 7:55 PM Ryan Blue <b...@tabular.io> wrote:
>>>>>>
>>>>>>> Spark will read data written with v2 encodings just fine. You just
>>>>>>> don't need to worry about making Spark produce v2. And you should 
>>>>>>> probably
>>>>>>> also not produce v2 encodings from other systems.
>>>>>>>
>>>>>>> On Mon, Apr 15, 2024 at 4:37 PM Prem Sahoo <prem.re...@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Oops, so Spark does not support Parquet V2 at the moment? We have a
>>>>>>>> use case where we need Parquet V2, as one of our components uses
>>>>>>>> Parquet V2.
>>>>>>>>
>>>>>>>> On Mon, Apr 15, 2024 at 7:09 PM Ryan Blue <b...@tabular.io> wrote:
>>>>>>>>
>>>>>>>>> Hi Prem,
>>>>>>>>>
>>>>>>>>> Parquet v1 is the default because v2 has not been finalized and
>>>>>>>>> adopted by the community. I highly recommend not using v2 encodings 
>>>>>>>>> at this
>>>>>>>>> time.
>>>>>>>>>
>>>>>>>>> Ryan
>>>>>>>>>
>>>>>>>>> On Mon, Apr 15, 2024 at 3:05 PM Prem Sahoo <prem.re...@gmail.com>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> I am using Spark 3.2.0, but my Spark package comes with
>>>>>>>>>> parquet-mr 1.2.1, which writes Parquet version 1, not version 2.
>>>>>>>>>> So I was looking at how to write Parquet version 2.
>>>>>>>>>>
>>>>>>>>>> On Mon, Apr 15, 2024 at 5:05 PM Mich Talebzadeh <
>>>>>>>>>> mich.talebza...@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> Sorry, you have a point there. It was released in version 3.0.0.
>>>>>>>>>>> What version of Spark are you using?
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Mon, 15 Apr 2024 at 21:33, Prem Sahoo <prem.re...@gmail.com>
>>>>>>>>>>> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Thank you so much for the info! But do we have any release
>>>>>>>>>>>> notes saying that Spark 2.4.0 onwards supports Parquet version
>>>>>>>>>>>> 2? I was under the impression it started being supported from
>>>>>>>>>>>> Spark 3.0 onwards.
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On Mon, Apr 15, 2024 at 4:28 PM Mich Talebzadeh <
>>>>>>>>>>>> mich.talebza...@gmail.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Well if I am correct, Parquet version 2 support was introduced
>>>>>>>>>>>>> in Spark version 2.4.0. Therefore, any version of Spark starting 
>>>>>>>>>>>>> from 2.4.0
>>>>>>>>>>>>> supports Parquet version 2. Assuming that you are using Spark 
>>>>>>>>>>>>> version
>>>>>>>>>>>>> 2.4.0 or later, you should be able to take advantage of Parquet 
>>>>>>>>>>>>> version 2
>>>>>>>>>>>>> features.
>>>>>>>>>>>>>
>>>>>>>>>>>>> HTH
>>>>>>>>>>>>>
>>>>>>>>>>>>> Mich Talebzadeh
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Mon, 15 Apr 2024 at 20:53, Prem Sahoo <prem.re...@gmail.com>
>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thank you for the information!
>>>>>>>>>>>>>> I can use any version of parquet-mr to produce parquet file.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> regarding 2nd question .
>>>>>>>>>>>>>> Which version of spark is supporting parquet version 2?
>>>>>>>>>>>>>> May I get the release notes where parquet versions are
>>>>>>>>>>>>>> mentioned ?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Mon, Apr 15, 2024 at 2:34 PM Mich Talebzadeh <
>>>>>>>>>>>>>> mich.talebza...@gmail.com> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Parquet-mr is a Java library that provides functionality
>>>>>>>>>>>>>>> for working with Parquet files in Hadoop. It is therefore
>>>>>>>>>>>>>>> more geared towards working with Parquet files within the
>>>>>>>>>>>>>>> Hadoop ecosystem, particularly using MapReduce jobs. There is
>>>>>>>>>>>>>>> no definitive way to check exact compatible versions within
>>>>>>>>>>>>>>> the library itself. However, you can have a look at this:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> https://github.com/apache/parquet-mr/blob/master/CHANGES.md
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> HTH
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Mich Talebzadeh
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Mon, 15 Apr 2024 at 18:59, Prem Sahoo <
>>>>>>>>>>>>>>> prem.re...@gmail.com> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Hello Team,
>>>>>>>>>>>>>>>> May I know how to check which version of parquet is
>>>>>>>>>>>>>>>> supported by parquet-mr 1.2.1 ?
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Which version of parquet-mr is supporting parquet version 2
>>>>>>>>>>>>>>>> (V2) ?
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Which version of spark is supporting parquet version 2?
>>>>>>>>>>>>>>>> May I get the release notes where parquet versions are
>>>>>>>>>>>>>>>> mentioned ?
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> Ryan Blue
>>>>>>>>> Tabular
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> Ryan Blue
>>>>>>> Tabular
>>>>>>>
>>>>>>
>>
>> --
>> Ryan Blue
>> Tabular
>>
>
