Re: Which version of spark version supports parquet version 2 ?

2024-04-26 Thread Prem Sahoo
Confirmed, closing this. Thanks, everyone, for the valuable information.
Sent from my iPhone

> On Apr 25, 2024, at 9:55 AM, Prem Sahoo  wrote:
> 
> 
> Hello Spark,
> After discussing with the Parquet and PyArrow communities, we can use the
> below config so that Spark can write Parquet V2 files:
>
> hadoopConfiguration.set("parquet.writer.version", "v2")
>
> Parquet files created with this setting will then be V2 Parquet.
>
> Could you please confirm?
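[Editor's note] For readers finding this thread later, a minimal sketch of one way to pass that key through, assuming a spark-submit deployment (the job script name here is hypothetical). Spark copies `spark.hadoop.*`-prefixed properties into the Hadoop Configuration, which is where parquet-mr reads `parquet.writer.version`.

```shell
# Equivalent to hadoopConfiguration.set("parquet.writer.version", "v2"):
# any property prefixed with spark.hadoop. is forwarded into the Hadoop
# Configuration seen by the Parquet writer.
spark-submit \
  --conf spark.hadoop.parquet.writer.version=v2 \
  your_job.py   # hypothetical job script
```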


Re: Which version of spark version supports parquet version 2 ?

2024-04-25 Thread Prem Sahoo
Hello Spark,
After discussing with the Parquet and PyArrow communities, we can use the
below config so that Spark can write Parquet V2 files:

hadoopConfiguration.set("parquet.writer.version", "v2")

Parquet files created with this setting will then be V2 Parquet.

Could you please confirm?



Re: Which version of spark version supports parquet version 2 ?

2024-04-19 Thread Steve Loughran
Those are some quite good improvements - but committing to storing all your
data in an unstable format is, well, "bold". For temporary data as part of
a workflow, though, it could be appealing.

Now, assuming you are going to be working with S3, you might want to start
with merging PARQUET-2117 into your version, as it is delivering tangible
speedups through parallel GET of different range downloads, at least
according to our most recent test runs.

[image: Screenshot 2024-04-12 at 11.38.02 AM.png]

What would be interesting is to see how the two combine: v2 plus Java 21 AVX
processing of data, plus the 4x improvement in data retrieval (we limit the
number of active requests per stream to realistic numbers you can use in
production, FWIW).

See also: An Empirical Evaluation of Columnar Storage Formats
https://arxiv.org/abs/2304.05028
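[Editor's note] To illustrate the idea behind that speedup (not the actual PARQUET-2117 code): a vectored read issues several ranged GETs concurrently instead of reading one sequential stream. A stdlib-only sketch, with `fetch_range` standing in for an S3 ranged GET:

```python
from concurrent.futures import ThreadPoolExecutor

def fetch_range(blob: bytes, start: int, length: int) -> bytes:
    # Stand-in for an S3 ranged GET ("Range: bytes=start-end"); a real
    # reader would issue one HTTP request per byte range.
    return blob[start:start + length]

def read_ranges_parallel(blob: bytes, ranges, max_workers: int = 4):
    # Issue all range requests concurrently and return the results in the
    # order requested, mirroring the parallel-GET idea behind PARQUET-2117.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(fetch_range, blob, s, n) for s, n in ranges]
        return [f.result() for f in futures]

data = bytes(range(256))
chunks = read_ranges_parallel(data, [(0, 4), (100, 4), (200, 4)])
print(chunks)  # three 4-byte slices fetched concurrently
```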

On Thu, 18 Apr 2024 at 08:31, Bjørn Jørgensen 
wrote:

> " *Release 24.3 of Dremio will continue to write Parquet V1, since an
> average performance degradation of 1.5% was observed in writes and 6.5% was
> observed in queries when TPC-DS data was written using Parquet V2 instead
> of Parquet V1.  The aforementioned query performance tests utilized the C3
> cache to store data.*"
> (...)
> "*Users can enable Parquet V2 on write using the following configuration
> key.*
>
> ALTER SYSTEM SET "store.parquet.writer.version" = 'v2' "
>
> https://www.dremio.com/blog/vectorized-reading-of-parquet-v2-improves-performance-up-to-75/
>
> "Java Vector API support
>
> The feature is experimental and is currently not part of the parquet
> distribution. Parquet-MR has supported the Java Vector API to speed up
> reading. To enable this feature:
> - Java 17+, 64-bit
> - the CPU must support the instruction sets avx512vbmi and avx512_vbmi2
> - to build the jars: mvn clean package -P vector-plugins
> For Apache Spark to enable this feature:
> - build parquet and replace parquet-encoding-{VERSION}.jar in the Spark
>   jars folder
> - build parquet-encoding-vector and copy
>   parquet-encoding-vector-{VERSION}.jar to the Spark jars folder
> - edit Spark class VectorizedRleValuesReader, function readNextGroup,
>   referring to Parquet class ParquetReadRouter, function
>   readBatchUsing512Vector
> - build Spark with Maven and replace spark-sql_2.12-{VERSION}.jar in the
>   Spark jars folder"
>
>
> https://github.com/apache/parquet-mr?tab=readme-ov-file#java-vector-api-support
>
> You are using Spark 3.2.0.
> Spark 3.2.4 was released April 13, 2023:
> https://spark.apache.org/releases/spark-release-3-2-4.html
> You are using a Spark version that is EOL.
>
> On Thu, Apr 18, 2024 at 00:25, Prem Sahoo wrote:
>
>> Hello Ryan,
>> May I know how you can write Parquet V2 encoding from spark 3.2.0 ?  As
>> per my knowledge Dremio is creating and reading Parquet V2.
>> "Apache Parquet-MR Writer version PARQUET_2_0, which is widely adopted
>> by engines that write Parquet data, supports delta encodings. However,
>> these encodings were not previously supported by Dremio's vectorized
>> Parquet reader, resulting in decreased speed. Now, in version 24.3 and
>> Dremio Cloud, when you use the Dremio SQL query engine on Parquet datasets,
>> you’ll receive best-in-class performance."
>>
>> Could you let me know where Parquet Community is not recommending Parquet
>> V2 ?
>>
>>
>>
>> On Wed, Apr 17, 2024 at 2:44 PM Ryan Blue  wrote:
>>
>>> Prem, as I said earlier, v2 is not a finalized spec so you should not
>>> use it. That's why it is not the default. You can get Spark to write v2
>>> files, but it isn't recommended by the Parquet community.
>>>
>>> On Wed, Apr 17, 2024 at 11:05 AM Prem Sahoo 
>>> wrote:
>>>
 Hello Community,
 Could anyone shed more light on this (Spark Supporting Parquet V2)?

 On Tue, Apr 16, 2024 at 3:42 PM Mich Talebzadeh <
 mich.talebza...@gmail.com> wrote:

> Hi Prem,
>
> Regrettably this is not my area of speciality. I trust
> another colleague will have a more informed idea. Alternatively you may
> raise an SPIP for it.
>
> Spark Project Improvement Proposals (SPIP) | Apache Spark
> 
>
> HTH
>
> Mich Talebzadeh,
> Technologist | Solutions Architect | Data Engineer  | Generative AI
> London
> United Kingdom
>
>
>view my Linkedin profile
> 
>
>
>  https://en.everybodywiki.com/Mich_Talebzadeh
>
>
>
> *Disclaimer:* The information provided is correct to the best of my
> knowledge but of course cannot be guaranteed. It is essential to note
> that, as with any advice, "one test result is worth one-thousand
> expert opinions" (Wernher von Braun).
>
>
> On Tue, 16 Apr 2024 at 18:17, Prem Sahoo  wrote:
>
>> Hello Mich,
>> Thanks for 

Re: Which version of spark version supports parquet version 2 ?

2024-04-18 Thread Prem Sahoo
Thanks for the below information.
Sent from my iPhone

> On Apr 18, 2024, at 3:31 AM, Bjørn Jørgensen wrote:

Re: Which version of spark version supports parquet version 2 ?

2024-04-18 Thread Bjørn Jørgensen
" *Release 24.3 of Dremio will continue to write Parquet V1, since an
average performance degradation of 1.5% was observed in writes and 6.5% was
observed in queries when TPC-DS data was written using Parquet V2 instead
of Parquet V1.  The aforementioned query performance tests utilized the C3
cache to store data.*"
(...)
"*Users can enable Parquet V2 on write using the following configuration
key.*

ALTER SYSTEM SET "store.parquet.writer.version" = 'v2' "
https://www.dremio.com/blog/vectorized-reading-of-parquet-v2-improves-performance-up-to-75/

"Java Vector API support

The feature is experimental and is currently not part of the parquet
distribution. Parquet-MR has supported the Java Vector API to speed up
reading. To enable this feature:
- Java 17+, 64-bit
- the CPU must support the instruction sets avx512vbmi and avx512_vbmi2
- to build the jars: mvn clean package -P vector-plugins
For Apache Spark to enable this feature:
- build parquet and replace parquet-encoding-{VERSION}.jar in the Spark
  jars folder
- build parquet-encoding-vector and copy
  parquet-encoding-vector-{VERSION}.jar to the Spark jars folder
- edit Spark class VectorizedRleValuesReader, function readNextGroup,
  referring to Parquet class ParquetReadRouter, function
  readBatchUsing512Vector
- build Spark with Maven and replace spark-sql_2.12-{VERSION}.jar in the
  Spark jars folder"

https://github.com/apache/parquet-mr?tab=readme-ov-file#java-vector-api-support

You are using Spark 3.2.0.
Spark 3.2.4 was released April 13, 2023:
https://spark.apache.org/releases/spark-release-3-2-4.html
You are using a Spark version that is EOL.

On Thu, Apr 18, 2024 at 00:25, Prem Sahoo wrote:

> Hello Ryan,
> May I know how you can write Parquet V2 encoding from spark 3.2.0 ?  As
> per my knowledge Dremio is creating and reading Parquet V2.
> "Apache Parquet-MR Writer version PARQUET_2_0, which is widely adopted by
> engines that write Parquet data, supports delta encodings. However, these
> encodings were not previously supported by Dremio's vectorized Parquet
> reader, resulting in decreased speed. Now, in version 24.3 and Dremio
> Cloud, when you use the Dremio SQL query engine on Parquet datasets, you’ll
> receive best-in-class performance."
>
> Could you let me know where Parquet Community is not recommending Parquet
> V2 ?
>
>
>
> On Wed, Apr 17, 2024 at 2:44 PM Ryan Blue  wrote:
>
>> Prem, as I said earlier, v2 is not a finalized spec so you should not use
>> it. That's why it is not the default. You can get Spark to write v2 files,
>> but it isn't recommended by the Parquet community.
>>
>> On Wed, Apr 17, 2024 at 11:05 AM Prem Sahoo  wrote:
>>
>>> Hello Community,
>>> Could anyone shed more light on this (Spark Supporting Parquet V2)?
>>>
>>> On Tue, Apr 16, 2024 at 3:42 PM Mich Talebzadeh <
>>> mich.talebza...@gmail.com> wrote:
>>>
 Hi Prem,

 Regrettably this is not my area of speciality. I trust
 another colleague will have a more informed idea. Alternatively you may
 raise an SPIP for it.

 Spark Project Improvement Proposals (SPIP) | Apache Spark
 

 HTH



 On Tue, 16 Apr 2024 at 18:17, Prem Sahoo  wrote:

> Hello Mich,
> Thanks for example.
> I have the same parquet-mr version which creates Parquet version 1. We
> need to create V2 as it is more optimized. We have Dremio where if we use
> Parquet V2 it is 75% better than Parquet V1 in case of read and 25 % 
> better
> in case of write . so we are inclined towards this way.  Please let us 
> know
> why Spark is not going towards Parquet V2 ?
> Sent from my iPhone
>
> On Apr 16, 2024, at 1:04 PM, Mich Talebzadeh <
> mich.talebza...@gmail.com> wrote:
>
> 
> Well let us do a test in PySpark.
>
> Take this code and create a default parquet file. My spark is 3.4
>
> cat parquet_checxk.py
> from pyspark.sql import SparkSession
>
> spark =
> SparkSession.builder.appName("ParquetVersionExample").getOrCreate()
>
> data = [("London", 8974432), ("New York City", 8804348), ("Beijing",
> 21893000)]
> df = spark.createDataFrame(data, ["city", "population"])
>
> 

Re: Which version of spark version supports parquet version 2 ?

2024-04-17 Thread Mich Talebzadeh
Hi Prem,

Your question is about writing Parquet v2 with Spark 3.2.0.

Spark 3.2.0 limitations: Spark 3.2.0 doesn't have a built-in way to
explicitly force Parquet v2 encoding. As we saw previously, even Spark 3.4
created a file with format_version 1.0 written by parquet-mr, indicating v1
encoding.

Dremio v2 support: As I understand it, Dremio versions 24.3 and later can
read Parquet v2 files with delta encodings.

Parquet v2 status and Spark: As Ryan alluded to, Spark currently does not
support Parquet v2.

In the meantime, you can try excluding parquet-mr from your dependencies
and upgrading the parquet library (if possible) to see if it indirectly
enables v2 writing with Spark 3.2.0.

HTH



On Wed, 17 Apr 2024 at 20:20, Prem Sahoo  wrote:

> Hello Ryan,
> May I know how you can write Parquet V2 encoding from spark 3.2.0 ?  As
> per my knowledge Dremio is creating and reading Parquet V2.
> "Apache Parquet-MR Writer version PARQUET_2_0, which is widely adopted by
> engines that write Parquet data, supports delta encodings. However, these
> encodings were not previously supported by Dremio's vectorized Parquet
> reader, resulting in decreased speed. Now, in version 24.3 and Dremio
> Cloud, when you use the Dremio SQL query engine on Parquet datasets, you’ll
> receive best-in-class performance."
>
> Could you let me know where Parquet Community is not recommending Parquet
> V2 ?
>
>
>
> On Wed, Apr 17, 2024 at 2:44 PM Ryan Blue  wrote:
>
>> Prem, as I said earlier, v2 is not a finalized spec so you should not use
>> it. That's why it is not the default. You can get Spark to write v2 files,
>> but it isn't recommended by the Parquet community.
>>
>> On Wed, Apr 17, 2024 at 11:05 AM Prem Sahoo  wrote:
>>
>>> Hello Community,
>>> Could anyone shed more light on this (Spark Supporting Parquet V2)?
>>>
>>> On Tue, Apr 16, 2024 at 3:42 PM Mich Talebzadeh <
>>> mich.talebza...@gmail.com> wrote:
>>>
 Hi Prem,

 Regrettably this is not my area of speciality. I trust
 another colleague will have a more informed idea. Alternatively you may
 raise an SPIP for it.

 Spark Project Improvement Proposals (SPIP) | Apache Spark
 

 HTH



 On Tue, 16 Apr 2024 at 18:17, Prem Sahoo  wrote:

> Hello Mich,
> Thanks for example.
> I have the same parquet-mr version which creates Parquet version 1. We
> need to create V2 as it is more optimized. We have Dremio where if we use
> Parquet V2 it is 75% better than Parquet V1 in case of read and 25 % 
> better
> in case of write . so we are inclined towards this way.  Please let us 
> know
> why Spark is not going towards Parquet V2 ?
> Sent from my iPhone
>
> On Apr 16, 2024, at 1:04 PM, Mich Talebzadeh <
> mich.talebza...@gmail.com> wrote:
>
> 
> Well let us do a test in PySpark.
>
> Take this code and create a default parquet file. My spark is 3.4
>
> cat parquet_checxk.py
> from pyspark.sql import SparkSession
>
> spark =
> SparkSession.builder.appName("ParquetVersionExample").getOrCreate()
>
> data = [("London", 8974432), ("New York City", 8804348), ("Beijing",
> 21893000)]
> df = spark.createDataFrame(data, ["city", "population"])
>
> df.write.mode("overwrite").parquet("parquet_example")  # creates the
> files in an HDFS directory
>
> Use a tool called parquet-tools (downloadable using pip from
> https://pypi.org/project/parquet-tools/)
>
> Get the parquet files from hdfs to the current directory say
>
> hdfs dfs -get /user/hduser/parquet_example .
> cd ./parquet_example
> do an ls and 

Re: Which version of spark version supports parquet version 2 ?

2024-04-17 Thread Prem Sahoo
Hello Ryan,
May I know how you can write Parquet V2 encoding from Spark 3.2.0? As far
as I know, Dremio is creating and reading Parquet V2.
"Apache Parquet-MR Writer version PARQUET_2_0, which is widely adopted by
engines that write Parquet data, supports delta encodings. However, these
encodings were not previously supported by Dremio's vectorized Parquet
reader, resulting in decreased speed. Now, in version 24.3 and Dremio
Cloud, when you use the Dremio SQL query engine on Parquet datasets, you’ll
receive best-in-class performance."

Could you let me know where the Parquet community is not recommending
Parquet V2?



On Wed, Apr 17, 2024 at 2:44 PM Ryan Blue  wrote:

> Prem, as I said earlier, v2 is not a finalized spec so you should not use
> it. That's why it is not the default. You can get Spark to write v2 files,
> but it isn't recommended by the Parquet community.
>
> On Wed, Apr 17, 2024 at 11:05 AM Prem Sahoo  wrote:
>
>> Hello Community,
>> Could anyone shed more light on this (Spark Supporting Parquet V2)?
>>
>> On Tue, Apr 16, 2024 at 3:42 PM Mich Talebzadeh <
>> mich.talebza...@gmail.com> wrote:
>>
>>> Hi Prem,
>>>
>>> Regrettably this is not my area of speciality. I trust another colleague
>>> will have a more informed idea. Alternatively you may raise an SPIP for it.
>>>
>>> Spark Project Improvement Proposals (SPIP) | Apache Spark
>>> 
>>>
>>> HTH
>>>
>>>
>>>
>>> On Tue, 16 Apr 2024 at 18:17, Prem Sahoo  wrote:
>>>
 Hello Mich,
 Thanks for example.
 I have the same parquet-mr version which creates Parquet version 1. We
 need to create V2 as it is more optimized. We have Dremio where if we use
 Parquet V2 it is 75% better than Parquet V1 in case of read and 25 % better
 in case of write . so we are inclined towards this way.  Please let us know
 why Spark is not going towards Parquet V2 ?
 Sent from my iPhone

 On Apr 16, 2024, at 1:04 PM, Mich Talebzadeh 
 wrote:

 
 Well let us do a test in PySpark.

 Take this code and create a default parquet file. My spark is 3.4

 cat parquet_checxk.py
 from pyspark.sql import SparkSession

 spark =
 SparkSession.builder.appName("ParquetVersionExample").getOrCreate()

 data = [("London", 8974432), ("New York City", 8804348), ("Beijing",
 21893000)]
 df = spark.createDataFrame(data, ["city", "population"])

 df.write.mode("overwrite").parquet("parquet_example")  # creates the
 files in an HDFS directory

 Use a tool called parquet-tools (downloadable using pip from
 https://pypi.org/project/parquet-tools/)

 Get the parquet files from hdfs to the current directory say

 hdfs dfs -get /user/hduser/parquet_example .
 cd ./parquet_example
 do an ls and pickup file 3 like below to inspect
  parquet-tools inspect
 part-3-c33854c8-a8b6-4315-bf51-20198ce0ba62-c000.snappy.parquet

 Now this is the output

  file meta data 
 created_by: parquet-mr version 1.12.3 (build
 f8dced182c4c1fbdec6ccb3185537b5a01e6ed6b)
 num_columns: 2
 num_rows: 1
 num_row_groups: 1
 format_version: 1.0
 serialized_size: 563


  Columns 
 name
 age

  Column(name) 
 name: name
 path: name
 max_definition_level: 1
 max_repetition_level: 0
 physical_type: BYTE_ARRAY
 logical_type: String
 converted_type (legacy): UTF8
 compression: SNAPPY (space_saved: -5%)

  Column(age) 
 name: age
 path: age
 max_definition_level: 1
 max_repetition_level: 0
 physical_type: INT64
 logical_type: None
 converted_type (legacy): NONE
 compression: SNAPPY (space_saved: -5%)

 File Information:

- format_version: 1.0: This line explicitly states that the format
version of the Parquet file is 1.0, which corresponds to Parquet 
 version 1.
- created_by: parquet-mr version 1.12.3: While this doesn't
directly specify the format version, it is accepted that older 
 versions of
parquet-mr like 1.12.3 typically write Parquet version 1 files.

 Since 

Re: Which version of spark version supports parquet version 2 ?

2024-04-17 Thread Ryan Blue
Prem, as I said earlier, v2 is not a finalized spec so you should not use
it. That's why it is not the default. You can get Spark to write v2 files,
but it isn't recommended by the Parquet community.

On Wed, Apr 17, 2024 at 11:05 AM Prem Sahoo  wrote:

> Hello Community,
> Could anyone shed more light on this (Spark Supporting Parquet V2)?
>
> On Tue, Apr 16, 2024 at 3:42 PM Mich Talebzadeh 
> wrote:
>
>> Hi Prem,
>>
>> Regrettably this is not my area of speciality. I trust another colleague
>> will have a more informed idea. Alternatively you may raise an SPIP for it.
>>
>> Spark Project Improvement Proposals (SPIP) | Apache Spark
>> 
>>
>> HTH
>>
>>
>>
>> On Tue, 16 Apr 2024 at 18:17, Prem Sahoo  wrote:
>>
>>> Hello Mich,
>>> Thanks for example.
>>> I have the same parquet-mr version which creates Parquet version 1. We
>>> need to create V2 as it is more optimized. We have Dremio where if we use
>>> Parquet V2 it is 75% better than Parquet V1 in case of read and 25 % better
>>> in case of write . so we are inclined towards this way.  Please let us know
>>> why Spark is not going towards Parquet V2 ?
>>> Sent from my iPhone
>>>
>>> On Apr 16, 2024, at 1:04 PM, Mich Talebzadeh 
>>> wrote:
>>>
>>> 
>>> Well let us do a test in PySpark.
>>>
>>> Take this code and create a default parquet file. My spark is 3.4
>>>
>>> cat parquet_checxk.py
>>> from pyspark.sql import SparkSession
>>>
>>> spark =
>>> SparkSession.builder.appName("ParquetVersionExample").getOrCreate()
>>>
>>> data = [("London", 8974432), ("New York City", 8804348), ("Beijing",
>>> 21893000)]
>>> df = spark.createDataFrame(data, ["city", "population"])
>>>
>>> df.write.mode("overwrite").parquet("parquet_example")  # creates files
>>> in an HDFS directory
>>>
>>> Use a tool called parquet-tools (downloadable using pip from
>>> https://pypi.org/project/parquet-tools/)
>>>
>>> Get the parquet files from hdfs to the current directory say
>>>
>>> hdfs dfs -get /user/hduser/parquet_example .
>>> cd ./parquet_example
>>> do an ls and pickup file 3 like below to inspect
>>>  parquet-tools inspect
>>> part-3-c33854c8-a8b6-4315-bf51-20198ce0ba62-c000.snappy.parquet
>>>
>>> Now this is the output
>>>
>>>  file meta data 
>>> created_by: parquet-mr version 1.12.3 (build
>>> f8dced182c4c1fbdec6ccb3185537b5a01e6ed6b)
>>> num_columns: 2
>>> num_rows: 1
>>> num_row_groups: 1
>>> format_version: 1.0
>>> serialized_size: 563
>>>
>>>
>>>  Columns 
>>> name
>>> age
>>>
>>>  Column(name) 
>>> name: name
>>> path: name
>>> max_definition_level: 1
>>> max_repetition_level: 0
>>> physical_type: BYTE_ARRAY
>>> logical_type: String
>>> converted_type (legacy): UTF8
>>> compression: SNAPPY (space_saved: -5%)
>>>
>>>  Column(age) 
>>> name: age
>>> path: age
>>> max_definition_level: 1
>>> max_repetition_level: 0
>>> physical_type: INT64
>>> logical_type: None
>>> converted_type (legacy): NONE
>>> compression: SNAPPY (space_saved: -5%)
>>>
>>> File Information:
>>>
>>>- format_version: 1.0: This line explicitly states that the format
>>>version of the Parquet file is 1.0, which corresponds to Parquet version 
>>> 1.
>>>- created_by: parquet-mr version 1.12.3: While this doesn't directly
>>>specify the format version, it is accepted that older versions of
>>>parquet-mr like 1.12.3 typically write Parquet version 1 files.
>>>
>>> Since in this case Spark 3.4 is capable of reading both versions (1 and
>>> 2), you don't necessarily need to modify your Spark code to access this
>>> file. However, if you want to create Parquet files in version 2 using
>>> Spark, you might need to consider additional changes like excluding
>>> parquet-mr or upgrading the Parquet libraries and doing a custom build of
>>> Spark. However, given the law of diminishing returns, I would not advise
>>> that either. You can of course use gzip compression if that is more
>>> suitable for your needs.
>>>
>>> HTH
>>>

Re: Which version of spark version supports parquet version 2 ?

2024-04-17 Thread Prem Sahoo
Hello Community,
Could anyone shed more light on this (Spark Supporting Parquet V2)?

On Tue, Apr 16, 2024 at 3:42 PM Mich Talebzadeh 
wrote:

> Hi Prem,
>
> Regrettably this is not my area of speciality. I trust another colleague
> will have a more informed idea. Alternatively you may raise an SPIP for it.
>
> Spark Project Improvement Proposals (SPIP) | Apache Spark
> 
>
> HTH
>
>
>
> On Tue, 16 Apr 2024 at 18:17, Prem Sahoo  wrote:
>
>> Hello Mich,
>> Thanks for example.
>> I have the same parquet-mr version which creates Parquet version 1. We
>> need to create V2 as it is more optimized. We have Dremio where if we use
>> Parquet V2 it is 75% better than Parquet V1 in case of read and 25 % better
>> in case of write . so we are inclined towards this way.  Please let us know
>> why Spark is not going towards Parquet V2 ?
>> Sent from my iPhone
>>
>> On Apr 16, 2024, at 1:04 PM, Mich Talebzadeh 
>> wrote:
>>
>> 
>> Well let us do a test in PySpark.
>>
>> Take this code and create a default parquet file. My spark is 3.4
>>
>> cat parquet_checxk.py
>> from pyspark.sql import SparkSession
>>
>> spark =
>> SparkSession.builder.appName("ParquetVersionExample").getOrCreate()
>>
>> data = [("London", 8974432), ("New York City", 8804348), ("Beijing",
>> 21893000)]
>> df = spark.createDataFrame(data, ["city", "population"])
>>
>> df.write.mode("overwrite").parquet("parquet_example")  # creates files
>> in an HDFS directory
>>
>> Use a tool called parquet-tools (downloadable using pip from
>> https://pypi.org/project/parquet-tools/)
>>
>> Get the parquet files from hdfs to the current directory say
>>
>> hdfs dfs -get /user/hduser/parquet_example .
>> cd ./parquet_example
>> do an ls and pickup file 3 like below to inspect
>>  parquet-tools inspect
>> part-3-c33854c8-a8b6-4315-bf51-20198ce0ba62-c000.snappy.parquet
>>
>> Now this is the output
>>
>>  file meta data 
>> created_by: parquet-mr version 1.12.3 (build
>> f8dced182c4c1fbdec6ccb3185537b5a01e6ed6b)
>> num_columns: 2
>> num_rows: 1
>> num_row_groups: 1
>> format_version: 1.0
>> serialized_size: 563
>>
>>
>>  Columns 
>> name
>> age
>>
>>  Column(name) 
>> name: name
>> path: name
>> max_definition_level: 1
>> max_repetition_level: 0
>> physical_type: BYTE_ARRAY
>> logical_type: String
>> converted_type (legacy): UTF8
>> compression: SNAPPY (space_saved: -5%)
>>
>>  Column(age) 
>> name: age
>> path: age
>> max_definition_level: 1
>> max_repetition_level: 0
>> physical_type: INT64
>> logical_type: None
>> converted_type (legacy): NONE
>> compression: SNAPPY (space_saved: -5%)
>>
>> File Information:
>>
>>- format_version: 1.0: This line explicitly states that the format
>>version of the Parquet file is 1.0, which corresponds to Parquet version 1.
>>- created_by: parquet-mr version 1.12.3: While this doesn't directly
>>specify the format version, it is accepted that older versions of
>>parquet-mr like 1.12.3 typically write Parquet version 1 files.
>>
>> Since in this case Spark 3.4 is capable of reading both versions (1 and
>> 2), you don't necessarily need to modify your Spark code to access this
>> file. However, if you want to create Parquet files in version 2 using
>> Spark, you might need to consider additional changes like excluding
>> parquet-mr or upgrading Parquet libraries and doing a custom build of Spark.
>> However, given the law of diminishing returns, I would not advise that
>> either. You can of course use gzip for compression if that is more
>> suitable for your needs.
>>
>> HTH
>>
>> Mich Talebzadeh,
>> Technologist | Solutions Architect | Data Engineer  | Generative AI
>> London
>> United Kingdom
>>
>>
>>view my Linkedin profile
>> 
>>
>>
>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>
>>
>>
>> *Disclaimer:* The information provided is correct to the best of my
>> knowledge but of course cannot be guaranteed . It is essential to note
>> that, as with any advice, quote "one test result is worth one-thousand
>> expert opinions (Werner
>> Von Braun
>> )".
>>
>>
>> On Tue, 16 Apr 2024 at 15:00, 

Re: Which version of spark version supports parquet version 2 ?

2024-04-16 Thread Mich Talebzadeh
Hi Prem,

Regrettably this is not my area of speciality. I trust another colleague
will have a more informed idea. Alternatively you may raise an SPIP for it.

Spark Project Improvement Proposals (SPIP) | Apache Spark


HTH

Mich Talebzadeh,
Technologist | Solutions Architect | Data Engineer  | Generative AI
London
United Kingdom


   view my Linkedin profile



 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* The information provided is correct to the best of my
knowledge but of course cannot be guaranteed . It is essential to note
that, as with any advice, quote "one test result is worth one-thousand
expert opinions (Werner  Von
Braun )".


On Tue, 16 Apr 2024 at 18:17, Prem Sahoo  wrote:

> Hello Mich,
> Thanks for example.
> I have the same parquet-mr version which creates Parquet version 1. We
> need to create V2 as it is more optimized. We have Dremio where if we use
> Parquet V2 it is 75% better than Parquet V1 in case of read and 25 % better
> in case of write . so we are inclined towards this way.  Please let us know
> why Spark is not going towards Parquet V2 ?
> Sent from my iPhone
>
> On Apr 16, 2024, at 1:04 PM, Mich Talebzadeh 
> wrote:
>
> 
> Well let us do a test in PySpark.
>
> Take this code and create a default parquet file. My spark is 3.4
>
> cat parquet_checxk.py
> from pyspark.sql import SparkSession
>
> spark = SparkSession.builder.appName("ParquetVersionExample").getOrCreate()
>
> data = [("London", 8974432), ("New York City", 8804348), ("Beijing",
> 21893000)]
> df = spark.createDataFrame(data, ["city", "population"])
>
> df.write.mode("overwrite").parquet("parquet_example")  # creates files
> in the hdfs directory
>
> Use a tool called parquet-tools (downloadable using pip from
> https://pypi.org/project/parquet-tools/)
>
> Get the parquet files from hdfs to the current directory say
>
> hdfs dfs -get /user/hduser/parquet_example .
> cd ./parquet_example
> do an ls and pick up file 3, as below, to inspect
>  parquet-tools inspect
> part-3-c33854c8-a8b6-4315-bf51-20198ce0ba62-c000.snappy.parquet
>
> Now this is the output
>
>  file meta data 
> created_by: parquet-mr version 1.12.3 (build
> f8dced182c4c1fbdec6ccb3185537b5a01e6ed6b)
> num_columns: 2
> num_rows: 1
> num_row_groups: 1
> format_version: 1.0
> serialized_size: 563
>
>
>  Columns 
> name
> age
>
>  Column(name) 
> name: name
> path: name
> max_definition_level: 1
> max_repetition_level: 0
> physical_type: BYTE_ARRAY
> logical_type: String
> converted_type (legacy): UTF8
> compression: SNAPPY (space_saved: -5%)
>
>  Column(age) 
> name: age
> path: age
> max_definition_level: 1
> max_repetition_level: 0
> physical_type: INT64
> logical_type: None
> converted_type (legacy): NONE
> compression: SNAPPY (space_saved: -5%)
>
> File Information:
>
>- format_version: 1.0: This line explicitly states that the format
>version of the Parquet file is 1.0, which corresponds to Parquet version 1.
>- created_by: parquet-mr version 1.12.3: While this doesn't directly
>specify the format version, it is accepted that older versions of
>parquet-mr like 1.12.3 typically write Parquet version 1 files.
>
> Since in this case Spark 3.4 is capable of reading both versions (1 and
> 2), you don't necessarily need to modify your Spark code to access this
> file. However, if you want to create Parquet files in version 2 using
> Spark, you might need to consider additional changes like excluding
> parquet-mr or upgrading Parquet libraries and doing a custom build of Spark.
> However, given the law of diminishing returns, I would not advise that
> either. You can of course use gzip for compression if that is more
> suitable for your needs.
>
> HTH
>
> Mich Talebzadeh,
> Technologist | Solutions Architect | Data Engineer  | Generative AI
> London
> United Kingdom
>
>
>view my Linkedin profile
> 
>
>
>  https://en.everybodywiki.com/Mich_Talebzadeh
>
>
>
> *Disclaimer:* The information provided is correct to the best of my
> knowledge but of course cannot be guaranteed . It is essential to note
> that, as with any advice, quote "one test result is worth one-thousand
> expert opinions (Werner  Von
> Braun )".
>
>
> On Tue, 16 Apr 2024 at 15:00, Prem Sahoo  wrote:
>
>> Hello Community,
>> Could any of you shed some light on the below questions please?
>> Sent from my iPhone
>>
>> On Apr 15, 2024, at 9:02 PM, Prem Sahoo  wrote:
>>
>> 
>> Any specific reason spark does not support or community doesn't want to
>> go to Parquet V2 , which is more optimized and read and 

Re: Which version of spark version supports parquet version 2 ?

2024-04-16 Thread Prem Sahoo
Hello Mich,
Thanks for the example.
I have the same parquet-mr version, which creates Parquet version 1. We need
to create V2 as it is more optimized. We have Dremio where, if we use Parquet
V2, it is 75% better than Parquet V1 for reads and 25% better for writes, so
we are inclined towards this way. Please let us know why Spark is not going
towards Parquet V2?
Sent from my iPhone

On Apr 16, 2024, at 1:04 PM, Mich Talebzadeh wrote:
[quoted text elided; the same "Well let us do a test in PySpark" message
appears in plain text elsewhere in this thread]

Re: Which version of spark version supports parquet version 2 ?

2024-04-16 Thread Mich Talebzadeh
Well let us do a test in PySpark.

Take this code and create a default parquet file. My spark is 3.4

cat parquet_checxk.py
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("ParquetVersionExample").getOrCreate()

data = [("London", 8974432), ("New York City", 8804348), ("Beijing",
21893000)]
df = spark.createDataFrame(data, ["city", "population"])

df.write.mode("overwrite").parquet("parquet_example")  # creates files in
the hdfs directory

Use a tool called parquet-tools (downloadable using pip from
https://pypi.org/project/parquet-tools/)

Get the parquet files from hdfs to the current directory say

hdfs dfs -get /user/hduser/parquet_example .
cd ./parquet_example
do an ls and pick up file 3, as below, to inspect
 parquet-tools inspect
part-3-c33854c8-a8b6-4315-bf51-20198ce0ba62-c000.snappy.parquet

Now this is the output

 file meta data 
created_by: parquet-mr version 1.12.3 (build
f8dced182c4c1fbdec6ccb3185537b5a01e6ed6b)
num_columns: 2
num_rows: 1
num_row_groups: 1
format_version: 1.0
serialized_size: 563


 Columns 
name
age

 Column(name) 
name: name
path: name
max_definition_level: 1
max_repetition_level: 0
physical_type: BYTE_ARRAY
logical_type: String
converted_type (legacy): UTF8
compression: SNAPPY (space_saved: -5%)

 Column(age) 
name: age
path: age
max_definition_level: 1
max_repetition_level: 0
physical_type: INT64
logical_type: None
converted_type (legacy): NONE
compression: SNAPPY (space_saved: -5%)

File Information:

   - format_version: 1.0: This line explicitly states that the format
   version of the Parquet file is 1.0, which corresponds to Parquet version 1.
   - created_by: parquet-mr version 1.12.3: While this doesn't directly
   specify the format version, it is accepted that older versions of
   parquet-mr like 1.12.3 typically write Parquet version 1 files.

Since in this case Spark 3.4 is capable of reading both versions (1 and 2),
you don't necessarily need to modify your Spark code to access this file.
However, if you want to create Parquet files in version 2 using Spark, you
might need to consider additional changes like excluding parquet-mr or
upgrading Parquet libraries and doing a custom build of Spark. However, given
the law of diminishing returns, I would not advise that either. You can of
course use gzip for compression if that is more suitable for your needs.

HTH

Mich Talebzadeh,
Technologist | Solutions Architect | Data Engineer  | Generative AI
London
United Kingdom


   view my Linkedin profile



 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* The information provided is correct to the best of my
knowledge but of course cannot be guaranteed . It is essential to note
that, as with any advice, quote "one test result is worth one-thousand
expert opinions (Werner  Von
Braun )".


On Tue, 16 Apr 2024 at 15:00, Prem Sahoo  wrote:

> Hello Community,
> Could any of you shed some light on the below questions please?
> Sent from my iPhone
>
> On Apr 15, 2024, at 9:02 PM, Prem Sahoo  wrote:
>
> 
> Any specific reason Spark does not support, or the community doesn't want to
> go to, Parquet V2, which is more optimized and whose reads and writes are
> much faster (from another component which I am using)?
>
> On Mon, Apr 15, 2024 at 7:55 PM Ryan Blue  wrote:
>
>> Spark will read data written with v2 encodings just fine. You just don't
>> need to worry about making Spark produce v2. And you should probably also
>> not produce v2 encodings from other systems.
>>
>> On Mon, Apr 15, 2024 at 4:37 PM Prem Sahoo  wrote:
>>
>>> Oops, so Spark does not support Parquet V2 at the moment? We have a use
>>> case where we need Parquet V2, as one of our components uses Parquet V2.
>>>
>>> On Mon, Apr 15, 2024 at 7:09 PM Ryan Blue  wrote:
>>>
 Hi Prem,

 Parquet v1 is the default because v2 has not been finalized and adopted
 by the community. I highly recommend not using v2 encodings at this time.

 Ryan

 On Mon, Apr 15, 2024 at 3:05 PM Prem Sahoo 
 wrote:

> I am using Spark 3.2.0, but my Spark package comes with parquet-mr
> 1.2.1, which writes Parquet version 1, not version 2 :(. So I was
> looking at how to write Parquet version 2?
>
> On Mon, Apr 15, 2024 at 5:05 PM Mich Talebzadeh <
> mich.talebza...@gmail.com> wrote:
>
>> Sorry you have a point there. It was released in version 3.00. What
>> version of spark are you using?
>>
>> Technologist | Solutions Architect | Data Engineer  | Generative AI
>> London
>> United Kingdom
>>
>>
>>view my Linkedin profile
>> 
>>
>>
>>  https://en.everybodywiki.com/Mich_Talebzadeh

Re: Which version of spark version supports parquet version 2 ?

2024-04-16 Thread Prem Sahoo
Hello Community,
Could any of you shed some light on the below questions please?
Sent from my iPhone

On Apr 15, 2024, at 9:02 PM, Prem Sahoo wrote:
[quoted thread elided; plain-text copies of the same messages appear
elsewhere in this archive]



Re: Which version of spark version supports parquet version 2 ?

2024-04-15 Thread Prem Sahoo
Any specific reason Spark does not support, or the community doesn't want to
go to, Parquet V2, which is more optimized and whose reads and writes are
much faster (from another component which I am using)?

On Mon, Apr 15, 2024 at 7:55 PM Ryan Blue  wrote:

> Spark will read data written with v2 encodings just fine. You just don't
> need to worry about making Spark produce v2. And you should probably also
> not produce v2 encodings from other systems.
>
> On Mon, Apr 15, 2024 at 4:37 PM Prem Sahoo  wrote:
>
>> Oops, so Spark does not support Parquet V2 at the moment? We have a use
>> case where we need Parquet V2, as one of our components uses Parquet V2.
>>
>> On Mon, Apr 15, 2024 at 7:09 PM Ryan Blue  wrote:
>>
>>> Hi Prem,
>>>
>>> Parquet v1 is the default because v2 has not been finalized and adopted
>>> by the community. I highly recommend not using v2 encodings at this time.
>>>
>>> Ryan
>>>
>>> On Mon, Apr 15, 2024 at 3:05 PM Prem Sahoo  wrote:
>>>
 I am using Spark 3.2.0, but my Spark package comes with parquet-mr
 1.2.1, which writes Parquet version 1, not version 2 :(. So I was
 looking at how to write Parquet version 2?

 On Mon, Apr 15, 2024 at 5:05 PM Mich Talebzadeh <
 mich.talebza...@gmail.com> wrote:

> Sorry you have a point there. It was released in version 3.00. What
> version of spark are you using?
>
> Technologist | Solutions Architect | Data Engineer  | Generative AI
> London
> United Kingdom
>
>
>view my Linkedin profile
> 
>
>
>  https://en.everybodywiki.com/Mich_Talebzadeh
>
>
>
> *Disclaimer:* The information provided is correct to the best of my
> knowledge but of course cannot be guaranteed . It is essential to note
> that, as with any advice, quote "one test result is worth one-thousand
> expert opinions (Werner
> Von Braun
> )".
>
>
> On Mon, 15 Apr 2024 at 21:33, Prem Sahoo  wrote:
>
>> Thank you so much for the info! But do we have any release notes
>> where it says Spark 2.4.0 onwards supports parquet version 2? I was under
>> the impression that Spark 3.0 onwards started supporting it.
>>
>>
>>
>>
>> On Mon, Apr 15, 2024 at 4:28 PM Mich Talebzadeh <
>> mich.talebza...@gmail.com> wrote:
>>
>>> Well if I am correct, Parquet version 2 support was introduced in
>>> Spark version 2.4.0. Therefore, any version of Spark starting from 2.4.0
>>> supports Parquet version 2. Assuming that you are using Spark version
>>> 2.4.0 or later, you should be able to take advantage of Parquet version 
>>> 2
>>> features.
>>>
>>> HTH
>>>
>>> Mich Talebzadeh,
>>> Technologist | Solutions Architect | Data Engineer  | Generative AI
>>> London
>>> United Kingdom
>>>
>>>
>>>view my Linkedin profile
>>> 
>>>
>>>
>>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>>
>>>
>>>
>>> *Disclaimer:* The information provided is correct to the best of my
>>> knowledge but of course cannot be guaranteed . It is essential to note
>>> that, as with any advice, quote "one test result is worth one-thousand
>>> expert opinions (Werner
>>> Von Braun
>>> )".
>>>
>>>
>>> On Mon, 15 Apr 2024 at 20:53, Prem Sahoo 
>>> wrote:
>>>
 Thank you for the information!
 I can use any version of parquet-mr to produce parquet file.

 regarding 2nd question .
 Which version of spark is supporting parquet version 2?
 May I get the release notes where parquet versions are mentioned ?


 On Mon, Apr 15, 2024 at 2:34 PM Mich Talebzadeh <
 mich.talebza...@gmail.com> wrote:

> Parquet-mr is a Java library that provides functionality for
> working with Parquet files with hadoop. It is therefore  more geared
> towards working with Parquet files within the Hadoop ecosystem,
> particularly using MapReduce jobs. There is no definitive way to check
> exact compatible versions within the library itself. However, you can 
> have
> a look at this
>
> https://github.com/apache/parquet-mr/blob/master/CHANGES.md
>
> HTH
>
> Mich Talebzadeh,
> Technologist | Solutions Architect | Data Engineer  | Generative AI
> London
> United Kingdom
>
>
>view my Linkedin profile
> 
>
>
>  

Re: Which version of spark version supports parquet version 2 ?

2024-04-15 Thread Ryan Blue
Spark will read data written with v2 encodings just fine. You just don't
need to worry about making Spark produce v2. And you should probably also
not produce v2 encodings from other systems.

On Mon, Apr 15, 2024 at 4:37 PM Prem Sahoo  wrote:

> Oops, so Spark does not support Parquet V2 at the moment? We have a use
> case where we need Parquet V2, as one of our components uses Parquet V2.
>
> On Mon, Apr 15, 2024 at 7:09 PM Ryan Blue  wrote:
>
>> Hi Prem,
>>
>> Parquet v1 is the default because v2 has not been finalized and adopted
>> by the community. I highly recommend not using v2 encodings at this time.
>>
>> Ryan
>>
>> On Mon, Apr 15, 2024 at 3:05 PM Prem Sahoo  wrote:
>>
>>> I am using Spark 3.2.0, but my Spark package comes with parquet-mr
>>> 1.2.1, which writes Parquet version 1, not version 2 :(. So I was
>>> looking at how to write Parquet version 2?
>>>
>>> On Mon, Apr 15, 2024 at 5:05 PM Mich Talebzadeh <
>>> mich.talebza...@gmail.com> wrote:
>>>
 Sorry you have a point there. It was released in version 3.00. What
 version of spark are you using?

 Technologist | Solutions Architect | Data Engineer  | Generative AI
 London
 United Kingdom


view my Linkedin profile
 


  https://en.everybodywiki.com/Mich_Talebzadeh



 *Disclaimer:* The information provided is correct to the best of my
 knowledge but of course cannot be guaranteed . It is essential to note
 that, as with any advice, quote "one test result is worth one-thousand
 expert opinions (Werner
 Von Braun
 )".


 On Mon, 15 Apr 2024 at 21:33, Prem Sahoo  wrote:

> Thank you so much for the info! But do we have any release notes where
> it says Spark 2.4.0 onwards supports parquet version 2? I was under the
> impression that Spark 3.0 onwards started supporting it.
>
>
>
>
> On Mon, Apr 15, 2024 at 4:28 PM Mich Talebzadeh <
> mich.talebza...@gmail.com> wrote:
>
>> Well if I am correct, Parquet version 2 support was introduced in
>> Spark version 2.4.0. Therefore, any version of Spark starting from 2.4.0
>> supports Parquet version 2. Assuming that you are using Spark version
>> 2.4.0 or later, you should be able to take advantage of Parquet version 2
>> features.
>>
>> HTH
>>
>> Mich Talebzadeh,
>> Technologist | Solutions Architect | Data Engineer  | Generative AI
>> London
>> United Kingdom
>>
>>
>>view my Linkedin profile
>> 
>>
>>
>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>
>>
>>
>> *Disclaimer:* The information provided is correct to the best of my
>> knowledge but of course cannot be guaranteed . It is essential to note
>> that, as with any advice, quote "one test result is worth one-thousand
>> expert opinions (Werner
>> Von Braun
>> )".
>>
>>
>> On Mon, 15 Apr 2024 at 20:53, Prem Sahoo 
>> wrote:
>>
>>> Thank you for the information!
>>> I can use any version of parquet-mr to produce parquet file.
>>>
>>> regarding 2nd question .
>>> Which version of spark is supporting parquet version 2?
>>> May I get the release notes where parquet versions are mentioned ?
>>>
>>>
>>> On Mon, Apr 15, 2024 at 2:34 PM Mich Talebzadeh <
>>> mich.talebza...@gmail.com> wrote:
>>>
 Parquet-mr is a Java library that provides functionality for
 working with Parquet files with hadoop. It is therefore  more geared
 towards working with Parquet files within the Hadoop ecosystem,
 particularly using MapReduce jobs. There is no definitive way to check
 exact compatible versions within the library itself. However, you can 
 have
 a look at this

 https://github.com/apache/parquet-mr/blob/master/CHANGES.md

 HTH

 Mich Talebzadeh,
 Technologist | Solutions Architect | Data Engineer  | Generative AI
 London
 United Kingdom


view my Linkedin profile
 


  https://en.everybodywiki.com/Mich_Talebzadeh



 *Disclaimer:* The information provided is correct to the best of
 my knowledge but of course cannot be guaranteed . It is essential to 
 note
 that, as with any advice, quote "one test result is worth one-thousand
 expert opinions (Werner
 Von Braun
 

Re: Which version of spark version supports parquet version 2 ?

2024-04-15 Thread Prem Sahoo
Oops, so Spark does not support Parquet V2 at the moment? We have a use case
where we need Parquet V2, as one of our components uses Parquet V2.
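For reference, the route the thread eventually converged on is the parquet-mr writer property `parquet.writer.version`, set through Spark's Hadoop configuration. This is a configuration sketch only, not tested here, and subject to Ryan's caveat about producing v2 encodings:

```python
from pyspark.sql import SparkSession

# Keys prefixed with "spark.hadoop." are copied into the Hadoop
# Configuration that the parquet-mr writer reads.
spark = (SparkSession.builder
         .appName("ParquetV2Write")
         .config("spark.hadoop.parquet.writer.version", "v2")
         .getOrCreate())

# Equivalent, on an already-running session (goes through a private JVM
# handle, so treat it as a workaround rather than a stable API):
spark.sparkContext._jsc.hadoopConfiguration().set("parquet.writer.version", "v2")
```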

On Mon, Apr 15, 2024 at 7:09 PM Ryan Blue  wrote:

> Hi Prem,
>
> Parquet v1 is the default because v2 has not been finalized and adopted by
> the community. I highly recommend not using v2 encodings at this time.
>
> Ryan
>
> On Mon, Apr 15, 2024 at 3:05 PM Prem Sahoo  wrote:
>
>> I am using Spark 3.2.0, but my Spark package comes with parquet-mr 1.2.1,
>> which writes Parquet version 1, not version 2 :(. So I was looking
>> at how to write Parquet version 2?
>>
>> On Mon, Apr 15, 2024 at 5:05 PM Mich Talebzadeh <
>> mich.talebza...@gmail.com> wrote:
>>
>>> Sorry you have a point there. It was released in version 3.00. What
>>> version of spark are you using?
>>>
>>> Technologist | Solutions Architect | Data Engineer  | Generative AI
>>> London
>>> United Kingdom
>>>
>>>
>>>view my Linkedin profile
>>> 
>>>
>>>
>>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>>
>>>
>>>
>>> *Disclaimer:* The information provided is correct to the best of my
>>> knowledge but of course cannot be guaranteed . It is essential to note
>>> that, as with any advice, quote "one test result is worth one-thousand
>>> expert opinions (Werner
>>> Von Braun
>>> )".
>>>
>>>
>>> On Mon, 15 Apr 2024 at 21:33, Prem Sahoo  wrote:
>>>
 Thank you so much for the info! But do we have any release notes where
 it says Spark 2.4.0 onwards supports parquet version 2? I was under the
 impression that Spark 3.0 onwards started supporting it.




 On Mon, Apr 15, 2024 at 4:28 PM Mich Talebzadeh <
 mich.talebza...@gmail.com> wrote:

> Well if I am correct, Parquet version 2 support was introduced in
> Spark version 2.4.0. Therefore, any version of Spark starting from 2.4.0
> supports Parquet version 2. Assuming that you are using Spark version
> 2.4.0 or later, you should be able to take advantage of Parquet version 2
> features.
>
> HTH
>
Re: Which version of spark version supports parquet version 2 ?

2024-04-15 Thread Ryan Blue
Hi Prem,

Parquet v1 is the default because v2 has not been finalized and adopted by
the community. I highly recommend not using v2 encodings at this time.

Ryan

-- 
Ryan Blue
Tabular


Re: Which version of spark version supports parquet version 2 ?

2024-04-15 Thread Prem Sahoo
I am using Spark 3.2.0, but my Spark package comes with parquet-mr 1.2.1,
which writes Parquet version 1, not version 2 :(. So I was looking for how
to write Parquet version 2.



Re: Which version of spark version supports parquet version 2 ?

2024-04-15 Thread Mich Talebzadeh
Sorry you have a point there. It was released in version 3.00. What version
of spark are you using?

Mich Talebzadeh,
Technologist | Solutions Architect | Data Engineer | Generative AI
London
United Kingdom

view my Linkedin profile

https://en.everybodywiki.com/Mich_Talebzadeh

*Disclaimer:* The information provided is correct to the best of my
knowledge but of course cannot be guaranteed. It is essential to note
that, as with any advice, "one test result is worth one-thousand
expert opinions" (Werner von Braun).




Re: Which version of spark version supports parquet version 2 ?

2024-04-15 Thread Prem Sahoo
Thank you so much for the info! But do we have any release notes where it
says Spark 2.4.0 onwards supports Parquet version 2? I was under the
impression it started being supported from Spark 3.0 onwards.






Re: Which version of spark version supports parquet version 2 ?

2024-04-15 Thread Mich Talebzadeh
Well if I am correct, Parquet version 2 support was introduced in Spark
version 2.4.0. Therefore, any version of Spark starting from 2.4.0 supports
Parquet version 2. Assuming that you are using Spark version  2.4.0 or
later, you should be able to take advantage of Parquet version 2 features.

HTH

Mich Talebzadeh,


Re: Which version of spark version supports parquet version 2 ?

2024-04-15 Thread Prem Sahoo
Thank you for the information!
So I can use any version of parquet-mr to produce a Parquet file.

Regarding the 2nd question:
Which version of Spark supports Parquet version 2?
May I get the release notes where the Parquet versions are mentioned?




Re: Which version of spark version supports parquet version 2 ?

2024-04-15 Thread Mich Talebzadeh
Parquet-mr is a Java library that provides functionality for working with
Parquet files in Hadoop. It is therefore more geared towards working with
Parquet files within the Hadoop ecosystem, particularly via MapReduce jobs.
There is no definitive way to check exact compatible versions within the
library itself; however, you can have a look at the changelog:

https://github.com/apache/parquet-mr/blob/master/CHANGES.md

HTH

Mich Talebzadeh,


Which version of spark version supports parquet version 2 ?

2024-04-15 Thread Prem Sahoo
Hello Team,
May I know how to check which version of Parquet is supported by parquet-mr
1.2.1?

Which version of parquet-mr supports Parquet version 2 (V2)?

Which version of Spark supports Parquet version 2?
May I get the release notes where the Parquet versions are mentioned?