Re: [DISCUSS] Un-deprecate Trigger.Once

2024-04-19 Thread Dongjoon Hyun
For that case, I believe it's enough for us to revise the deprecation
message, making it clear that Apache Spark will keep Trigger.Once without
removal, for backward-compatibility purposes only. That's what the users
asked for, isn't it?

> deprecation of Trigger.Once makes users worry that the trigger will be
> removed soon (though we rarely remove public APIs).

The feature was deprecated in Apache Spark 3.4.0, and un-deprecating it may
cause another kind of confusion in the community, not only for Trigger.Once
but also for all historic `Deprecated` items.

Dongjoon.


On Fri, Apr 19, 2024 at 7:44 PM Jungtaek Lim 
wrote:

> Hi dev,
>
> I'd like to raise a discussion to un-deprecate Trigger.Once in future
> releases.
>
> I proposed deprecating Trigger.Once because it's semantically broken,
> and we made that change, but we've since realized that there are users who
> strictly require the behavior of Trigger.Once (run only a single batch,
> for whatever reason) despite the semantic issue, and the workaround with
> Trigger.AvailableNow is arguably much more hacky, or sometimes not even
> possible.
>
> I still think we have to advise using Trigger.AvailableNow whenever
> feasible, but deprecation of Trigger.Once makes users worry that the
> trigger will be removed soon (though we rarely remove public APIs). So a
> warning log on usage sounds like a reasonable alternative to me.
>
> Thoughts?
>
> Thanks,
> Jungtaek Lim (HeartSaVioR)
>


[DISCUSS] Un-deprecate Trigger.Once

2024-04-19 Thread Jungtaek Lim
Hi dev,

I'd like to raise a discussion to un-deprecate Trigger.Once in future
releases.

I proposed deprecating Trigger.Once because it's semantically broken,
and we made that change, but we've since realized that there are users who
strictly require the behavior of Trigger.Once (run only a single batch,
for whatever reason) despite the semantic issue, and the workaround with
Trigger.AvailableNow is arguably much more hacky, or sometimes not even
possible.

I still think we have to advise using Trigger.AvailableNow whenever
feasible, but deprecation of Trigger.Once makes users worry that the
trigger will be removed soon (though we rarely remove public APIs). So a
warning log on usage sounds like a reasonable alternative to me.

Thoughts?

Thanks,
Jungtaek Lim (HeartSaVioR)


[DISCUSS] SPIP: Stored Procedures API for Catalogs

2024-04-19 Thread Anton Okolnychyi
Hi folks,

I'd like to start a discussion on SPARK-44167 that aims to enable catalogs
to expose custom routines as stored procedures. I believe this
functionality will enhance Spark’s ability to interact with external
connectors and allow users to perform more operations in plain SQL.

SPIP [1] contains proposed API changes and parser extensions. Any feedback
is more than welcome!

Unlike the initial proposal for stored procedures with Python [2], this one
focuses on exposing pre-defined stored procedures via the catalog API. This
approach is inspired by a similar functionality in Trino and avoids the
challenges of supporting user-defined routines discussed earlier [3].

Liang-Chi was kind enough to shepherd this effort. Thanks!

- Anton

[1] -
https://docs.google.com/document/d/1rDcggNl9YNcBECsfgPcoOecHXYZOu29QYFrloo2lPBg/
[2] -
https://docs.google.com/document/d/1ce2EZrf2BxHu7TjfGn4TgToK3TBYYzRkmsIVcfmkNzE/
[3] - https://lists.apache.org/thread/lkjm9r7rx7358xxn2z8yof4wdknpzg3l


Re: Which version of spark version supports parquet version 2 ?

2024-04-19 Thread Steve Loughran
Those are some quite good improvements, but committing to storing all your
data in an unstable format is, well, "bold". For temporary data as part of
a workflow, though, it could be appealing.

Now, assuming you are going to be working with S3, you might want to start
by merging PARQUET-2117 into your version, as it delivers tangible
speedups through parallel GETs of different range downloads, at least
according to our most recent test runs.

[image: Screenshot 2024-04-12 at 11.38.02 AM.png]

What would be interesting is to see how the two combine: v2 and Java 21 AVX
processing of data, plus the 4x improvement in data retrieval (we limit the
number of active requests per stream to realistic numbers you can use in
production, FWIW).

See also: An Empirical Evaluation of Columnar Storage Formats
https://arxiv.org/abs/2304.05028

On Thu, 18 Apr 2024 at 08:31, Bjørn Jørgensen 
wrote:

> " *Release 24.3 of Dremio will continue to write Parquet V1, since an
> average performance degradation of 1.5% was observed in writes and 6.5% was
> observed in queries when TPC-DS data was written using Parquet V2 instead
> of Parquet V1.  The aforementioned query performance tests utilized the C3
> cache to store data.*"
> (...)
> "*Users can enable Parquet V2 on write using the following configuration
> key.*
>
> ALTER SYSTEM SET "store.parquet.writer.version" = 'v2' "
>
> https://www.dremio.com/blog/vectorized-reading-of-parquet-v2-improves-performance-up-to-75/
>
> "*Java Vector API support*
>
>
>
>
>
>
>
>
>
>
>
> *The feature is experimental and is currently not part of the parquet
> distribution. Parquet-MR has supported Java Vector API to speed up reading,
> to enable this feature:Java 17+, 64-bitRequiring the CPU to support
> instruction sets:avx512vbmiavx512_vbmi2To build the jars: mvn clean package
> -P vector-pluginsFor Apache Spark to enable this feature:Build parquet and
> replace the parquet-encoding-{VERSION}.jar on the spark jars folderBuild
> parquet-encoding-vector and copy parquet-encoding-vector-{VERSION}.jar to
> the spark jars folderEdit spark class#VectorizedRleValuesReader,
> function#readNextGroup refer to parquet class#ParquetReadRouter,
> function#readBatchUsing512VectorBuild spark with maven and replace
> spark-sql_2.12-{VERSION}.jar on the spark jars folder*"
>
>
> https://github.com/apache/parquet-mr?tab=readme-ov-file#java-vector-api-support
>
> You are using Spark 3.2.0.
> Spark 3.2.4 was released on April 13, 2023:
> https://spark.apache.org/releases/spark-release-3-2-4.html
> You are using a Spark version that is EOL.
>
> tor. 18. apr. 2024 kl. 00:25 skrev Prem Sahoo :
>
>> Hello Ryan,
>> May I know how you can write Parquet V2 encodings from Spark 3.2.0? To
>> my knowledge, Dremio is creating and reading Parquet V2.
>> "Apache Parquet-MR Writer version PARQUET_2_0, which is widely adopted
>> by engines that write Parquet data, supports delta encodings. However,
>> these encodings were not previously supported by Dremio's vectorized
>> Parquet reader, resulting in decreased speed. Now, in version 24.3 and
>> Dremio Cloud, when you use the Dremio SQL query engine on Parquet datasets,
>> you’ll receive best-in-class performance."
>>
>> Could you let me know where the Parquet community states that it does
>> not recommend Parquet V2?
>>
>>
>>
>> On Wed, Apr 17, 2024 at 2:44 PM Ryan Blue  wrote:
>>
>>> Prem, as I said earlier, v2 is not a finalized spec so you should not
>>> use it. That's why it is not the default. You can get Spark to write v2
>>> files, but it isn't recommended by the Parquet community.
>>>
>>> On Wed, Apr 17, 2024 at 11:05 AM Prem Sahoo 
>>> wrote:
>>>
 Hello Community,
 Could anyone shed more light on this (Spark Supporting Parquet V2)?

 On Tue, Apr 16, 2024 at 3:42 PM Mich Talebzadeh <
 mich.talebza...@gmail.com> wrote:

> Hi Prem,
>
> Regrettably, this is not my area of speciality. I trust another
> colleague will have a more informed view. Alternatively, you may
> raise an SPIP for it.
>
> Spark Project Improvement Proposals (SPIP) | Apache Spark
>
> HTH
>
> Mich Talebzadeh,
> Technologist | Solutions Architect | Data Engineer  | Generative AI
> London
> United Kingdom
>
>
>
>
>  https://en.everybodywiki.com/Mich_Talebzadeh
>
>
>
> *Disclaimer:* The information provided is correct to the best of my
> knowledge but of course cannot be guaranteed. It is essential to note
> that, as with any advice, "one test result is worth one-thousand
> expert opinions" (Werner von Braun).
>
>
> On Tue, 16 Apr 2024 at 18:17, Prem Sahoo  wrote:
>
>> Hello Mich,
>> Thanks for