Tools for regression testing

2022-03-21 Thread Mich Talebzadeh
Hi,

As a matter of interest, do Spark releases use a specific regression
testing tool?

Thanks




Re: Apache Spark 3.3 Release

2022-03-21 Thread Tom Graves
 Maybe I'm misunderstanding what you are saying, but according to those dates 
the code freeze, by which the majority of features should be merged, is March 
15th. So if this list is all features that are not merged at this point, we 
should probably discuss whether we want them to go in or whether we need to 
change the dates. Major features going in during the QA period can destabilize 
things.
Tom
On Monday, March 21, 2022, 01:53:24 AM CDT, Wenchen Fan wrote:
 
 I just checked the release calendar; the planned RC cut date is in April.
Let's revisit after 2 weeks then?
On Mon, Mar 21, 2022 at 2:47 PM Wenchen Fan  wrote:

Shall we revisit this list after a week? Ideally, they should be either merged 
or rejected for 3.3, so that we can cut rc1. We can still discuss them case by 
case at that time if there are exceptions.
On Sat, Mar 19, 2022 at 5:27 AM Dongjoon Hyun  wrote:

Thank you for your summarization.

I believe we need to have a discussion in order to evaluate each PR's readiness.

BTW, `branch-3.3` is still open for bug fixes including minor dependency 
changes like the following.

(Backported)
[SPARK-38563][PYTHON] Upgrade to Py4J 0.10.9.4
Revert "[SPARK-38563][PYTHON] Upgrade to Py4J 0.10.9.4"
[SPARK-38563][PYTHON] Upgrade to Py4J 0.10.9.5

(Upcoming)
[SPARK-38544][BUILD] Upgrade log4j2 to 2.17.2 from 2.17.1
[SPARK-38602][BUILD] Upgrade Kafka to 3.1.1 from 3.1.0
Dongjoon.


On Thu, Mar 17, 2022 at 11:22 PM Maxim Gekk  wrote:

Hi All,
Here is the allow list which I built based on your requests in this thread:   
   - SPARK-37396: Inline type hint files for files in python/pyspark/mllib
   - SPARK-37395: Inline type hint files for files in python/pyspark/ml
   - SPARK-37093: Inline type hints python/pyspark/streaming
   - SPARK-37377: Refactor V2 Partitioning interface and remove deprecated 
usage of Distribution
   - SPARK-38085: DataSource V2: Handle DELETE commands for group-based sources
   - SPARK-32268: Bloom Filter Join
   - SPARK-38548: New SQL function: try_sum
   - SPARK-37691: Support ANSI Aggregation Function: percentile_disc
   - SPARK-38063: Support SQL split_part function
   - SPARK-28516: Data Type Formatting Functions: `to_char`
   - SPARK-38432: Refactor framework so as JDBC dialect could compile filter by 
self way
   - SPARK-34863: Support nested column in Spark Parquet vectorized readers
   - SPARK-38194: Make Yarn memory overhead factor configurable
   - SPARK-37618: Support cleaning up shuffle blocks from external shuffle 
service
   - SPARK-37831: Add task partition id in metrics
   - SPARK-37974: Implement vectorized DELTA_BYTE_ARRAY and 
DELTA_LENGTH_BYTE_ARRAY encodings for Parquet V2 support
   - SPARK-36664: Log time spent waiting for cluster resources
   - SPARK-34659: Web UI does not correctly get appId
   - SPARK-37650: Tell spark-env.sh the python interpreter
   - SPARK-38589: New SQL function: try_avg
   - SPARK-38590: New SQL function: try_to_binary
   - SPARK-34079: Improvement CTE table scan

Best regards,
Max Gekk

On Thu, Mar 17, 2022 at 4:59 PM Tom Graves  wrote:

 Is the feature freeze target date March 22nd then? I saw a few dates thrown 
around and want to confirm what we landed on.
I am trying to get the following improvements through review and merged; if 
there are concerns with either, let me know:
- [SPARK-34079][SQL] Merge non-correlated scalar subqueries
- [SPARK-37618][CORE] Remove shuffle blocks using the shuffle service for 
released executors
Tom

On Thursday, March 17, 2022, 07:24:41 AM CDT, Gengliang Wang wrote:
 
 I'd like to add the following new SQL functions in the 3.3 release. These 
functions are useful when overflow or encoding errors occur:
   - [SPARK-38548][SQL] New SQL function: try_sum
   - [SPARK-38589][SQL] New SQL function: try_avg
   - [SPARK-38590][SQL] New SQL function: try_to_binary
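
To illustrate the "useful when overflow occurs" point, here is a minimal
Python sketch of the behaviour, assuming try_sum yields NULL on overflow
instead of raising an error. It is a simplified model and not Spark's
implementation: the function name is reused only for readability, and the
64-bit bounds and the check after every addition are assumptions made for
the example.

```python
from typing import Iterable, Optional

# Bounds of a signed 64-bit integer (assumed here to model a BIGINT column).
INT64_MIN = -(2 ** 63)
INT64_MAX = 2 ** 63 - 1


def try_sum(values: Iterable[Optional[int]]) -> Optional[int]:
    """Sum values, returning None (SQL NULL) instead of failing on overflow."""
    total: Optional[int] = None
    for v in values:
        if v is None:
            continue  # like SQL aggregates, skip NULL inputs
        total = v if total is None else total + v
        if not INT64_MIN <= total <= INT64_MAX:
            return None  # overflow: give back NULL rather than raise
    return total  # stays None when there were no non-NULL inputs
```

With this model, try_sum([1, 2, 3]) gives 6, while try_sum([INT64_MAX, 1])
gives None rather than an error.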

Gengliang
On Thu, Mar 17, 2022 at 7:59 AM Andrew Melo  wrote:

Hello,

I've been trying for a bit to get the following two PRs merged and
into a release, and I'm having some difficulty moving them forward:

https://github.com/apache/spark/pull/34903 - This passes the current
python interpreter to spark-env.sh to allow some currently-unavailable
customization to happen
https://github.com/apache/spark/pull/31774 - This fixes a bug in the
SparkUI reverse proxy-handling code where it does a greedy match for
"proxy" in the URL, and will mistakenly replace the App-ID in the
wrong place.
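
The kind of problem described for the second PR can be sketched in a few
lines of Python. This is only an illustration of matching "proxy" too
loosely versus matching it as a whole path segment; it is not the SparkUI
code, and the function names and the example URL path are made up for the
sketch.

```python
from typing import Optional


def app_id_naive(path: str) -> Optional[str]:
    """Take whatever follows the first occurrence of the text "proxy"."""
    idx = path.find("proxy")
    if idx < 0:
        return None
    rest = path[idx + len("proxy"):].lstrip("/")
    return rest.split("/", 1)[0] or None


def app_id_by_segment(path: str) -> Optional[str]:
    """Take the segment that follows a path segment equal to "proxy"."""
    segments = [s for s in path.split("/") if s]
    for i, seg in enumerate(segments[:-1]):
        if seg == "proxy":
            return segments[i + 1]
    return None


# Hypothetical path in which "proxy" also appears inside an earlier segment.
path = "/my-proxy-gateway/proxy/app-20220317-0001/jobs"
naive_result = app_id_naive(path)
segment_result = app_id_by_segment(path)
```

Here the naive lookup picks up text from the gateway segment, while the
segment-based lookup finds the application ID.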

I'm not exactly sure how to get attention for PRs that have been
sitting around for a while, but these are really important to our
use-cases, and it would be nice to have them merged in.

Cheers
Andrew

On Wed, Mar 16, 2022 at 6:21 PM Holden Karau  wrote:
>
> I'd like to add/backport the logging in 
> https://github.com/apache/spark/pull/35881 PR so that when users submit 
> issues with dynamic allocation we can better debug what's going on.
>
> On Wed, Mar 16, 2022 at 3:45 PM Chao Sun  wrote:
>>
>> There is one item on our side that we want to backport to 3.3:
>> - vectorized D

Re: bazel and external/

2022-03-21 Thread Alkis Evlogimenos
Unless there are objections, I will update the PR tonight to rename
`external` to `connectors`.

On Mon, Mar 21, 2022 at 12:36 PM Wenchen Fan  wrote:

> How about renaming it to `connectors` if docker is the only exception and
> will be moved out?
>
> On Sat, Mar 19, 2022 at 6:18 PM Alkis Evlogimenos
>  wrote:
>
>> It looks like renaming the directory and moving components can be
>> separate steps. If there is consensus that connectors will move out, should
>> the directory be named misc for everything else until there is some
>> direction for the remaining modules?
>>
>> On Fri, 18 Mar 2022 at 03:03 Jungtaek Lim 
>> wrote:
>>
>>> Avro reader is technically a connector. We eventually called data source
>>> implementation "connector" as well; the package name in the catalyst
>>> represents it.
>>>
>>> Docker is something I'm not sure fits with the name "external". It
>>> probably deserves a top level directory now, since we start to release an
>>> official docker image. That does not seem to be an experimental one.
>>>
>>> Except Docker, all modules in the external directory are "sort of"
>>> connectors. Ganglia metric sink is an exception, but it is still a kind of
>>> connector for Dropwizard.
>>> (It might be interesting to see how many users are still using
>>> kinesis-asl and ganglia-lgpl modules. We have had almost no updates for
>>> DStream for several years.)
>>>
>>> If we agree with my proposal for docker, remaining is going to be
>>> effectively a rename. I don't have a strong opinion, just wanted to avoid
>>> the external directory to become/remain miscellaneous one.
>>>
>>> On Fri, Mar 18, 2022 at 10:04 AM Sean Owen  wrote:
>>>
>>>> I sympathize, but might be less change to just rename the dir. There is
>>>> more in there like the avro reader; it's kind of miscellaneous. I think we
>>>> might want fewer rather than more top level dirs.
>>>>
>>>> On Thu, Mar 17, 2022 at 7:33 PM Jungtaek Lim <
>>>> kabhwan.opensou...@gmail.com> wrote:
>>>>
>>>>> We seem to just focus on how to avoid the conflict with the name
>>>>> "external" used in bazel. Since we consider the possibility of renaming,
>>>>> why not revisit the modules "external" contains?
>>>>>
>>>>> Looks like kinds of the modules external directory contains are 1)
>>>>> Docker 2) Connectors 3) Sink on Dropwizard metrics (only ganglia here, and
>>>>> it seems to be just that Ganglia is LGPL)
>>>>>
>>>>> Would it make sense if each kind deserves a top directory? We can
>>>>> probably give better generalized names, and as a side-effect we will no
>>>>> longer have "external".
>>>>>
>>>>> On Fri, Mar 18, 2022 at 5:45 AM Dongjoon Hyun wrote:
>>>>>
>>>>>> Thank you for posting this, Alkis.
>>>>>>
>>>>>> Before the question (1) and (2), I'm curious if the Apache Spark
>>>>>> community has other downstreams using Bazel.
>>>>>>
>>>>>> To All. If there are some Bazel users with Apache Spark code, could
>>>>>> you share your practice? If you are using renaming, what is your renamed
>>>>>> directory name?
>>>>>>
>>>>>> Dongjoon.
>>>>>>
>>>>>>
>>>>>> On Thu, Mar 17, 2022 at 11:56 AM Alkis Evlogimenos wrote:
>>>>>>
>>>>>>> AFAIK there is not. `external` has been baked in bazel since the
>>>>>>> beginning and there is no plan from bazel devs to attempt to fix
>>>>>>> this.
>>>>>>>
>>>>>>> On Thu, Mar 17, 2022 at 7:52 PM Sean Owen wrote:
>>>>>>>
>>>>>>>> Just checking - there is no way to tell bazel to look somewhere
>>>>>>>> else for whatever 'external' means to it?
>>>>>>>> It's a kinda big ugly change but it's not a functional change. If
>>>>>>>> anything it might break some downstream builds that rely on the current
>>>>>>>> structure too. But such is life for developers? I don't have a strong
>>>>>>>> reason we can't.
>>>>>>>>
>>>>>>>> On Thu, Mar 17, 2022 at 1:47 PM Alkis Evlogimenos wrote:
>>>>>>>>
>>>>>>>>> Hi Spark devs.
>>>>>>>>>
>>>>>>>>> The Apache Spark repo has a top level external/ directory. This is
>>>>>>>>> a reserved name for the bazel build system and it causes all sorts of
>>>>>>>>> problems: some can be worked around and some cannot (for some details
>>>>>>>>> on one that cannot, see
>>>>>>>>> https://github.com/hedronvision/bazel-compile-commands-extractor/issues/30
>>>>>>>>> ).
>>>>>>>>>
>>>>>>>>> Some forks of Apache Spark use bazel as a build system. It
>>>>>>>>> would be nice if we can make this change in Apache Spark without
>>>>>>>>> resorting to complex renames/merges whenever changes are pulled from
>>>>>>>>> upstream.
>>>>>>>>>
>>>>>>>>> As such I proposed to rename the external/ directory to something
>>>>>>>>> else [SPARK-38569]. I also sent a tentative [PR-35874] that renames
>>>>>>>>> external/ to vendor/.
>>>>>>>>>
>>>>>>>>> My questions to you a

Re: bazel and external/

2022-03-21 Thread Wenchen Fan
How about renaming it to `connectors` if docker is the only exception and
will be moved out?

On Sat, Mar 19, 2022 at 6:18 PM Alkis Evlogimenos
 wrote:

> It looks like renaming the directory and moving components can be separate
> steps. If there is consensus that connectors will move out, should the
> directory be named misc for everything else until there is some direction
> for the remaining modules?
>
> On Fri, 18 Mar 2022 at 03:03 Jungtaek Lim 
> wrote:
>
>> Avro reader is technically a connector. We eventually called data source
>> implementation "connector" as well; the package name in the catalyst
>> represents it.
>>
>> Docker is something I'm not sure fits with the name "external". It
>> probably deserves a top level directory now, since we start to release an
>> official docker image. That does not seem to be an experimental one.
>>
>> Except Docker, all modules in the external directory are "sort of"
>> connectors. Ganglia metric sink is an exception, but it is still a kind of
>> connector for Dropwizard.
>> (It might be interesting to see how many users are still using
>> kinesis-asl and ganglia-lgpl modules. We have had almost no updates for
>> DStream for several years.)
>>
>> If we agree with my proposal for docker, remaining is going to be
>> effectively a rename. I don't have a strong opinion, just wanted to avoid
>> the external directory to become/remain miscellaneous one.
>>
>> On Fri, Mar 18, 2022 at 10:04 AM Sean Owen  wrote:
>>
>>> I sympathize, but might be less change to just rename the dir. There is
>>> more in there like the avro reader; it's kind of miscellaneous. I think we
>>> might want fewer rather than more top level dirs.
>>>
>>> On Thu, Mar 17, 2022 at 7:33 PM Jungtaek Lim <
>>> kabhwan.opensou...@gmail.com> wrote:
>>>
>>>> We seem to just focus on how to avoid the conflict with the name
>>>> "external" used in bazel. Since we consider the possibility of renaming,
>>>> why not revisit the modules "external" contains?
>>>>
>>>> Looks like kinds of the modules external directory contains are 1)
>>>> Docker 2) Connectors 3) Sink on Dropwizard metrics (only ganglia here, and
>>>> it seems to be just that Ganglia is LGPL)
>>>>
>>>> Would it make sense if each kind deserves a top directory? We can
>>>> probably give better generalized names, and as a side-effect we will no
>>>> longer have "external".
>>>>
>>>> On Fri, Mar 18, 2022 at 5:45 AM Dongjoon Hyun wrote:
>>>>
>>>>> Thank you for posting this, Alkis.
>>>>>
>>>>> Before the question (1) and (2), I'm curious if the Apache Spark
>>>>> community has other downstreams using Bazel.
>>>>>
>>>>> To All. If there are some Bazel users with Apache Spark code, could
>>>>> you share your practice? If you are using renaming, what is your renamed
>>>>> directory name?
>>>>>
>>>>> Dongjoon.
>>>>>
>>>>>
>>>>> On Thu, Mar 17, 2022 at 11:56 AM Alkis Evlogimenos wrote:
>>>>>
>>>>>> AFAIK there is not. `external` has been baked in bazel since the
>>>>>> beginning and there is no plan from bazel devs to attempt to fix this.
>>>>>>
>>>>>> On Thu, Mar 17, 2022 at 7:52 PM Sean Owen wrote:
>>>>>>
>>>>>>> Just checking - there is no way to tell bazel to look somewhere else
>>>>>>> for whatever 'external' means to it?
>>>>>>> It's a kinda big ugly change but it's not a functional change. If
>>>>>>> anything it might break some downstream builds that rely on the current
>>>>>>> structure too. But such is life for developers? I don't have a strong
>>>>>>> reason we can't.
>>>>>>>
>>>>>>> On Thu, Mar 17, 2022 at 1:47 PM Alkis Evlogimenos wrote:
>>>>>>>
>>>>>>>> Hi Spark devs.
>>>>>>>>
>>>>>>>> The Apache Spark repo has a top level external/ directory. This is
>>>>>>>> a reserved name for the bazel build system and it causes all sorts of
>>>>>>>> problems: some can be worked around and some cannot (for some details
>>>>>>>> on one that cannot, see
>>>>>>>> https://github.com/hedronvision/bazel-compile-commands-extractor/issues/30
>>>>>>>> ).
>>>>>>>>
>>>>>>>> Some forks of Apache Spark use bazel as a build system. It would be
>>>>>>>> nice if we can make this change in Apache Spark without resorting to
>>>>>>>> complex renames/merges whenever changes are pulled from upstream.
>>>>>>>>
>>>>>>>> As such I proposed to rename the external/ directory to something
>>>>>>>> else [SPARK-38569]. I also sent a tentative [PR-35874] that renames
>>>>>>>> external/ to vendor/.
>>>>>>>>
>>>>>>>> My questions to you are:
>>>>>>>> 1. Are there any objections to renaming external to X?
>>>>>>>> 2. Is vendor a good new name for external?
>>>>>>>>
>>>>>>>> Cheers,