PR testing and flaky tests (triggering executions separately)

2021-03-26 Thread Attila Zsolt Piros
Hello everybody,

A commit for a PR triggers three different test executions:

   - SparkPullRequestBuilder
   - SparkPullRequestBuilder-K8s
   - Github actions triggered checks

Each one of these could contain a flaky test.


*Is it already possible to trigger these executions separately?*
If not, I think we could save some computing resources and time by
introducing separate textual commands (for example an argument for "Jenkins
retest this", or some other text for the Github actions checks, as those are
not Jenkins related).

So a commit would still trigger all three, but a manual trigger would be
available for each of the executions separately.

A PR can be merged if all three are successful, but it should not matter
whether they were triggered together for a specific commit or separately, as
long as we have the three check marks for the commit in question.

Best Regards,
Attila


Re: minikube and kubernetes cluster versions for integration testing

2021-03-14 Thread Attila Zsolt Piros
Thanks Shane!

As I promised:
- the PR about documenting the change
- my Spark PR with the Minikube version check and a simpler way to configure
the kubernetes client for integration testing
- the Jira you asked for, about upgrading the infrastructure


Best Regards,
Attila

On Thu, Mar 4, 2021 at 7:29 PM shane knapp ☠  wrote:

> fwiw, upgrading minikube and the associated VM drivers is potentially a
> PITA.
>
> your PR will absolutely be tested before merging.  :)
>
> On Thu, Mar 4, 2021 at 10:13 AM attilapiros 
> wrote:
>
>> Thanks Shane!
>>
>> I can do the documentation task and the Minikube version check can be
>> incorporated into my PR.
>> When my PR is finalized (probably next week) I will create a jira for you
>> and you can set up the test systems and you can even test my PR before
>> merging it. Is this possible / fine for you?
>>
>>
>>
>>
>>
>> --
>> Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/
>>
>> -
>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>
>>
>
> --
> Shane Knapp
> Computer Guy / Voice of Reason
> UC Berkeley EECS Research / RISELab Staff Technical Lead
> https://rise.cs.berkeley.edu
>


Re: Shutdown cleanup of disk-based resources that Spark creates

2021-03-11 Thread Attila Zsolt Piros
I agree with you about extending the documentation around this. Moreover, I
support adding specific unit tests for this.

> There is clearly some demand for Spark to automatically clean up
checkpoints on shutdown

What about what I suggested on the PR? To clean up the checkpoint directory at
shutdown, one can register the directory to be deleted at exit:

 FileSystem fs = FileSystem.get(conf);
 fs.deleteOnExit(checkpointPath);
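
For example, here is a minimal self-contained sketch of the same idea in Scala
(the SparkSession setup and the checkpoint path are made up for illustration;
this is not code from the PR):

    import org.apache.hadoop.fs.{FileSystem, Path}
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("checkpoint-cleanup-sketch")
      .master("local[2]")
      .getOrCreate()

    // hypothetical checkpoint location, used only for this example
    val checkpointDir = "/tmp/checkpoint-demo"
    spark.sparkContext.setCheckpointDir(checkpointDir)

    // register the directory so the Hadoop FileSystem removes it when the
    // JVM shuts down normally (handled by FileSystem's own shutdown hook)
    val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)
    fs.deleteOnExit(new Path(checkpointDir))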

On Thu, Mar 11, 2021 at 6:06 PM Nicholas Chammas 
wrote:

> OK, perhaps the best course of action is to leave the current behavior
> as-is but clarify the documentation for `.checkpoint()` and/or
> `cleanCheckpoints`.
>
> I personally find it confusing that `cleanCheckpoints` doesn't address
> shutdown behavior, and the Stack Overflow links I shared
> <https://issues.apache.org/jira/browse/SPARK-33000> show that many people
> are in the same situation. There is clearly some demand for Spark to
> automatically clean up checkpoints on shutdown. But perhaps that should
> be... a new config? a rejected feature? something else? I dunno.
>
> Does anyone else have thoughts on how to approach this?
>
> On Wed, Mar 10, 2021 at 4:39 PM Attila Zsolt Piros <
> piros.attila.zs...@gmail.com> wrote:
>
>> > Checkpoint data is left behind after a normal shutdown, not just an
>> unexpected shutdown. The PR description includes a simple demonstration of
>> this.
>>
>> I think I might have overemphasized the "unexpected" adjective a bit to
>> show you the value in the current behavior.
>>
>> The feature configured with
>> "spark.cleaner.referenceTracking.cleanCheckpoints" is about out-of-scope
>> references without ANY shutdown.
>>
>> It would be hard to distinguish, at that level (ShutdownHookManager), the
>> unexpected exits from the intentional ones.
>> The user code (run by the driver) could contain a System.exit() which was
>> added by the developer for numerous reasons, so distinguishing unexpected
>> from intentional exits is not really an option.
>> Even a third-party library can contain a System.exit(). Would that be an
>> unexpected exit or an intentional one? You can see it is hard to tell.
>>
>> To test the real feature
>> behind "spark.cleaner.referenceTracking.cleanCheckpoints" you can create a
>> reference within a scope which is then closed: for example, within the body
>> of a function (without a return value), store the reference only in a local
>> variable. After the scope is closed, that is when the caller gets control
>> back in our function example, you have a chance to see the context cleaner
>> working (you might even need to trigger a GC too).
>>
>> On Wed, Mar 10, 2021 at 10:09 PM Nicholas Chammas <
>> nicholas.cham...@gmail.com> wrote:
>>
>>> Checkpoint data is left behind after a normal shutdown, not just an
>>> unexpected shutdown. The PR description includes a simple demonstration of
>>> this.
>>>
>>> If the current behavior is truly intended -- which I find difficult to
>>> believe given how confusing
>>> <https://stackoverflow.com/q/52630858/877069> it
>>> <https://stackoverflow.com/q/60009856/877069> is
>>> <https://stackoverflow.com/q/61454740/877069> -- then at the very least
>>> we need to update the documentation for both `.checkpoint()` and
>>> `cleanCheckpoints` to make that clear.
>>>
>>> > This way even after an unexpected exit the next run of the same app
>>> should be able to pick up the checkpointed data.
>>>
>>> The use case you are describing potentially makes sense. But preserving
>>> checkpoint data after an unexpected shutdown -- even when
>>> `cleanCheckpoints` is set to true -- is a new guarantee that is not
>>> currently expressed in the API or documentation. At least as far as I can
>>> tell.
>>>
>>> On Wed, Mar 10, 2021 at 3:10 PM Attila Zsolt Piros <
>>> piros.attila.zs...@gmail.com> wrote:
>>>
>>>> Hi Nick!
>>>>
>>>> I am not sure you are fixing a problem here. I think what you see as a
>>>> problem is actually an intended behaviour.
>>>>
>>>> Checkpoint data should outlive the unexpected shutdowns. So there is a
>>>> very important difference between a reference going out of scope during a
>>>> normal execution (in this case cleanup is expected, depending on the
>>>> config you mentioned) and a reference going out of scope because of an
>>>> unexpected error (in this case you should keep the checkpoint data).
>>>>
>>>> This way even after an unexpected exit the next run of the same app
>>>> should be able to pick up the checkpointed data.

Re: Shutdown cleanup of disk-based resources that Spark creates

2021-03-10 Thread Attila Zsolt Piros
> Checkpoint data is left behind after a normal shutdown, not just an
unexpected shutdown. The PR description includes a simple demonstration of
this.

I think I might have overemphasized the "unexpected" adjective a bit to show
you the value in the current behavior.

The feature configured with
"spark.cleaner.referenceTracking.cleanCheckpoints" is about out-of-scope
references without ANY shutdown.

It would be hard to distinguish, at that level (ShutdownHookManager), the
unexpected exits from the intentional ones.
The user code (run by the driver) could contain a System.exit() which was
added by the developer for numerous reasons, so distinguishing unexpected from
intentional exits is not really an option.
Even a third-party library can contain a System.exit(). Would that be an
unexpected exit or an intentional one? You can see it is hard to tell.

To test the real feature
behind "spark.cleaner.referenceTracking.cleanCheckpoints" you can create a
reference within a scope which is then closed: for example, within the body of
a function (without a return value), store the reference only in a local
variable. After the scope is closed, that is when the caller gets control back
in our function example, you have a chance to see the context cleaner working
(you might even need to trigger a GC too).
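
A rough sketch of that experiment in Scala (the checkpoint directory and the
sleep are placeholders just for illustration; this is not an actual test from
the code base):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("clean-checkpoints-demo")
      .master("local[2]")
      .config("spark.cleaner.referenceTracking.cleanCheckpoints", "true")
      .getOrCreate()
    val sc = spark.sparkContext
    sc.setCheckpointDir("/tmp/checkpoint-demo")  // hypothetical directory

    // the RDD reference lives only inside this function's scope
    def checkpointInLocalScope(): Unit = {
      val rdd = sc.parallelize(1 to 1000)
      rdd.checkpoint()
      rdd.count()  // materializes the checkpoint on disk
    }

    checkpointInLocalScope()
    // once the reference is unreachable and collected, the ContextCleaner can
    // remove the checkpoint files asynchronously
    System.gc()
    Thread.sleep(5000)  // then inspect /tmp/checkpoint-demo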

On Wed, Mar 10, 2021 at 10:09 PM Nicholas Chammas <
nicholas.cham...@gmail.com> wrote:

> Checkpoint data is left behind after a normal shutdown, not just an
> unexpected shutdown. The PR description includes a simple demonstration of
> this.
>
> If the current behavior is truly intended -- which I find difficult to
> believe given how confusing <https://stackoverflow.com/q/52630858/877069>
> it <https://stackoverflow.com/q/60009856/877069> is
> <https://stackoverflow.com/q/61454740/877069> -- then at the very least
> we need to update the documentation for both `.checkpoint()` and
> `cleanCheckpoints` to make that clear.
>
> > This way even after an unexpected exit the next run of the same app
> should be able to pick up the checkpointed data.
>
> The use case you are describing potentially makes sense. But preserving
> checkpoint data after an unexpected shutdown -- even when
> `cleanCheckpoints` is set to true -- is a new guarantee that is not
> currently expressed in the API or documentation. At least as far as I can
> tell.
>
> On Wed, Mar 10, 2021 at 3:10 PM Attila Zsolt Piros <
> piros.attila.zs...@gmail.com> wrote:
>
>> Hi Nick!
>>
>> I am not sure you are fixing a problem here. I think what you see as a
>> problem is actually an intended behaviour.
>>
>> Checkpoint data should outlive the unexpected shutdowns. So there is a
>> very important difference between a reference going out of scope during a
>> normal execution (in this case cleanup is expected, depending on the config
>> you mentioned) and a reference going out of scope because of an unexpected
>> error (in this case you should keep the checkpoint data).
>>
>> This way even after an unexpected exit the next run of the same app
>> should be able to pick up the checkpointed data.
>>
>> Best Regards,
>> Attila
>>
>>
>>
>>
>> On Wed, Mar 10, 2021 at 8:10 PM Nicholas Chammas <
>> nicholas.cham...@gmail.com> wrote:
>>
>>> Hello people,
>>>
>>> I'm working on a fix for SPARK-33000
>>> <https://issues.apache.org/jira/browse/SPARK-33000>. Spark does not
>>> cleanup checkpointed RDDs/DataFrames on shutdown, even if the appropriate
>>> configs are set.
>>>
>>> In the course of developing a fix, another contributor pointed out
>>> <https://github.com/apache/spark/pull/31742#issuecomment-790987483>
>>> that checkpointed data may not be the only type of resource that needs a
>>> fix for shutdown cleanup.
>>>
>>> I'm looking for a committer who might have an opinion on how Spark
>>> should clean up disk-based resources on shutdown. The last people who
>>> contributed significantly to the ContextCleaner, where this cleanup
>>> happens, were @witgo <https://github.com/witgo> and @andrewor14
>>> <https://github.com/andrewor14>. But that was ~6 years ago, and I don't
>>> think they are active on the project anymore.
>>>
>>> Any takers to take a look and give their thoughts? The PR is small
>>> <https://github.com/apache/spark/pull/31742>. +39 / -2.
>>>
>>> Nick
>>>
>>>


Re: Shutdown cleanup of disk-based resources that Spark creates

2021-03-10 Thread Attila Zsolt Piros
Hi Nick!

I am not sure you are fixing a problem here. I think what you see as a problem
is actually an intended behaviour.

Checkpoint data should outlive the unexpected shutdowns. So there is a very
important difference between a reference going out of scope during a normal
execution (in this case cleanup is expected, depending on the config you
mentioned) and a reference going out of scope because of an unexpected error
(in this case you should keep the checkpoint data).

This way even after an unexpected exit the next run of the same app should
be able to pick up the checkpointed data.

Best Regards,
Attila




On Wed, Mar 10, 2021 at 8:10 PM Nicholas Chammas 
wrote:

> Hello people,
>
> I'm working on a fix for SPARK-33000
> <https://issues.apache.org/jira/browse/SPARK-33000>. Spark does not
> cleanup checkpointed RDDs/DataFrames on shutdown, even if the appropriate
> configs are set.
>
> In the course of developing a fix, another contributor pointed out
> <https://github.com/apache/spark/pull/31742#issuecomment-790987483> that
> checkpointed data may not be the only type of resource that needs a fix for
> shutdown cleanup.
>
> I'm looking for a committer who might have an opinion on how Spark should
> clean up disk-based resources on shutdown. The last people who contributed
> significantly to the ContextCleaner, where this cleanup happens, were
> @witgo <https://github.com/witgo> and @andrewor14
> <https://github.com/andrewor14>. But that was ~6 years ago, and I don't
> think they are active on the project anymore.
>
> Any takers to take a look and give their thoughts? The PR is small
> <https://github.com/apache/spark/pull/31742>. +39 / -2.
>
> Nick
>
>


minikube and kubernetes cluster versions for integration testing

2021-03-02 Thread Attila Zsolt Piros
Hi All,

I am working on a PR to change the kubernetes integration testing to use the
`minikube kubectl -- config view --minify` output to build the kubernetes
client config.
This solution has the advantage of not using hardcoded values like 8443 for
the server port (which is wrong when the vm-driver is docker, as the port in
that case is 32788 by default).
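
To show the idea, here is a rough sketch of how such a config could be built
(this is not the actual code from my PR; the fabric8 client calls are only my
assumption of how it would be wired up):

    import scala.sys.process._
    import io.fabric8.kubernetes.client.{Config, DefaultKubernetesClient}

    // ask Minikube for the minified kubeconfig of the running cluster
    val kubeconfig =
      Seq("minikube", "kubectl", "--", "config", "view", "--minify").!!

    // build the client config from that output instead of hardcoding host/port
    val config = Config.fromKubeconfig(kubeconfig)
    val client = new DefaultKubernetesClient(config)
    println(client.getMasterUrl)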

But my question is a bit more generic than my PR: it is about the supported
Minikube versions and kubernetes cluster versions, which is why I decided to
write this mail.

To test this new solution I have created a shell script to install each
Minikube version one by one, start a kubernetes cluster, and view the config
with the command above.
Running the test I found some issues.

Currently for k8s testing we suggest using *minikube version v0.34.1 or
greater* with *kubernetes version v1.15.12* (for details check the "Testing
K8S" section on the developer tools page).


*I have the following three findings:*
1) Looking at the Minikube documentation I came across an advice about
checking which kubernetes cluster versions are supported for a given Minikube
version:

*"For up to date information on supported versions,
see OldestKubernetesVersion and NewestKubernetesVersion in constants.go"*

I think it would be a good idea to follow the official support matrix
of Minikube, so I have collected some relevant versions into this table (the
links navigate to the relevant lines in `constants.go`):
 minikube version      |           kubernetes version           |
                       |  oldest   |  newest   |  default       |
 ----------------------+-----------+-----------+----------------+
 v0.34.1               |  ???      |  ???      |  v1.13.3       |
 v1.1.0 (22 May 2019)  |  v1.10.13 |  v1.14.2  |  v1.14.2       |
 v1.2.0                |  v1.10.13 |  v1.15.0  |  v1.15.0       |
 v1.3.0 (6 Aug 2019)   |  v1.10.13 |  v1.15.2  |  v1.15.2       |
 v1.6.0 (11 Dec 2019)  |  v1.11.10 |  v1.17.0  |  v1.17.0       |
 v1.7.3 (8 Feb 2020)   |  v1.11.10 |  v1.17.3  |  v1.17.3       |
 v1.13.1               |  v1.13.0  |  v1.19.2  |  v1.19.2       |
 v1.17.1               |  v1.13.0  |  v1.20.2  |  v1.20.3-rc.0  |


Looking at this we can see that if we intend to support v1.15.12 as the
kubernetes version, we should drop everything under v1.3.0.

2) I would suggest dropping v1.15.12 as the kubernetes version because of this
issue (I just found it by running my script).

3) On Minikube v1.7.2 there is this permission denied issue, so I suggest
supporting Minikube version v1.7.3 and greater.

My test script is check_minikube_versions.zsh.
It was executed on macOS but with a simple sed expression it can be tailored
to Linux too.



*After all of this, my questions:*
*A) What about changing the required versions and suggesting kubernetes
v1.17.3 and Minikube v1.7.3 and greater for integration testing?*

I would choose v1.17.3 for the k8s cluster as that is the newest k8s version
supported by Minikube v1.7.3 (hoping it will be good for us for a long time).
If you agree with this suggestion I will go ahead and update the relevant
documentation.



*B) How about extending the integration tests to check whether the Minikube
version is sufficient?* This way we can provide a meaningful error when the
requirement is violated.
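
A small sketch of what I mean (the parsing and the required version constant
are made up for illustration; the real check would live in the integration
test setup):

    import scala.sys.process._

    // "minikube version" prints a line like: minikube version: v1.7.3
    val output = "minikube version".!!
    val versionPattern = """v(\d+)\.(\d+)\.(\d+)""".r

    versionPattern.findFirstMatchIn(output) match {
      case Some(m) =>
        val (major, minor, patch) =
          (m.group(1).toInt, m.group(2).toInt, m.group(3).toInt)
        // require at least v1.7.3 (the version suggested above)
        val sufficient = major > 1 ||
          (major == 1 && (minor > 7 || (minor == 7 && patch >= 3)))
        require(sufficient,
          s"Minikube v$major.$minor.$patch is too old, please use v1.7.3 or greater")
      case None =>
        sys.error(s"Could not parse the Minikube version from: $output")
    }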

Bests,
Attila


Re: Please use Jekyll via "bundle exec" from now on

2021-02-23 Thread Attila Zsolt Piros
With this commit
<https://github.com/apache/spark-website/commit/1f275fe7e6ee605165bfed2cb6c5d7d2558d8c4d>
a GitHub workflow was introduced for the PRs of the *spark-website* repo.

It contains an action to check:

   - whether the generation was complete (i.e. the site contains the generated
   HTML for the latest versions of the markdown files)
   - whether the right version of Jekyll was used


On Thu, Feb 18, 2021 at 10:17 AM Attila Zsolt Piros <
piros.attila.zs...@gmail.com> wrote:

> Hello everybody,
>
> To pin the exact same version of Jekyll across all the contributors, Ruby
> Bundler is introduced.
> This way the differences in the generated documentation, which were caused
> by using different Jekyll versions, are avoided.
>
> Regarding its usage this simply means an extra prefix "*bundle exec*"
> must be added to call Jekyll, so:
> - instead of "jekyll build" please use "*bundle exec jekyll build*"
> - instead of "jekyll serve --watch" please use "*bundle exec jekyll serve
> --watch*"
>
> If you are using an earlier Ruby version, please install Bundler with "*gem
> install bundler*" (as of Ruby 2.6 Bundler is part of core Ruby).
>
> This applies to both the "apache/spark" and "apache/spark-website"
> repositories, where all the documentation has been updated accordingly.
>
> For details please check the PRs introducing this:
> - https://github.com/apache/spark/pull/31559
> - https://github.com/apache/spark-website/pull/303
>
> Sincerely,
> Attila Piros
>


Please use Jekyll via "bundle exec" from now on

2021-02-18 Thread Attila Zsolt Piros
Hello everybody,

To pin the exact same version of Jekyll across all the contributors, Ruby
Bundler is introduced.
This way the differences in the generated documentation, which were caused
by using different Jekyll versions, are avoided.

Regarding its usage this simply means an extra prefix "*bundle exec*" must
be added to call Jekyll, so:
- instead of "jekyll build" please use "*bundle exec jekyll build*"
- instead of "jekyll serve --watch" please use "*bundle exec jekyll serve
--watch*"

If you are using an earlier Ruby version, please install Bundler with "*gem
install bundler*" (as of Ruby 2.6 Bundler is part of core Ruby).

This applies to both the "apache/spark" and "apache/spark-website"
repositories, where all the documentation has been updated accordingly.

For details please check the PRs introducing this:
- https://github.com/apache/spark/pull/31559
- https://github.com/apache/spark-website/pull/303

Sincerely,
Attila Piros