Re: [PySpark] Revisiting PySpark type annotations

2020-07-21 Thread Hyukjin Kwon
Yeah, I tend to be positive about leveraging the Python type hints in
general.

However, just to clarify, I don’t think we should port the type hints into
the main code yet, but rather think about bringing in Maciej's work, the pyi
files, as stubs. For now, I tend to think adding type hints directly to the
code makes it difficult to backport or revert, and makes it more difficult to
discuss typing on its own, especially considering that the typing work is
arguably still premature.
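
For illustration only, a toy sketch of the difference (the function below is
made up, not a real PySpark API): a pyi stub keeps the hints next to, but
separate from, the implementation, whereas porting them means annotating the
code itself.

    from typing import List

    # Contents of a hypothetical util.pyi stub kept beside an unannotated util.py
    # (the numpy-stubs style):
    #
    #     def repeat(value: str, times: int = ...) -> List[str]: ...
    #
    # versus annotating util.py itself (what porting the hints into the main
    # code would mean):
    def repeat(value: str, times: int = 1) -> List[str]:
        return [value] * times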

It is also interesting to take a look at other projects and how they did
it. I took a look at PySpark's friends such as pandas and NumPy. It seems:

   - In NumPy's case, the hints lived in a separate project, numpy-stubs, and
   were later merged into the main project successfully as pyi files.
   - In pandas' case, I don’t see the work being done yet. I found a related
   issue, but it seems closed.

Another important concern might be generic typing in Spark’s DataFrame, as
an example. Looks like that’s also one of the concerns on the pandas side.
For instance, how would we support variadic generic typing, for
example, DataFrame[int, str, str] or DataFrame[a: int, b: str, c: str]?
Last time I checked, Python didn’t support this; presumably at least Python
3.6 to 3.8 don't.
I am experimenting with this in another project I am working on, but
it requires a bunch of hacks and doesn’t play well with MyPy.
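
As a rough illustration only (a toy class, not Spark's DataFrame), the kind of
hack involved looks roughly like this:

    # __class_getitem__ (Python 3.7+) makes DataFrame[int, str, str] legal at
    # runtime, but MyPy attaches no static meaning to the parameters, which is
    # roughly the problem described above.
    class DataFrame:
        _schema = ()

        def __class_getitem__(cls, item):
            types = item if isinstance(item, tuple) else (item,)
            # Record the requested schema dynamically; nothing is checked statically.
            return type(cls.__name__, (cls,), {"_schema": types})

    Typed = DataFrame[int, str, str]
    print(Typed._schema)  # (<class 'int'>, <class 'str'>, <class 'str'>)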

I currently don't have a strong feeling about it for now, though I tend to
agree. If we do this, I would like to take a more conservative path, such as
keeping some separation for now (e.g., a separate repo under Apache if
feasible, or a separate module), and then seeing how it goes and whether
users like it.



On Wed, Jul 22, 2020 at 6:10 AM, Driesprong, Fokko wrote:

> Fully agree Holden, would be great to include the Outreachy project.
> Adding annotations is a very friendly way to get familiar with the codebase.
>
> I've also created a PR to see what's needed to get mypy in:
> https://github.com/apache/spark/pull/29180 From there on we can start
> adding annotations.
>
> Cheers, Fokko
>
>
> On Tue, 21 Jul 2020 at 21:40, Holden Karau wrote:
>
>> Yeah I think this could be a great project now that we're only Python
>> 3.5+. One potential is making this an Outreachy project to get more folks
>> from different backgrounds involved in Spark.
>>
>> On Tue, Jul 21, 2020 at 12:33 PM Driesprong, Fokko 
>> wrote:
>>
>>> Since we've recently dropped support for Python <=3.5
>>> , I think it would be nice
>>> to add support for type annotations. Having this in the main repository
>>> allows us to do type checking using MyPy  in the
>>> CI itself. 
>>>
>>> This is now handled by the Stub file:
>>> https://www.python.org/dev/peps/pep-0484/#stub-files However I think it
>>> is nicer to integrate the types with the code itself to keep everything in
>>> sync, and make it easier for the people who work on the codebase itself. A
>>> first step would be to move the stubs into the codebase. First step would
>>> be to cover the public API which is the most important one. Having the
>>> types with the code itself makes it much easier to understand. For example,
>>> if you can supply a str or column here:
>>> https://github.com/apache/spark/pull/29122/files#diff-f5295f69bfbdbf6e161aed54057ea36dR2486
>>>
>>> One of the implications would be that future PR's on Python should cover
>>> annotations on the public API's. Curious what the rest of the community
>>> thinks.
>>>
>>> Cheers, Fokko
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>> On Tue, 21 Jul 2020 at 20:04, zero323 wrote:
>>>
 Given a discussion related to  SPARK-32320 PR
    I'd like to resurrect
 this
 thread. Is there any interest in migrating annotations to the main
 repository?



 --
 Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/

 -
 To unsubscribe e-mail: dev-unsubscr...@spark.apache.org


>>
>> --
>> Twitter: https://twitter.com/holdenkarau
>> Books (Learning Spark, High Performance Spark, etc.):
>> https://amzn.to/2MaRAG9  
>> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>>
>


Exposing Spark parallelized directory listing & non-locality listing in core

2020-07-21 Thread Holden Karau
Hi Folks,

In Spark SQL there is the ability to have Spark do its partition
discovery/file listing in parallel on the worker nodes and also avoid
locality lookups. I'd like to expose this in core, but given the Hadoop
APIs it's a bit more complicated to do right. I made a quick POC with two
potential implementation paths and wanted to see if anyone has thoughts -
https://github.com/apache/spark/pull/29179.
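
As a very rough sketch of the idea in PySpark terms (os.scandir stands in for
the Hadoop FileSystem calls a real implementation would use; the paths are
placeholders):

    import os

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("parallel-listing-sketch").getOrCreate()
    sc = spark.sparkContext

    # Directories to list; in the real feature these would come from the input paths.
    paths = ["/data/table/part=1", "/data/table/part=2", "/data/table/part=3"]

    def list_partition(path):
        # List one directory on a worker, returning (path, size) pairs and
        # skipping block-location (locality) lookups entirely.
        try:
            return [(e.path, e.stat().st_size) for e in os.scandir(path) if e.is_file()]
        except FileNotFoundError:
            return []

    files = sc.parallelize(paths, len(paths)).flatMap(list_partition).collect()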

Cheers,

Holden

-- 
Twitter: https://twitter.com/holdenkarau
Books (Learning Spark, High Performance Spark, etc.):
https://amzn.to/2MaRAG9  
YouTube Live Streams: https://www.youtube.com/user/holdenkarau


Re: [PySpark] Revisiting PySpark type annotations

2020-07-21 Thread Driesprong, Fokko
Fully agree Holden, would be great to include the Outreachy project. Adding
annotations is a very friendly way to get familiar with the codebase.

I've also created a PR to see what's needed to get mypy in:
https://github.com/apache/spark/pull/29180 From there on we can start
adding annotations.
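
To make the benefit concrete, a rough sketch of the kind of mistake mypy would
then catch at check time rather than at runtime (the function below is
illustrative, not an existing PySpark API):

    from typing import Optional

    def parse_port(address: str) -> Optional[int]:
        """Return the port of a "host:port" string, or None if absent."""
        _, _, port = address.partition(":")
        return int(port) if port else None

    parse_port("localhost:7077")      # fine
    # parse_port(7077)                # mypy: Argument 1 to "parse_port" has
    #                                 # incompatible type "int"; expected "str"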

Cheers, Fokko


On Tue, 21 Jul 2020 at 21:40, Holden Karau wrote:

> Yeah I think this could be a great project now that we're only Python
> 3.5+. One potential is making this an Outreachy project to get more folks
> from different backgrounds involved in Spark.
>
> On Tue, Jul 21, 2020 at 12:33 PM Driesprong, Fokko 
> wrote:
>
>> Since we've recently dropped support for Python <=3.5
>> , I think it would be nice
>> to add support for type annotations. Having this in the main repository
>> allows us to do type checking using MyPy  in the
>> CI itself. 
>>
>> This is now handled by the Stub file:
>> https://www.python.org/dev/peps/pep-0484/#stub-files However I think it
>> is nicer to integrate the types with the code itself to keep everything in
>> sync, and make it easier for the people who work on the codebase itself. A
>> first step would be to move the stubs into the codebase. First step would
>> be to cover the public API which is the most important one. Having the
>> types with the code itself makes it much easier to understand. For example,
>> if you can supply a str or column here:
>> https://github.com/apache/spark/pull/29122/files#diff-f5295f69bfbdbf6e161aed54057ea36dR2486
>>
>> One of the implications would be that future PR's on Python should cover
>> annotations on the public API's. Curious what the rest of the community
>> thinks.
>>
>> Cheers, Fokko
>>
>>
>>
>>
>>
>>
>>
>>
>>
>> On Tue, 21 Jul 2020 at 20:04, zero323 wrote:
>>
>>> Given a discussion related to  SPARK-32320 PR
>>>    I'd like to resurrect
>>> this
>>> thread. Is there any interest in migrating annotations to the main
>>> repository?
>>>
>>>
>>>
>>> --
>>> Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/
>>>
>>> -
>>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>>
>>>
>
> --
> Twitter: https://twitter.com/holdenkarau
> Books (Learning Spark, High Performance Spark, etc.):
> https://amzn.to/2MaRAG9  
> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>


[DISCUSS] Amend the committer guidelines on the subject of -1s & how we expect PR discussion to be treated.

2020-07-21 Thread Holden Karau
Hi Spark Developers,

There has been a rather active discussion regarding the specific vetoes
that occurred during Spark 3. From that discussion, I believe we are now mostly
in agreement that it would be best to clarify our rules around code vetoes &
merging in general. Personally I believe this change is important to help
improve the appearance of a level playing field in the project.

Once discussion settles I'll run this by a copy editor (my grammar isn't
amazing) and bring it forward for a vote.

The current Spark committer guide is at
https://spark.apache.org/committers.html. I am proposing we add a section
on when it is OK to merge PRs directly above the section on how to merge
PRs. The text I am proposing to amend our committer guidelines with is:

PRs shall not be merged during active on-topic discussion, except for issues
like critical security fixes of a public vulnerability. Under extenuating
circumstances, PRs may be merged during active off-topic discussion and the
discussion directed to a more appropriate venue. Time should be given prior
to merging for those involved with the conversation to explain if they
believe they are on topic.

Lazy consensus requires giving time for discussion to settle, while
understanding that people may not be working on Spark as their full-time
job and may take holidays. It is believed that by doing this we can limit
how often people feel the need to exercise their veto.

For the purposes of a -1 on code changes, qualified voters include all
PMC members and committers in the project. For a -1 to be a valid veto, it
must include a technical reason. The reason can include things like the
change introducing a maintenance burden or not being in the direction of Spark.

If there is a -1 from a non-committer, multiple committers or the PMC
should be consulted before moving forward.


If the original person who cast the veto cannot be reached in a reasonable
time frame, given likely holidays, it is up to the PMC to decide the next
steps within the guidelines of the ASF. This must be decided by a consensus
vote under the ASF voting rules.

These policies serve to reiterate the core principle that code must not be
merged with a pending veto or before a consensus has been reached (lazy or
otherwise).

It is the PMC’s hope that vetoes continue to be infrequent, and that when they
occur, all parties take the time to build consensus prior to additional
feature work.


Being a committer means exercising your judgement while working in a
community with diverse views. There is nothing wrong with getting a second
(or third, or fourth) opinion when you are uncertain. Thank you for your
dedication to the Spark project; it is appreciated by the developers and
users of Spark.


It is hoped that these guidelines do not slow down development; rather, by
removing some of the uncertainty, they should make it easier for us to reach
consensus. If you have ideas on how to improve these guidelines, or other
parts of how the Spark project operates, you should reach out on the dev@
list to start the discussion.



Kind Regards,

Holden

-- 
Twitter: https://twitter.com/holdenkarau
Books (Learning Spark, High Performance Spark, etc.):
https://amzn.to/2MaRAG9  
YouTube Live Streams: https://www.youtube.com/user/holdenkarau


Re: [PySpark] Revisiting PySpark type annotations

2020-07-21 Thread Holden Karau
Yeah I think this could be a great project now that we're only Python 3.5+.
One potential is making this an Outreachy project to get more folks from
different backgrounds involved in Spark.

On Tue, Jul 21, 2020 at 12:33 PM Driesprong, Fokko 
wrote:

> Since we've recently dropped support for Python <=3.5
> , I think it would be nice to
> add support for type annotations. Having this in the main repository allows
> us to do type checking using MyPy  in the CI
> itself. 
>
> This is now handled by the Stub file:
> https://www.python.org/dev/peps/pep-0484/#stub-files However I think it
> is nicer to integrate the types with the code itself to keep everything in
> sync, and make it easier for the people who work on the codebase itself. A
> first step would be to move the stubs into the codebase. First step would
> be to cover the public API which is the most important one. Having the
> types with the code itself makes it much easier to understand. For example,
> if you can supply a str or column here:
> https://github.com/apache/spark/pull/29122/files#diff-f5295f69bfbdbf6e161aed54057ea36dR2486
>
> One of the implications would be that future PR's on Python should cover
> annotations on the public API's. Curious what the rest of the community
> thinks.
>
> Cheers, Fokko
>
>
>
>
>
>
>
>
>
> On Tue, 21 Jul 2020 at 20:04, zero323 wrote:
>
>> Given a discussion related to  SPARK-32320 PR
>>    I'd like to resurrect this
>> thread. Is there any interest in migrating annotations to the main
>> repository?
>>
>>
>>
>> --
>> Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/
>>
>> -
>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>
>>

-- 
Twitter: https://twitter.com/holdenkarau
Books (Learning Spark, High Performance Spark, etc.):
https://amzn.to/2MaRAG9  
YouTube Live Streams: https://www.youtube.com/user/holdenkarau


Re: [PySpark] Revisiting PySpark type annotations

2020-07-21 Thread Driesprong, Fokko
Since we've recently dropped support for Python <=3.5, I think it would be
nice to add support for type annotations. Having this in the main repository
allows us to do type checking using MyPy in the CI itself.

This is currently handled by stub files:
https://www.python.org/dev/peps/pep-0484/#stub-files However, I think it is
nicer to integrate the types with the code itself, to keep everything in
sync and make it easier for the people who work on the codebase. A first
step would be to move the stubs into the codebase, starting with the public
API, which is the most important one. Having the types with the code itself
makes it much easier to understand, for example, whether you can supply a
str or a Column here:
https://github.com/apache/spark/pull/29122/files#diff-f5295f69bfbdbf6e161aed54057ea36dR2486
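
As a rough sketch of what such an inline annotation could look like (the
wrapper function below is made up; only Column and pyspark.sql.functions.upper
are real):

    from typing import Union

    from pyspark.sql import Column
    from pyspark.sql import functions as F

    def shout(col: Union[Column, str]) -> Column:
        """Upper-case a column given either a Column or a column name."""
        return F.upper(col)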

One of the implications would be that future PRs on Python should cover
annotations on the public APIs. Curious what the rest of the community
thinks.

Cheers, Fokko









On Tue, 21 Jul 2020 at 20:04, zero323 wrote:

> Given a discussion related to  SPARK-32320 PR
>    I'd like to resurrect this
> thread. Is there any interest in migrating annotations to the main
> repository?
>
>
>
> --
> Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


Re: [OSS DIGEST] The major changes of Apache Spark from June 3 to June 16

2020-07-21 Thread Holden Karau
Got it, I missed the date in the reading :)

On Tue, Jul 21, 2020 at 11:23 AM Xingbo Jiang  wrote:

> Hi Holden,
>
> This is the digest for commits merged between *June 3 and June 16.* The
> commits you mentioned would be included in the future digests.
>
> Cheers,
>
> Xingbo
>
> On Tue, Jul 21, 2020 at 11:13 AM Holden Karau 
> wrote:
>
>> I'd also add [SPARK-20629][CORE][K8S] Copy shuffle data when nodes are
>> being shutdown &
>>
>> [SPARK-21040][CORE] Speculate tasks which are running on decommission
>> executors two of the PRs merged after the decommissioning SPIP.
>>
>> On Tue, Jul 21, 2020 at 10:53 AM Xingbo Jiang 
>> wrote:
>>
>>> Hi all,
>>>
>>> This is the bi-weekly Apache Spark digest from the Databricks OSS team.
>>> For each API/configuration/behavior change, an *[API] *tag is added in
>>> the title.
>>>
>>> CORE
>>> [3.0][SPARK-31923][CORE]
>>> Ignore internal accumulators that use unrecognized types rather than
>>> crashing (+63, -5)>
>>> 
>>>
>>> A user may name his accumulators using the internal.metrics. prefix, so
>>> that Spark treats them as internal accumulators and hides them from UI. We
>>> should make JsonProtocol.accumValueToJson more robust and let it ignore
>>> internal accumulators that use unrecognized types.
>>>
>>> [API][3.1][SPARK-31486][CORE]
>>> spark.submit.waitAppCompletion flag to control spark-submit exit in
>>> Standalone Cluster Mode (+88, -26)>
>>> 
>>>
>>> This PR implements an application wait mechanism that allows
>>> spark-submit to wait until the application finishes in Standalone mode.
>>> This will delay the exit of spark-submit JVM until the job is
>>> completed. This implementation will keep monitoring the application until
>>> it is either finished, failed, or killed. This will be controlled via the
>>> following conf:
>>>
>>>-
>>>
>>>spark.standalone.submit.waitAppCompletion (Default: false)
>>>
>>>In standalone cluster mode, controls whether the client waits to
>>>exit until the application completes. If set to true, the client
>>>process will stay alive polling the driver's status. Otherwise, the 
>>> client
>>>process will exit after submission.
>>>
>>>
>>> 
>>> SQL
>>> [3.0][SPARK-31220][SQL]
>>> repartition obeys initialPartitionNum when adaptiveExecutionEnabled (+27,
>>> -12)>
>>> 
>>>
>>> AQE and non-AQE use different configs to set the initial shuffle
>>> partition number. This PR fixes repartition/DISTRIBUTE BY so that it
>>> also uses the AQE config
>>> spark.sql.adaptive.coalescePartitions.initialPartitionNum to set the
>>> initial shuffle partition number if AQE is enabled.
>>>
>>> [3.0][SPARK-31867][SQL][FOLLOWUP]
>>> Check result differences for datetime formatting (+51, -8)>
>>> 
>>>
>>> Spark should throw SparkUpgradeException when getting DateTimeException for
>>> datetime formatting in the EXCEPTION legacy Time Parser Policy.
>>>
>>> [API][3.0][SPARK-31879][SPARK-31892][SQL]
>>> Disable week-based pattern letters in datetime parsing/formatting (+1421,
>>> -171)>
>>> 
>>>  (+102,
>>> -48)>
>>> 
>>>
>>> Week-based pattern letters have very weird behaviors during datetime
>>> parsing in Spark 2.4, and it's very hard to simulate the legacy behaviors
>>> with the new API. For formatting, the new API makes the start-of-week
>>> localized, and it's not possible to keep the legacy behaviors. Since the
>>> week-based fields are rarely used, we disable week-based pattern letters in
>>> both parsing and formatting.
>>>
>>> 

Re: [OSS DIGEST] The major changes of Apache Spark from June 3 to June 16

2020-07-21 Thread Xingbo Jiang
Hi Holden,

This is the digest for commits merged between *June 3 and June 16.* The
commits you mentioned will be included in future digests.

Cheers,

Xingbo

On Tue, Jul 21, 2020 at 11:13 AM Holden Karau  wrote:

> I'd also add [SPARK-20629][CORE][K8S] Copy shuffle data when nodes are
> being shutdown &
>
> [SPARK-21040][CORE] Speculate tasks which are running on decommission
> executors two of the PRs merged after the decommissioning SPIP.
>
> On Tue, Jul 21, 2020 at 10:53 AM Xingbo Jiang 
> wrote:
>
>> Hi all,
>>
>> This is the bi-weekly Apache Spark digest from the Databricks OSS team.
>> For each API/configuration/behavior change, an *[API] *tag is added in
>> the title.
>>
>> CORE
>> [3.0][SPARK-31923][CORE]
>> Ignore internal accumulators that use unrecognized types rather than
>> crashing (+63, -5)>
>> 
>>
>> A user may name his accumulators using the internal.metrics. prefix, so
>> that Spark treats them as internal accumulators and hides them from UI. We
>> should make JsonProtocol.accumValueToJson more robust and let it ignore
>> internal accumulators that use unrecognized types.
>>
>> [API][3.1][SPARK-31486][CORE]
>> spark.submit.waitAppCompletion flag to control spark-submit exit in
>> Standalone Cluster Mode (+88, -26)>
>> 
>>
>> This PR implements an application wait mechanism that allows spark-submit to
>> wait until the application finishes in Standalone mode. This will delay the
>> exit of spark-submit JVM until the job is completed. This implementation
>> will keep monitoring the application until it is either finished, failed,
>> or killed. This will be controlled via the following conf:
>>
>>-
>>
>>spark.standalone.submit.waitAppCompletion (Default: false)
>>
>>In standalone cluster mode, controls whether the client waits to exit
>>until the application completes. If set to true, the client process
>>will stay alive polling the driver's status. Otherwise, the client process
>>will exit after submission.
>>
>>
>> 
>> SQL
>> [3.0][SPARK-31220][SQL]
>> repartition obeys initialPartitionNum when adaptiveExecutionEnabled (+27,
>> -12)>
>> 
>>
>> AQE and non-AQE use different configs to set the initial shuffle
>> partition number. This PR fixes repartition/DISTRIBUTE BY so that it
>> also uses the AQE config
>> spark.sql.adaptive.coalescePartitions.initialPartitionNum to set the
>> initial shuffle partition number if AQE is enabled.
>>
>> [3.0][SPARK-31867][SQL][FOLLOWUP]
>> Check result differences for datetime formatting (+51, -8)>
>> 
>>
>> Spark should throw SparkUpgradeException when getting DateTimeException for
>> datetime formatting in the EXCEPTION legacy Time Parser Policy.
>>
>> [API][3.0][SPARK-31879][SPARK-31892][SQL]
>> Disable week-based pattern letters in datetime parsing/formatting (+1421,
>> -171)>
>> 
>>  (+102,
>> -48)>
>> 
>>
>> Week-based pattern letters have very weird behaviors during datetime
>> parsing in Spark 2.4, and it's very hard to simulate the legacy behaviors
>> with the new API. For formatting, the new API makes the start-of-week
>> localized, and it's not possible to keep the legacy behaviors. Since the
>> week-based fields are rarely used, we disable week-based pattern letters in
>> both parsing and formatting.
>>
>> [3.0][SPARK-31896][SQL]
>> Handle am-pm timestamp parsing when hour is missing (+39, -3)>
>> 

Re: [OSS DIGEST] The major changes of Apache Spark from June 3 to June 16

2020-07-21 Thread Holden Karau
I'd also add [SPARK-20629][CORE][K8S] Copy shuffle data when nodes are
being shutdown &

[SPARK-21040][CORE] Speculate tasks which are running on decommission
executors two of the PRs merged after the decommissioning SPIP.

On Tue, Jul 21, 2020 at 10:53 AM Xingbo Jiang  wrote:

> Hi all,
>
> This is the bi-weekly Apache Spark digest from the Databricks OSS team.
> For each API/configuration/behavior change, an *[API] *tag is added in
> the title.
>
> CORE
> [3.0][SPARK-31923][CORE]
> Ignore internal accumulators that use unrecognized types rather than
> crashing (+63, -5)>
> 
>
> A user may name his accumulators using the internal.metrics. prefix, so
> that Spark treats them as internal accumulators and hides them from UI. We
> should make JsonProtocol.accumValueToJson more robust and let it ignore
> internal accumulators that use unrecognized types.
>
> [API][3.1][SPARK-31486][CORE]
> spark.submit.waitAppCompletion flag to control spark-submit exit in
> Standalone Cluster Mode (+88, -26)>
> 
>
> This PR implements an application wait mechanism that allows spark-submit to
> wait until the application finishes in Standalone mode. This will delay the
> exit of spark-submit JVM until the job is completed. This implementation
> will keep monitoring the application until it is either finished, failed,
> or killed. This will be controlled via the following conf:
>
>-
>
>spark.standalone.submit.waitAppCompletion (Default: false)
>
>In standalone cluster mode, controls whether the client waits to exit
>until the application completes. If set to true, the client process
>will stay alive polling the driver's status. Otherwise, the client process
>will exit after submission.
>
>
> 
> SQL
> [3.0][SPARK-31220][SQL]
> repartition obeys initialPartitionNum when adaptiveExecutionEnabled (+27,
> -12)>
> 
>
> AQE and non-AQE use different configs to set the initial shuffle partition
> number. This PR fixes repartition/DISTRIBUTE BY so that it also uses the
> AQE config spark.sql.adaptive.coalescePartitions.initialPartitionNum to
> set the initial shuffle partition number if AQE is enabled.
>
> [3.0][SPARK-31867][SQL][FOLLOWUP]
> Check result differences for datetime formatting (+51, -8)>
> 
>
> Spark should throw SparkUpgradeException when getting DateTimeException for
> datetime formatting in the EXCEPTION legacy Time Parser Policy.
>
> [API][3.0][SPARK-31879][SPARK-31892][SQL]
> Disable week-based pattern letters in datetime parsing/formatting (+1421,
> -171)>
> 
>  (+102,
> -48)>
> 
>
> Week-based pattern letters have very weird behaviors during datetime
> parsing in Spark 2.4, and it's very hard to simulate the legacy behaviors
> with the new API. For formatting, the new API makes the start-of-week
> localized, and it's not possible to keep the legacy behaviors. Since the
> week-based fields are rarely used, we disable week-based pattern letters in
> both parsing and formatting.
>
> [3.0][SPARK-31896][SQL]
> Handle am-pm timestamp parsing when hour is missing (+39, -3)>
> 
>
> This PR sets the hour field to 0 or 12 when the AMPM_OF_DAY field is AM or
> PM during datetime parsing, to keep the behavior the same as Spark 2.4.
>
> 

Re: [PySpark] Revisiting PySpark type annotations

2020-07-21 Thread zero323
Given a discussion related to  SPARK-32320 PR
   I'd like to resurrect this
thread. Is there any interest in migrating annotations to the main
repository?



--
Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: [PySpark] Revisiting PySpark type annotations

2020-07-21 Thread zero323
Given a discussion related to  SPARK-32320 PR
   I'd like to resurrect this
thread.



--
Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



[OSS DIGEST] The major changes of Apache Spark from June 3 to June 16

2020-07-21 Thread Xingbo Jiang
Hi all,

This is the bi-weekly Apache Spark digest from the Databricks OSS team.
For each API/configuration/behavior change, an *[API] *tag is added in the
title.

CORE
[3.0][SPARK-31923][CORE]
Ignore internal accumulators that use unrecognized types rather than
crashing (+63, -5)>


A user may name his accumulators using the internal.metrics. prefix, so
that Spark treats them as internal accumulators and hides them from UI. We
should make JsonProtocol.accumValueToJson more robust and let it ignore
internal accumulators that use unrecognized types.
[API][3.1][SPARK-31486][CORE]
spark.submit.waitAppCompletion flag to control spark-submit exit in
Standalone Cluster Mode (+88, -26)>


This PR implements an application wait mechanism that allows spark-submit to
wait until the application finishes in Standalone mode. This will delay the
exit of spark-submit JVM until the job is completed. This implementation
will keep monitoring the application until it is either finished, failed,
or killed. This will be controlled via the following conf:

   -

   spark.standalone.submit.waitAppCompletion (Default: false)

   In standalone cluster mode, controls whether the client waits to exit
   until the application completes. If set to true, the client process will
   stay alive polling the driver's status. Otherwise, the client process will
   exit after submission.


SQL
[3.0][SPARK-31220][SQL]
repartition obeys initialPartitionNum when adaptiveExecutionEnabled (+27,
-12)>


AQE and non-AQE use different configs to set the initial shuffle partition
number. This PR fixes repartition/DISTRIBUTE BY so that it also uses the
AQE config spark.sql.adaptive.coalescePartitions.initialPartitionNum to set
the initial shuffle partition number if AQE is enabled.
[3.0][SPARK-31867][SQL][FOLLOWUP]
Check result differences for datetime formatting (+51, -8)>


Spark should throw SparkUpgradeException when getting DateTimeException for
datetime formatting in the EXCEPTION legacy Time Parser Policy.
[API][3.0][SPARK-31879][SPARK-31892][SQL]
Disable week-based pattern letters in datetime parsing/formatting (+1421,
-171)>

(+102,
-48)>


Week-based pattern letters have very weird behaviors during datetime
parsing in Spark 2.4, and it's very hard to simulate the legacy behaviors
with the new API. For formatting, the new API makes the start-of-week
localized, and it's not possible to keep the legacy behaviors. Since the
week-based fields are rarely used, we disable week-based pattern letters in
both parsing and formatting.
[3.0][SPARK-31896][SQL]
Handle am-pm timestamp parsing when hour is missing (+39, -3)>


This PR sets the hour field to 0 or 12 when the AMPM_OF_DAY field is AM or
PM during datetime parsing, to keep the behavior the same as Spark 2.4.
[API][3.1][SPARK-31830][SQL]
Consistent error handling for datetime formatting and parsing functions
(+126, -580)>


When parsing/formatting datetime values, it's better to fail fast if the
pattern string is invalid, instead of returning null for each input record.
The formatting 

Re: Catalog API for Partition

2020-07-21 Thread JackyLee
The `partitioning` in `TableCatalog.createTable` is a partition schema for the
table, which doesn't contain the partition metadata for an actual
partition. Besides, the actual partition metadata may contain many
partition schemas, such as Hive partitions.
Thus I created a `TablePartition` to contain the partition metadata, which
can be distinguished from Transform, and created `Partition Catalog APIs` to
manage partition metadata.

In short, a `TablePartition` contains multiple `Transform`s and the partition
metadata for an actual partition in Hive or another data source.



--
Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: Bridging gap between Spark UI and Code

2020-07-21 Thread Michal Sankot
And to be clear: yes, execution plans show exactly what it's doing. The
problem is that it's unclear how that relates to the actual Scala/Python
code.


On 7/21/20 15:45, Michal Sankot wrote:
Yes, the problem is that DAGs only refer to code line (action) that 
inovked it. It doesn't provide information about how individual 
transformations link to the code.


So you can have dozen of stages, each with the same code line which 
invoked it, doing different stuff. And then we guess what it's 
actually doing.



On 7/21/20 15:36, Russell Spitzer wrote:
Have you looked in the DAG visualization? Each block refer to the 
code line invoking it.


For Dataframes the execution plan will let you know explicitly which 
operations are in which stages.


On Tue, Jul 21, 2020, 8:18 AM Michal Sankot 
 wrote:


Hi,
when I analyze and debug our Spark batch jobs executions it's a
pain to
find out how blocks in Spark UI Jobs/SQL tab correspond to the
actual
Scala code that we write and how much time they take. Would there
be a
way to somehow instruct compiler or something and get this
information
into Spark UI?

At the moment linking Spark UI elements with our code is a guess
work
driven by adding and removing lines of code and reruning the job,
which
is tedious. A possibility to make our life easier e.g. by running
Spark
jobs in dedicated debug mode where this information would be
available
would be greatly appreciated. (Though I don't know whether it's
possible
at all).

Thanks,
Michal

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org




--


MichalSankot
BigData Engineer
E: michal.san...@voxnest.com 


Re: Bridging gap between Spark UI and Code

2020-07-21 Thread Michal Sankot
Yes, the problem is that DAGs only refer to the code line (the action) that
invoked them. They don't provide information about how individual
transformations link to the code.


So you can have a dozen stages, each tagged with the same code line that
invoked it, doing different stuff. And then we guess what it's actually
doing.



On 7/21/20 15:36, Russell Spitzer wrote:
Have you looked in the DAG visualization? Each block refer to the code 
line invoking it.


For Dataframes the execution plan will let you know explicitly which 
operations are in which stages.


On Tue, Jul 21, 2020, 8:18 AM Michal Sankot 
 wrote:


Hi,
when I analyze and debug our Spark batch jobs executions it's a
pain to
find out how blocks in Spark UI Jobs/SQL tab correspond to the actual
Scala code that we write and how much time they take. Would there
be a
way to somehow instruct compiler or something and get this
information
into Spark UI?

At the moment linking Spark UI elements with our code is a guess work
driven by adding and removing lines of code and reruning the job,
which
is tedious. A possibility to make our life easier e.g. by running
Spark
jobs in dedicated debug mode where this information would be
available
would be greatly appreciated. (Though I don't know whether it's
possible
at all).

Thanks,
Michal

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org




InterpretedUnsafeProjection - error in getElementSize

2020-07-21 Thread Janda Martin
Hi,
  I think I found an error in InterpretedUnsafeProjection::getElementSize.
This method differs from the similar implementation in GenerateUnsafeProjection.

 InterpretedUnsafeProjection::getElementSize returns the wrong size for UDTs. I
suggest using similar code from GenerateUnsafeProjection.


Test type:
new ArrayType(new CustomUDT())

CustomUDT with sqlType=StringType



  Thank you
 Martin


InterpretedUnsafeProjection implementation:

  private def getElementSize(dataType: DataType): Int = dataType match {
case NullType | StringType | BinaryType | CalendarIntervalType |
 _: DecimalType | _: StructType | _: ArrayType | _: MapType => 8
case _ => dataType.defaultSize
  }


GenerateUnsafeProjection implementation:

val et = UserDefinedType.sqlType(elementType)
...

val elementOrOffsetSize = et match {
  case t: DecimalType if t.precision <= Decimal.MAX_LONG_DIGITS => 8
  case _ if CodeGenerator.isPrimitiveType(jt) => et.defaultSize
  case _ => 8  // we need 8 bytes to store offset and length
}



PS: The following line is not necessary because DecimalType is not a primitive
type, so it should be covered by the default size of 8.

  case t: DecimalType if t.precision <= Decimal.MAX_LONG_DIGITS => 8

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: Bridging gap between Spark UI and Code

2020-07-21 Thread Russell Spitzer
Have you looked at the DAG visualization? Each block refers to the code line
invoking it.

For DataFrames, the execution plan will let you know explicitly which
operations are in which stages.
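
As a rough sketch of a partial workaround in the meantime (PySpark shown; the
Scala SparkContext has the same method), sc.setJobDescription labels the jobs
an action triggers so they are easier to tie back to the code, though it does
not label individual transformations:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("ui-labels").getOrCreate()
    sc = spark.sparkContext

    # Label the jobs triggered by this section so they are easier to find in
    # the Jobs tab.
    sc.setJobDescription("build events (events.py: build_events)")
    events = spark.range(1000).withColumnRenamed("id", "event_id").dropDuplicates()
    events.count()  # this job carries the description above

    sc.setJobDescription("daily aggregation (events.py: aggregate_daily)")
    events.groupBy((events.event_id % 7).alias("bucket")).count().collect()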

On Tue, Jul 21, 2020, 8:18 AM Michal Sankot
 wrote:

> Hi,
> when I analyze and debug our Spark batch jobs executions it's a pain to
> find out how blocks in Spark UI Jobs/SQL tab correspond to the actual
> Scala code that we write and how much time they take. Would there be a
> way to somehow instruct compiler or something and get this information
> into Spark UI?
>
> At the moment linking Spark UI elements with our code is a guess work
> driven by adding and removing lines of code and reruning the job, which
> is tedious. A possibility to make our life easier e.g. by running Spark
> jobs in dedicated debug mode where this information would be available
> would be greatly appreciated. (Though I don't know whether it's possible
> at all).
>
> Thanks,
> Michal
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


Bridging gap between Spark UI and Code

2020-07-21 Thread Michal Sankot

Hi,
when I analyze and debug our Spark batch job executions, it's a pain to
find out how blocks in the Spark UI Jobs/SQL tabs correspond to the actual
Scala code that we write and how much time they take. Would there be a
way to somehow instruct the compiler, or something similar, to get this
information into the Spark UI?


At the moment, linking Spark UI elements with our code is guesswork
driven by adding and removing lines of code and rerunning the job, which
is tedious. A possibility to make our lives easier, e.g. by running Spark
jobs in a dedicated debug mode where this information would be available,
would be greatly appreciated. (Though I don't know whether it's possible
at all.)


Thanks,
Michal

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: Use Hadoop-3.2 as a default Hadoop profile in 3.0.0?

2020-07-21 Thread Steve Loughran
On Sun, 12 Jul 2020 at 01:45, gpongracz  wrote:

> As someone who mainly operates in AWS it would be very welcome to have the
> option to use an updated version of hadoop using pyspark sourced from pypi.
>
> Acknowledging the issues of backwards compatability...
>
> The most vexing issue is the lack of ability to use s3a STS, ie
> org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider.
>
> This prevents the use of AWS temporary credentials, hampering local
> development against s3.
>
I'd personally worry about other issues related to performance, security,
Joda Time and Java 8, etc. Hadoop 2.7.x is EOL and doesn't get security
fixes any more.

If you do want that temporary credentials provider, you can stick a copy of
the class on your classpath and just list it in the
option fs.s3a.aws.credentials.provider.
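
A rough sketch of wiring that up (PySpark shown; equivalent --conf flags on
spark-submit work the same, and the credential and bucket values are
placeholders):

    from pyspark.sql import SparkSession

    # Assumes a Hadoop build that ships TemporaryAWSCredentialsProvider (>= 2.8)
    # or the class copied onto the classpath as described above.
    spark = (
        SparkSession.builder
        .appName("s3a-temporary-credentials")
        # The spark.hadoop. prefix passes these through to the Hadoop configuration.
        .config("spark.hadoop.fs.s3a.aws.credentials.provider",
                "org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider")
        .config("spark.hadoop.fs.s3a.access.key", "ASIA...")
        .config("spark.hadoop.fs.s3a.secret.key", "...")
        .config("spark.hadoop.fs.s3a.session.token", "...")
        .getOrCreate()
    )

    df = spark.read.parquet("s3a://some-bucket/some/prefix/")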


> Whilst this would be solved by bumping the hadoop version to anything >=
> 2.8.x, the 3.x option would also allow for the writing of data using KMS.
>
> Regards,
>
> George Pongracz
>
>
>
> --
> Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


Re: java.lang.ClassNotFoundException for s3a comitter

2020-07-21 Thread Steve Loughran
On Tue, 7 Jul 2020 at 03:42, Stephen Coy 
wrote:

> Hi Steve,
>
> While I understand your point regarding the mixing of Hadoop jars, this
> does not address the java.lang.ClassNotFoundException.
>
> Prebuilt Apache Spark 3.0 builds are only available for Hadoop 2.7 or
> Hadoop 3.2. Not Hadoop 3.1.
>

sorry, I should have been clearer. Hadoop 3.2.x has everything you need.



>
> The only place that I have found that missing class is in the Spark
> “hadoop-cloud” source module, and currently the only way to get the jar
> containing it is to build it yourself. If any of the devs are listening it
>  would be nice if this was included in the standard distribution. It has a
> sizeable chunk of a repackaged Jetty embedded in it which I find a bit odd.
>
> But I am relatively new to this stuff so I could be wrong.
>
> I am currently running Spark 3.0 clusters with no HDFS. Spark is set up
> like:
>
> hadoopConfiguration.set("spark.hadoop.fs.s3a.committer.name",
> "directory");
> hadoopConfiguration.set("spark.sql.sources.commitProtocolClass",
> "org.apache.spark.internal.io.cloud.PathOutputCommitProtocol");
> hadoopConfiguration.set("spark.sql.parquet.output.committer.class",
> "org.apache.spark.internal.io.cloud.BindingParquetOutputCommitter");
> hadoopConfiguration.set("fs.s3a.connection.maximum",
> Integer.toString(coreCount * 2));
>
> Querying and updating s3a data sources seems to be working ok.
>
> Thanks,
>
> Steve C
>
> On 29 Jun 2020, at 10:34 pm, Steve Loughran 
> wrote:
>
> you are going to need hadoop-3.1 on your classpath, with hadoop-aws and
> the same aws-sdk it was built with (1.11.something). Mixing hadoop JARs is
> doomed. using a different aws sdk jar is a bit risky, though more recent
> upgrades have all be fairly low stress
>
> On Fri, 19 Jun 2020 at 05:39, murat migdisoglu 
> wrote:
>
>> Hi all
>> I've upgraded my test cluster to spark 3 and change my comitter to
>> directory and I still get this error.. The documentations are somehow
>> obscure on that.
>> Do I need to add a third party jar to support new comitters?
>>
>> java.lang.ClassNotFoundException:
>> org.apache.spark.internal.io.cloud.PathOutputCommitProtocol
>>
>>
>> On Thu, Jun 18, 2020 at 1:35 AM murat migdisoglu <
>> murat.migdiso...@gmail.com> wrote:
>>
>>> Hello all,
>>> we have a hadoop cluster (using yarn) using  s3 as filesystem with
>>> s3guard is enabled.
>>> We are using hadoop 3.2.1 with spark 2.4.5.
>>>
>>> When I try to save a dataframe in parquet format, I get the following
>>> exception:
>>> java.lang.ClassNotFoundException:
>>> com.hortonworks.spark.cloud.commit.PathOutputCommitProtocol
>>>
>>> My relevant spark configurations are as following:
>>>
>>> "hadoop.mapreduce.outputcommitter.factory.scheme.s3a":"org.apache.hadoop.fs.s3a.commit.S3ACommitterFactory",
>>> "fs.s3a.committer.name
>>> ":
>>> "magic",
>>> "fs.s3a.committer.magic.enabled": true,
>>> "fs.s3a.impl": "org.apache.hadoop.fs.s3a.S3AFileSystem",
>>>
>>> While spark streaming fails with the exception above, apache beam
>>> succeeds writing parquet files.
>>> What might be the problem?
>>>
>>> Thanks in advance
>>>
>>>
>>> --
>>> "Talkers aren’t good doers. Rest assured that we’re going there to use
>>> our hands, not our tongues."
>>> W. Shakespeare
>>>
>>
>>
>> --
>> "Talkers aren’t good doers. Rest assured that we’re going there to use
>> our hands, not our tongues."
>> W. Shakespeare
>>
>
>
> 
> This email contains confidential information of and is the copyright of
> Infomedia. It must not be forwarded, amended or disclosed without consent
> of the sender. If you received this message by mistake, please advise the
> sender and delete all copies. Security of transmission on the internet
> cannot be guaranteed, could be infected, intercepted, or corrupted and you
> should ensure you have suitable antivirus protection in place. By sending
> us your or any third party personal details, you consent to (or confirm you
> have obtained consent from such third parties) to Infomedia’s privacy
> policy. http://www.infomedia.com.au/privacy-policy/
>