Re: [FYI] SPARK-47993: Drop Python 3.8

2024-04-25 Thread Maciej

+1

Best regards,
Maciej Szymkiewicz

Web: https://zero323.net
PGP: A30CEF0C31A501EC

On 4/25/24 6:21 PM, Reynold Xin wrote:

+1

On Thu, Apr 25, 2024 at 9:01 AM Santosh Pingale 
 wrote:


+1

On Thu, Apr 25, 2024, 5:41 PM Dongjoon Hyun
 wrote:

FYI, there is a proposal to drop Python 3.8 because its EOL is
October 2024.

https://github.com/apache/spark/pull/46228
[SPARK-47993][PYTHON] Drop Python 3.8

Since it's still alive and there will be an overlap between
the lifecycle of Python 3.8 and Apache Spark 4.0.0, please
give us your feedback on the PR, if you have any concerns.

From my side, I agree with this decision.

Thanks,
Dongjoon.


Re: [VOTE] SPARK-44444: Use ANSI SQL mode by default

2024-04-15 Thread Maciej

+1

Best regards,
Maciej Szymkiewicz

Web: https://zero323.net
PGP: A30CEF0C31A501EC

On 4/15/24 8:16 PM, Rui Wang wrote:

+1, non-binding.

Thanks Dongjoon for driving this!


-Rui

On Mon, Apr 15, 2024 at 10:10 AM Xinrong Meng  wrote:

+1

Thank you @Dongjoon Hyun!

On Mon, Apr 15, 2024 at 6:33 AM beliefer  wrote:

+1


On 2024-04-15 15:54:07, Peter Toth wrote:

+1

Wenchen Fan wrote (on Mon, Apr 15, 2024, 9:08):

+1

On Sun, Apr 14, 2024 at 6:28 AM Dongjoon Hyun
 wrote:

I'll start from my +1.

Dongjoon.

On 2024/04/13 22:22:05 Dongjoon Hyun wrote:
> Please vote on SPARK-44444 to use ANSI SQL mode by default.
> The technical scope is defined in the following PR which is
> one line of code change and one line of migration guide.
>
> - DISCUSSION:
> https://lists.apache.org/thread/ztlwoz1v1sn81ssks12tb19x37zozxlz
> - JIRA: https://issues.apache.org/jira/browse/SPARK-44444
> - PR: https://github.com/apache/spark/pull/46013
>
> The vote is open until April 17th 1AM (PST) and passes
> if a majority +1 PMC votes are cast, with a minimum of 3 +1 votes.
>
> [ ] +1 Use ANSI SQL mode by default
> [ ] -1 Do not use ANSI SQL mode by default because ...
>
> Thank you in advance.
>
> Dongjoon
>
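For reference, a minimal PySpark sketch of the existing `spark.sql.ansi.enabled` flag whose default the vote would flip. This is illustrative only (it assumes a local PySpark installation) and is not part of the PR itself:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("ansi-mode-demo").getOrCreate()

    # The vote is only about the default value of this existing configuration.
    spark.conf.set("spark.sql.ansi.enabled", "true")
    # Under ANSI mode, an invalid cast raises an error instead of returning NULL:
    # spark.sql("SELECT CAST('abc' AS INT)").show()  # raises CAST_INVALID_INPUT

    spark.conf.set("spark.sql.ansi.enabled", "false")
    # With ANSI mode off (the legacy behavior), the same cast yields NULL.
    spark.sql("SELECT CAST('abc' AS INT)").show()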




Re: [VOTE] Updating documentation hosted for EOL and maintenance releases

2023-09-26 Thread Maciej

+1

Best regards,
Maciej Szymkiewicz

Web: https://zero323.net
PGP: A30CEF0C31A501EC

On 9/26/23 17:12, Michel Miotto Barbosa wrote:

+1

A disposição | At your disposal

Michel Miotto Barbosa
https://www.linkedin.com/in/michelmiottobarbosa/
mmiottobarb...@gmail.com
+55 11 984 342 347




On Tue, Sep 26, 2023 at 11:44 AM Herman van Hovell 
 wrote:


+1

On Tue, Sep 26, 2023 at 10:39 AM yangjie01
 wrote:

+1

From: Yikun Jiang
Date: Tuesday, September 26, 2023, 18:06
To: dev
Cc: Hyukjin Kwon, Ruifeng Zheng
Subject: Re: [VOTE] Updating documentation hosted for EOL and maintenance releases

+1, I believe it is a wise choice to update the EOL policy of
the document based on the real demands of community users.


Regards,

Yikun

On Tue, Sep 26, 2023 at 1:06 PM Ruifeng Zheng
 wrote:

+1

On Tue, Sep 26, 2023 at 12:51 PM Hyukjin Kwon
 wrote:

Hi all,

I would like to start the vote for updating
documentation hosted for EOL and maintenance releases
to improve the usability here, and in order for end
users to read the proper and correct documentation.


For discussion thread, please refer to
https://lists.apache.org/thread/1675rzxx5x4j2x03t9x0kfph8tlys0cx.


Here is one example:
- https://github.com/apache/spark/pull/42989


- https://github.com/apache/spark-website/pull/480


Starting with my own +1.





Re: LLM script for error message improvement

2023-08-04 Thread Maciej
Besides, in case a separate discussion doesn't happen, our core 
responsibility is to follow the ASF guidelines, including the ASF 
Generative Tooling Guidance 
(https://www.apache.org/legal/generative-tooling.html).


As far as I understand it, both the first (which explicitly mentions 
ChatGPT) and the third acceptance conditions are not satisfied by this 
and the other mentioned PR.


On a side note, we should probably take a closer look at the following

'When providing contributions authored using generative AI tooling, a 
recommended practice is for contributors to indicate the tooling used to 
create the contribution. This should be included as a token in the 
source control commit message, for example including the phrase 
“Generated-by: ”.'


and consider adjusting PR template / merge tool accordingly.

Best regards,
Maciej Szymkiewicz

Web: https://zero323.net
PGP: A30CEF0C31A501EC

On 8/3/23 22:14, Maciej wrote:
I am sitting on the fence about that. In the linked PR Xiao wrote the
following:

> We published the error guideline a few years ago, but not all
> contributors adhered to it, resulting in variable quality in error
> messages.
If a policy exists but is not enforced (if that's indeed the case, I 
didn't go through the source to confirm that) it might be useful to 
learn the reasons why it happens. Normally, I'd expect
-Policy is too complex to enforce. In such case, additional tooling 
can be useful.
-Policy is not well known, and the people responsible for introducing 
it are not committed to enforcing it.
-Policy or some of its components don't really reflect community 
values and expectations.
If the problem of suspected violations was never raised on our 
standard communication channel, and as far as I can tell, it has not, 
then introducing a new tool to enforce the policy seems a bit premature.
If these were the only considerations, I'd say that improving the 
overall consistency of the project outweighs possible risks, even if 
the case for such might be poorly supported.
However, there is an elephant in the room. It is another attempt, 
after SPARK-44546, to embed generative tools directly within the Spark 
dev workflow. By principle, I am not against such tools. In fact, it 
is pretty clear that they are already used by Spark committers, and 
even if we wanted to, there is little we can do to prevent that. In 
such cases, decisions which tools, if any, to use, to what extent and 
how to treat their output are the sole responsibility of contributors.
In contrast, these proposals try to push a proprietary tool burdened 
with serious privacy and ethical issues and likely to introduce 
unclear liabilities as a standard or even required developer tool.
I can't speak for others, but personally, I'm quite uneasy about it. 
If we go this way, I strongly believe that it should be preceded by a 
serious discussion, if not the development of a formal policy, about 
what categories of tools, to what capacity, to what extent are 
acceptable within the project. Ideally, with an official opinion from 
the ASF as the copyright owner.

WDYT All? Shall we start a separate discussion?
Best regards,
Maciej Szymkiewicz

Web: https://zero323.net
PGP: A30CEF0C31A501EC
On 8/3/23 18:33, Haejoon Lee wrote:


Additional information:

Please check https://issues.apache.org/jira/browse/SPARK-37935 if you
want to start contributing to improving error messages.


You can create sub-tasks if you believe there are error messages that 
need improvement, in addition to the tasks listed in the umbrella JIRA.


You can also refer to https://github.com/apache/spark/pull/41504, 
https://github.com/apache/spark/pull/41455 as an example PR.



On Thu, Aug 3, 2023 at 1:10 PM Ruifeng Zheng  wrote:

+1 from my side, I'm fine to have it as a helper script

On Thu, Aug 3, 2023 at 10:53 AM Hyukjin Kwon
 wrote:

I think adding that dev tool script to improve the error
message is fine.

On Thu, 3 Aug 2023 at 10:24, Haejoon Lee
 wrote:

Dear contributors, I hope you are doing well!

I see there are contributors who are interested in
working on error message improvements and contributing
persistently, so I want to share an LLM-based error
message improvement script to help with your contributions.

You can find details about the script at
https://github.com/apache/spark/pull/41711. I believe
it can help your error message improvement work, so I
encourage you to take a look at the pull request and
leverage the script.

Please let me know if you have any questions or concerns.

Thanks all for your time and contributions!

Best regards,

Haejoon





Re: LLM script for error message improvement

2023-08-03 Thread Maciej
I am sitting on the fence about that. In the linked PR Xiao wrote the
following:

> We published the error guideline a few years ago, but not all
> contributors adhered to it, resulting in variable quality in error messages.
If a policy exists but is not enforced (if that's indeed the case, I 
didn't go through the source to confirm that) it might be useful to 
learn the reasons why it happens. Normally, I'd expect:
- Policy is too complex to enforce. In such a case, additional tooling can
be useful.
- Policy is not well known, and the people responsible for introducing it
are not committed to enforcing it.
- Policy or some of its components don't really reflect community values
and expectations.
If the problem of suspected violations was never raised on our standard 
communication channel, and as far as I can tell, it has not, then 
introducing a new tool to enforce the policy seems a bit premature.
If these were the only considerations, I'd say that improving the 
overall consistency of the project outweighs possible risks, even if the 
case for such might be poorly supported.
However, there is an elephant in the room. It is another attempt, after 
SPARK-44546, to embed generative tools directly within the Spark dev 
workflow. By principle, I am not against such tools. In fact, it is 
pretty clear that they are already used by Spark committers, and even if 
we wanted to, there is little we can do to prevent that. In such cases, 
decisions which tools, if any, to use, to what extent and how to treat 
their output are the sole responsibility of contributors.
In contrast, these proposals try to push a proprietary tool burdened 
with serious privacy and ethical issues and likely to introduce unclear 
liabilities as a standard or even required developer tool.
I can't speak for others, but personally, I'm quite uneasy about it. If 
we go this way, I strongly believe that it should be preceded by a 
serious discussion, if not the development of a formal policy, about 
what categories of tools, to what capacity, to what extent are 
acceptable within the project. Ideally, with an official opinion from 
the ASF as the copyright owner.

WDYT All? Shall we start a separate discussion?

Best regards,
Maciej Szymkiewicz

Web: https://zero323.net
PGP: A30CEF0C31A501EC

On 8/3/23 18:33, Haejoon Lee wrote:


Additional information:

Please check https://issues.apache.org/jira/browse/SPARK-37935 if you
want to start contributing to improving error messages.


You can create sub-tasks if you believe there are error messages that 
need improvement, in addition to the tasks listed in the umbrella JIRA.


You can also refer to https://github.com/apache/spark/pull/41504, 
https://github.com/apache/spark/pull/41455 as an example PR.



On Thu, Aug 3, 2023 at 1:10 PM Ruifeng Zheng  wrote:

+1 from my side, I'm fine to have it as a helper script

On Thu, Aug 3, 2023 at 10:53 AM Hyukjin Kwon
 wrote:

I think adding that dev tool script to improve the error
message is fine.

On Thu, 3 Aug 2023 at 10:24, Haejoon Lee
 wrote:

Dear contributors, I hope you are doing well!

I see there are contributors who are interested in working
on error message improvements and contributing persistently,
so I want to share an LLM-based error message improvement
script to help with your contributions.

You can find details about the script at
https://github.com/apache/spark/pull/41711. I believe it
can help your error message improvement work, so I
encourage you to take a look at the pull request and
leverage the script.

Please let me know if you have any questions or concerns.

Thanks all for your time and contributions!

Best regards,

Haejoon





Re: [VOTE] SPIP: XML data source support

2023-07-29 Thread Maciej

+1

Best regards,
Maciej Szymkiewicz

Web: https://zero323.net
PGP: A30CEF0C31A501EC

On 7/29/23 11:28, Mich Talebzadeh wrote:

+1 for me.

Though Databricks did a good job releasing the code.

GitHub - databricks/spark-xml: XML data source for Spark SQL and 
DataFrames <https://github.com/databricks/spark-xml>



Mich Talebzadeh,
Solutions Architect/Engineering Lead
Palantir Technologies Limited
London
United Kingdom


View my LinkedIn profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>



https://en.everybodywiki.com/Mich_Talebzadeh

*Disclaimer:* Use it at your own risk. Any and all responsibility for 
any loss, damage or destruction of data or any other property which 
may arise from relying on this email's technical content is explicitly 
disclaimed. The author will in no case be liable for any monetary 
damages arising from such loss, damage or destruction.




On Sat, 29 Jul 2023 at 06:34, Jia Fan  wrote:


+ 1



On Jul 29, 2023, at 13:06, Adrian Pop-Tifrea wrote:

+1, the more data source formats, the better, and if the solution
is already thoroughly tested, I say we should go for it.

On Sat, Jul 29, 2023, 06:35 Xiao Li  wrote:

+1

On Fri, Jul 28, 2023 at 15:54 Sean Owen  wrote:

+1 I think that porting the package 'as is' into Spark is
probably worthwhile.
That's relatively easy; the code is already pretty
battle-tested and not that big and even originally came
from Spark code, so is more or less similar already.

One thing it never got was DSv2 support, which means XML
reading would still be somewhat behind other formats. (I
was not able to implement it.)
This isn't a necessary goal right now, but would be
possibly part of the logic of moving it into the Spark
code base.

On Fri, Jul 28, 2023 at 5:38 PM Sandip Agarwala
 wrote:

Dear Spark community,

I would like to start the vote for "SPIP: XML data
source support".

XML is a widely used data format. An external
spark-xml package
(https://github.com/databricks/spark-xml) is
available to read and write XML data in spark. Making
spark-xml built-in will provide a better user
experience for Spark SQL and structured streaming.
The proposal is to inline code from the spark-xml
package.

SPIP link:

https://docs.google.com/document/d/1ZaOBT4-YFtN58UCx2cdFhlsKbie1ugAn-Fgz_Dddz-Q/edit?usp=sharing

JIRA:
https://issues.apache.org/jira/browse/SPARK-44265

Discussion Thread:
https://lists.apache.org/thread/q32hxgsp738wom03mgpg9ykj9nr2n1fh

Please vote on the SPIP for the next 72 hours:
[ ] +1: Accept the proposal as an official SPIP
[ ] +0
[ ] -1: I don’t think this is a good idea because __.

Thanks, Sandip







Re: [DISCUSS] SPIP: XML data source support

2023-07-19 Thread Maciej
That's a great idea, as long as we can keep additional dependencies 
under control.


Best regards,
Maciej Szymkiewicz

Web: https://zero323.net
PGP: A30CEF0C31A501EC

On 7/19/23 18:22, Franco Patano wrote:

+1

Many people have struggled with incorporating this separate library 
into their Spark pipelines.


On Wed, Jul 19, 2023 at 10:53 AM Burak Yavuz  wrote:

+1 on adding to Spark. Community involvement will make the XML
reader better.

Best,
Burak

On Wed, Jul 19, 2023 at 3:25 AM Martin Andersson
 wrote:

Alright, makes sense to add it then.

*From:* Hyukjin Kwon 
*Sent:* Wednesday, July 19, 2023 11:01
*To:* Martin Andersson 
*Cc:* Sandip Agarwala ;
dev@spark.apache.org 
*Subject:* Re: [DISCUSS] SPIP: XML data source support
EXTERNAL SENDER. Do not click links or open attachments unless
you recognize the sender and know the content is safe. DO NOT
provide your username or password.

Here are the benefits of having it as a built-in source:

  * We can leverage the community to improve the Spark XML
(not within Databricks repositories).
  * We can share the same core for XML expressions (e.g.,
from_xml and to_xml like from_csv, from_json, etc.).
  * It is more to embrace the commonly used datasource, just
like the existing builtin data sources we have.
  * Users wouldn't have to set the jars or maven coordinates,
e.g., for now, if they have network problems, etc., it
would be harder to use them by default.

XML is arguably more used than CSV, which is already our
built-in source; see e.g.,
https://insights.stackoverflow.com/trends?tags=xml%2Cjson%2Ccsv
and

https://www.reddit.com/r/programming/comments/bak5qt/a_comparison_of_serialization_formats_csv_json/
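For context, a rough sketch of the setup the external package requires today, which is exactly the friction a built-in source removes. The package version and the input path below are illustrative placeholders:

    from pyspark.sql import SparkSession

    # Today the connector has to be pulled in explicitly, e.g.:
    #   pyspark --packages com.databricks:spark-xml_2.12:0.17.0
    spark = SparkSession.builder.appName("spark-xml-demo").getOrCreate()

    # XML is then read through the external "xml" format:
    df = (
        spark.read.format("xml")
        .option("rowTag", "book")        # XML element treated as one row
        .load("/path/to/books.xml")      # placeholder path
    )
    df.printSchema()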


On Wed, 19 Jul 2023 at 17:51, Martin Andersson
 wrote:

How much of an effort is it to use the spark-xml library
today? What's the drawback to keeping this as an external
library as-is?

Best Regards, Martin


*From:* Hyukjin Kwon 
*Sent:* Wednesday, July 19, 2023 01:27
*To:* Sandip Agarwala 
*Cc:* dev@spark.apache.org 
*Subject:* Re: [DISCUSS] SPIP: XML data source support
EXTERNAL SENDER. Do not click links or open attachments
unless you recognize the sender and know the content is
safe. DO NOT provide your username or password.

Yeah I support this. XML is a pretty outdated format TBH but
still used in many legacy systems. For example, the Wikipedia
dump is one case.

Even when you take a look at the stats for CSV vs XML vs JSON,
some show that XML is more used than CSV.

On Wed, Jul 19, 2023 at 12:58 AM Sandip Agarwala
 wrote:

Dear Spark community,

I would like to start a discussion on "XML data source
support".

XML is a widely used data format. An external
spark-xml package
(https://github.com/databricks/spark-xml) is available
to read and write XML data in spark. Making spark-xml
built-in will provide a better user experience for
Spark SQL and structured streaming. The proposal is to
inline code from the spark-xml package.
I am collaborating with Hyukjin Kwon, who is the
original author of spark-xml, for this effort.

SPIP link:

https://docs.google.com/document/d/1ZaOBT4-YFtN58UCx2cdFhlsKbie1ugAn-Fgz_Dddz-Q/edit?usp=sharing

JIRA:
https://issues.apache.org/jira/browse/SPARK-44265

Looking forward to your feedback.
Thanks, Sandip





Re: [VOTE][SPIP] Python Data Source API

2023-07-06 Thread Maciej

+0

Best regards,
Maciej Szymkiewicz

Web: https://zero323.net
PGP: A30CEF0C31A501EC

On 7/6/23 17:41, Xiao Li wrote:

+1

Xiao

Hyukjin Kwon wrote on Wednesday, July 5, 2023 at 17:28:

+1.

See https://youtu.be/yj7XlTB1Jvc?t=604 :-).

On Thu, 6 Jul 2023 at 09:15, Allison Wang
 wrote:

Hi all,

I'd like to start the vote for SPIP: Python Data Source API.

The high-level summary for the SPIP is that it aims to
introduce a simple API in Python for Data Sources. The idea is
to enable Python developers to create data sources without
learning Scala or dealing with the complexities of the current
data source APIs. This would make Spark more accessible to the
wider Python developer community.

References:

  * SPIP doc

<https://docs.google.com/document/d/1oYrCKEKHzznljYfJO4kx5K_Npcgt1Slyfph3NEk7JRU/edit?usp=sharing>
  * JIRA ticket
<https://issues.apache.org/jira/browse/SPARK-44076>
  * Discussion thread
<https://lists.apache.org/thread/w621zn14ho4rw61b0s139klnqh900s8y>


Please vote on the SPIP for the next 72 hours:

[ ] +1: Accept the proposal as an official SPIP
[ ] +0
[ ] -1: I don’t think this is a good idea because __.

Thanks,
Allison





Re: [DISCUSS] SPIP: Python Data Source API

2023-06-25 Thread Maciej

Thanks for your feedback Martin.

However, if the primary intended purpose of this API is to provide an 
interface for endpoint querying, then I find this proposal even less 
convincing.


Neither the Spark execution model nor the data source API (full or 
restricted as proposed here) are a good fit for handling problems 
arising from massive endpoint requests, including, but not limited to, 
handling quotas and rate limiting.


Consistency and streamlined development are, of course, valuable. 
Nonetheless, they are not sufficient, especially if they cannot deliver 
the expected user experience in terms of reliability and execution cost.


Best regards,
Maciej Szymkiewicz

Web: https://zero323.net
PGP: A30CEF0C31A501EC

On 6/24/23 23:42, Martin Grund wrote:

Hey,

I would like to express my strong support for Python Data Sources even 
though they might not be immediately as powerful as Scala-based data 
sources. One element that is easily lost in this discussion is how 
much faster the iteration speed is with Python compared to Scala. Due 
to the dynamic nature of Python, you can design and build a data 
source while running in a notebook and continuously change the code 
until it works as you want. This behavior is unparalleled!


There exists a litany of Python libraries connecting to all kinds of 
different endpoints that could provide data that is usable with Spark. 
I personally can imagine implementing a data source on top of the AWS 
SDK to extract EC2 instance information. Now I don't have to switch 
tools and can keep my pipeline consistent.


Let's say you want to query an API in parallel from Spark using
Python: today's way would be to create a Python RDD, implement the
planning and execution process manually, and finally call `toDF` at
the end. While the actual code of the DS and the RDD-based
implementation would be very similar, the abstraction that is provided
by the DS is much more powerful and future-proof. Performing dynamic
partition elimination and filter push-down can all be implemented at
a later point in time.
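As a rough sketch of the RDD-then-`toDF` pattern described above (the endpoint call is mocked and `fetch_page` is a made-up name):

    from pyspark.sql import Row, SparkSession

    spark = SparkSession.builder.appName("api-ingest-sketch").getOrCreate()

    def fetch_page(page):
        # Placeholder for a real HTTP call (e.g. with `requests`);
        # returns the rows for one page of results.
        return [Row(page=page, value=f"record-{page}-{i}") for i in range(3)]

    # Manual "planning": one element per API page, one partition each.
    pages = spark.sparkContext.parallelize(range(10), numSlices=10)

    # Manual "execution": each task fetches its page; finally convert to a DataFrame.
    df = pages.flatMap(fetch_page).toDF()
    df.show()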


Comparing a DS to using batch calling from a UDF is not great because
the execution pattern would be very brittle. Imagine something like
`spark.range(10).withColumn("data", fetch_api).explode(col("data")).collect()`.
Here you're encoding partitioning logic and data transformation in simple
ways, but you can't reason about the structural integrity of the query, and
tiny changes in the UDF interface might already cause a lot of downstream
issues.
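And a runnable sketch of the UDF-based pattern being compared against (again with a mocked `fetch_api`), which shows how the partitioning choice and the transformation get baked into the query itself:

    from pyspark.sql import SparkSession, functions as F, types as T

    spark = SparkSession.builder.appName("udf-batch-sketch").getOrCreate()

    @F.udf(returnType=T.ArrayType(T.StringType()))
    def fetch_api(i):
        # Placeholder for a real endpoint call.
        return [f"record-{i}-{j}" for j in range(3)]

    rows = (
        spark.range(10)                           # partitioning logic encoded here
        .withColumn("data", fetch_api("id"))      # transformation hidden in the UDF
        .select(F.explode("data").alias("record"))
        .collect()
    )
    print(len(rows))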



Martin


On Sat, Jun 24, 2023 at 1:44 AM Maciej  wrote:

With such limited scope (both language availability and features)
do we have any representative examples of sources that could
significantly benefit from providing this API, compared to other
available options, such as batch imports, direct queries from
vectorized UDFs, or even interfacing sources through 3rd-party FDWs?

    Best regards,
Maciej Szymkiewicz

Web: https://zero323.net
PGP: A30CEF0C31A501EC

On 6/20/23 16:23, Wenchen Fan wrote:

In an ideal world, every data source you want to connect to
already has a Spark data source implementation (either v1 or v2),
then this Python API is useless. But I feel it's common that
people want to do quick data exploration, and the target data
system is not popular enough to have an existing Spark data
source implementation. It will be useful if people can quickly
implement a Spark data source using their favorite Python language.

I'm +1 to this proposal, assuming that we will keep it simple and
won't copy all the complicated features we built in DS v2 to this
new Python API.

On Tue, Jun 20, 2023 at 2:11 PM Maciej 
wrote:

Similarly to Jacek, I feel it fails to document an actual
community need for such a feature.

Currently, any data source implementation has the potential
to benefit Spark users across all supported and third-party
clients.  For generally available sources, this is
advantageous for the whole Spark community and avoids
creating 1st and 2nd-tier citizens. This is even more
important with new officially supported languages being added
through connect.

Instead, we might rather document in detail the process of
implementing a new source using current APIs and work towards
easily extensible or customizable sources, in case there is
such a need.

-- 
Best regards,

Maciej Szymkiewicz

Web: https://zero323.net
PGP: A30CEF0C31A501EC


On 6/20/23 05:19, Hyukjin Kwon wrote:

Actually I support this idea in a way that Python developers
don't have to learn Scala to write their own source (and
separate packaging).
This is more crucial especially when you want to write a
simple data source that interacts with 

Re: [DISCUSS] SPIP: Python Data Source API

2023-06-24 Thread Maciej
With such limited scope (both language availability and features) do we
have any representative examples of sources that could significantly
benefit from providing this API, compared to other available options, such
as batch imports, direct queries from vectorized UDFs, or even
interfacing sources through 3rd-party FDWs?


Best regards,
Maciej Szymkiewicz

Web: https://zero323.net
PGP: A30CEF0C31A501EC

On 6/20/23 16:23, Wenchen Fan wrote:
In an ideal world, every data source you want to connect to already 
has a Spark data source implementation (either v1 or v2), then this 
Python API is useless. But I feel it's common that people want to do 
quick data exploration, and the target data system is not popular 
enough to have an existing Spark data source implementation. It will 
be useful if people can quickly implement a Spark data source using 
their favorite Python language.


I'm +1 to this proposal, assuming that we will keep it simple and 
won't copy all the complicated features we built in DS v2 to this new 
Python API.


On Tue, Jun 20, 2023 at 2:11 PM Maciej  wrote:

Similarly to Jacek, I feel it fails to document an actual
community need for such a feature.

Currently, any data source implementation has the potential to
benefit Spark users across all supported and third-party clients. 
For generally available sources, this is advantageous for the
whole Spark community and avoids creating 1st and 2nd-tier
citizens. This is even more important with new officially
supported languages being added through connect.

Instead, we might rather document in detail the process of
implementing a new source using current APIs and work towards
easily extensible or customizable sources, in case there is such a
need.

-- 
Best regards,

Maciej Szymkiewicz

Web: https://zero323.net
PGP: A30CEF0C31A501EC


On 6/20/23 05:19, Hyukjin Kwon wrote:

Actually I support this idea in a way that Python developers
don't have to learn Scala to write their own source (and separate
packaging).
This is more crucial especially when you want to write a simple
data source that interacts with the Python ecosystem.

On Tue, 20 Jun 2023 at 03:08, Denny Lee 
wrote:

Slightly biased, but per my conversations - this would be
awesome to have!

On Mon, Jun 19, 2023 at 09:43 Abdeali Kothari
 wrote:

I would definitely use it - is it's available :)

On Mon, 19 Jun 2023, 21:56 Jacek Laskowski,
 wrote:

Hi Allison and devs,

Although I was against this idea at first sight
(probably because I'm a Scala dev), I think it
could work as long as there are people who'd be
interested in such an API. Were there any? I'm just
curious. I've seen no emails requesting it.

I also doubt that Python devs would like to work on
new data sources but support their wishes
wholeheartedly :)

Pozdrawiam,
Jacek Laskowski

"The Internals Of" Online Books
<https://books.japila.pl/>
Follow me on https://twitter.com/jaceklaskowski



On Fri, Jun 16, 2023 at 6:14 AM Allison Wang wrote:

Hi everyone,

I would like to start a discussion on “Python
Data Source API”.

This proposal aims to introduce a simple API in
Python for Data Sources. The idea is to enable
Python developers to create data sources without
having to learn Scala or deal with the
complexities of the current data source APIs. The
goal is to make a Python-based API that is simple
and easy to use, thus making Spark more
accessible to the wider Python developer
community. This proposed approach is based on the
recently introduced Python user-defined table
functions with extensions to support data sources.

*SPIP Doc*:

https://docs.google.com/document/d/1oYrCKEKHzznljYfJO4kx5K_Npcgt1Slyfph3NEk7JRU/edit?usp=sharing


*SPIP JIRA*:
https://issues.apache.org/jira/browse/SPARK-44076

Looking forward to your feedback.

Thanks,
Allison








Re: [VOTE][SPIP] PySpark Test Framework

2023-06-21 Thread Maciej

+1

--
Best regards,
Maciej Szymkiewicz

Web: https://zero323.net
PGP: A30CEF0C31A501EC


On 6/21/23 17:35, Holden Karau wrote:
A small request, it’s pride weekend in San Francisco where some of the 
core developers are and right before one of the larger spark related 
conferences so more folks might be traveling than normal. Could we 
maybe extend the vote out an extra day or two just to give folks a 
chance to be heard?


On Wed, Jun 21, 2023 at 8:30 AM Reynold Xin  wrote:

+1

This is a great idea.


On Wed, Jun 21, 2023 at 8:29 AM, Holden Karau
 wrote:

I’d like to start with a +1, better Python testing tools
integrated into the project make sense.

On Wed, Jun 21, 2023 at 8:11 AM Amanda Liu
 wrote:

Hi all,

I'd like to start the vote for SPIP: PySpark Test Framework.

The high-level summary for the SPIP is that it proposes an
official test framework for PySpark. Currently, there are
only disparate open-source repos and blog posts for
PySpark testing resources. We can streamline and simplify
the testing process by incorporating test features, such
as a PySpark Test Base class (which allows tests to share
Spark sessions) and test util functions (for example,
asserting dataframe and schema equality).
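To make the idea concrete, here is a hand-rolled sketch of the kind of helpers being proposed; the class and method names are illustrative only, not the SPIP's actual API:

    import unittest

    from pyspark.sql import SparkSession

    class SharedSparkTestBase(unittest.TestCase):
        @classmethod
        def setUpClass(cls):
            # One session shared by every test in the class.
            cls.spark = (
                SparkSession.builder.master("local[2]")
                .appName("pyspark-tests")
                .getOrCreate()
            )

        @classmethod
        def tearDownClass(cls):
            cls.spark.stop()

        def assert_df_equal(self, left, right):
            # Naive check: same schema and same (sorted) rows.
            self.assertEqual(left.schema, right.schema)
            self.assertEqual(sorted(left.collect()), sorted(right.collect()))

    class ExampleTest(SharedSparkTestBase):
        def test_roundtrip(self):
            df = self.spark.createDataFrame([(1, "a"), (2, "b")], ["id", "label"])
            self.assert_df_equal(df, df.orderBy("id"))

    if __name__ == "__main__":
        unittest.main()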

*SPIP doc:*

https://docs.google.com/document/d/1OkyBn3JbEHkkQgSQ45Lq82esXjr9rm2Vj7Ih_4zycRc/edit#heading=h.f5f0u2riv07v

*JIRA ticket:*
https://issues.apache.org/jira/browse/SPARK-44042

*Discussion thread:*
https://lists.apache.org/thread/trwgbgn3ycoj8b8k8lkxko2hql23o41n

Please vote on the SPIP for the next 72 hours:
[ ] +1: Accept the proposal as an official SPIP
[ ] +0
[ ] -1: I don’t think this is a good idea because __.

Thank you!

Best,
Amanda Liu

-- 
Twitter: https://twitter.com/holdenkarau

Books (Learning Spark, High Performance Spark, etc.):
https://amzn.to/2MaRAG9 <https://amzn.to/2MaRAG9>
YouTube Live Streams: https://www.youtube.com/user/holdenkarau


--
Twitter: https://twitter.com/holdenkarau
Books (Learning Spark, High Performance Spark, etc.): 
https://amzn.to/2MaRAG9 <https://amzn.to/2MaRAG9>

YouTube Live Streams: https://www.youtube.com/user/holdenkarau






Re: [VOTE] Apache Spark PMC asks Databricks to differentiate its Spark version string

2023-06-20 Thread Maciej
member repeating this claim over and
over, without support. This is why we don't do this in public.

May I ask which relevant context you are insisting not to
receive specifically? I gave the specific examples
(UI/logs/screenshot), and got the specific legal advice
from `legal-discuss@` and replied why the version should
be different.


It is the thread I linked in my reply:
https://lists.apache.org/thread/k7gr65wt0fwtldc7hp7bd0vkg1k93rrb
This has already been discussed at length, and you're aware of
it, but, didn't mention it. I think that's critical; your text
contains no problem statement at all by itself.

Since we're here, fine: I vote -1, simply because this states
no reason for the action at all.
If we assume the thread ^^^ above is the extent of the logic,
then, -1 for the following reasons:
- Relevant ASF policy seems to say this is fine, as argued at
https://lists.apache.org/thread/p15tc772j9qwyvn852sh8ksmzrol9cof
- There is no argument any of this has caused a problem for
the community anyway; there is just nothing to 'fix'

    I would again ask we not simply repeat the same thread again.



--
Best regards,
Maciej Szymkiewicz

Web: https://zero323.net
PGP: A30CEF0C31A501EC





Re: [DISCUSS] SPIP: Python Data Source API

2023-06-20 Thread Maciej
Similarly to Jacek, I feel it fails to document an actual community need 
for such a feature.


Currently, any data source implementation has the potential to benefit 
Spark users across all supported and third-party clients. For generally 
available sources, this is advantageous for the whole Spark community 
and avoids creating 1st and 2nd-tier citizens. This is even more 
important with new officially supported languages being added through 
connect.


Instead, we might rather document in detail the process of implementing 
a new source using current APIs and work towards easily extensible or 
customizable sources, in case there is such a need.


--
Best regards,
Maciej Szymkiewicz

Web: https://zero323.net
PGP: A30CEF0C31A501EC


On 6/20/23 05:19, Hyukjin Kwon wrote:
Actually I support this idea in a way that Python developers don't 
have to learn Scala to write their own source (and separate packaging).
This is more crucial especially when you want to write a simple data 
source that interacts with the Python ecosystem.


On Tue, 20 Jun 2023 at 03:08, Denny Lee  wrote:

Slightly biased, but per my conversations - this would be awesome
to have!

On Mon, Jun 19, 2023 at 09:43 Abdeali Kothari
 wrote:

I would definitely use it - is it's available :)

On Mon, 19 Jun 2023, 21:56 Jacek Laskowski, 
wrote:

Hi Allison and devs,

Although I was against this idea at first sight (probably
because I'm a Scala dev), I think it could work as long as
there are people who'd be interested in such an API. Were
there any? I'm just curious. I've seen no emails
requesting it.

I also doubt that Python devs would like to work on new
data sources but support their wishes wholeheartedly :)

Pozdrawiam,
Jacek Laskowski

"The Internals Of" Online Books <https://books.japila.pl/>
Follow me on https://twitter.com/jaceklaskowski



On Fri, Jun 16, 2023 at 6:14 AM Allison Wang
 wrote:

Hi everyone,

I would like to start a discussion on “Python Data
Source API”.

This proposal aims to introduce a simple API in Python
for Data Sources. The idea is to enable Python
developers to create data sources without having to
learn Scala or deal with the complexities of the
current data source APIs. The goal is to make a
Python-based API that is simple and easy to use, thus
making Spark more accessible to the wider Python
developer community. This proposed approach is based
on the recently introduced Python user-defined table
functions with extensions to support data sources.

*SPIP Doc*:

https://docs.google.com/document/d/1oYrCKEKHzznljYfJO4kx5K_Npcgt1Slyfph3NEk7JRU/edit?usp=sharing


*SPIP JIRA*:
https://issues.apache.org/jira/browse/SPARK-44076

Looking forward to your feedback.

Thanks,
Allison








Re: [CONNECT] New Clients for Go and Rust

2023-06-01 Thread Maciej

Hi Martin,


On 5/30/23 11:50, Martin Grund wrote:
> I think it makes sense to split this discussion into two pieces. On
> the contribution side, my personal perspective is that these new
> clients are explicitly marked as experimental and unsupported until
> we deem them mature enough to be supported using the standard release
> process etc. However, the goal should be that the main contributors
> of these clients are aiming to follow the same release and maintenance
> schedule. I think we should encourage the community to contribute to
> the Spark Connect clients and as such we should explicitly not make it
> as hard as possible to get started (and for that reason reserve the
> right to abandon).


I know it sounds like nitpicking, but we still have components
deprecated in 1.2 or 1.3, not to mention subprojects that haven't been
developed for years. So there is a huge gap between reserving a right
and actually exercising it when needed. If such a right is to be used
differently for Spark Connect bindings, it's something that should be
communicated upfront.


> How exactly the release schedule is going to look is going to require
> probably some experimentation because it's a new area for Spark and
> its ecosystem. I don't think it requires us to have all answers upfront.


Nonetheless, we should work towards establishing consensus around these 
issues and documenting the answers. They affect not only the maintainers 
(see for example a recent discussion about switching to a more 
predictable release schedule) but also the users, for whom multiple APIs 
(including their development status) have been a common source of 
confusion in the past.


>> Also, an elephant in the room is the future of the current API in
>> Spark 4 and onwards. As useful as connect is, it is not exactly a
>> replacement for many existing deployments. Furthermore, it doesn't
>> make extending Spark much easier and the current ecosystem is,
>> subjectively speaking, a bit brittle.
>
> The goal of Spark Connect is not to replace the way users are
> currently deploying Spark, it's not meant to be that. Users should
> continue deploying Spark in exactly the way they prefer. Spark
> Connect allows bringing more interactivity and connectivity to Spark.
> While Spark Connect extends Spark, most new language consumers will
> not try to extend Spark, but simply provide the existing surface to
> their native language. So the goal is not so much extensibility but
> more availability. For example, I believe it would be awesome if the
> Livy community would find a way to integrate with Spark Connect to
> provide the routing capabilities to provide a stable DNS endpoint for
> all different Spark deployments.
>
>> [...] the current ecosystem is, subjectively speaking, a bit
>> brittle.
>
> Can you help me understand that a bit better? Do you mean the Spark
> ecosystem or the Spark Connect ecosystem?


I mean Spark in general. While most of the core and some closely related 
projects are well maintained, tools built on top of Spark, even ones 
supported by major stakeholders, are often short-lived and left 
unmaintained, if not officially abandoned.


New languages aside, without a single extension point (which, for core
Spark, is the JVM interface), maintaining public projects on top of Spark
becomes even less attractive. That, assuming we don't completely reject
the idea of extending Spark functionality while using Spark Connect,
effectively limits the target audience for any 3rd-party library.



> Martin
>
> On Fri, May 26, 2023 at 5:39 PM Maciej wrote:
>
> It might be a good idea to have a discussion about how new connect
> clients fit into the overall process we have. In particular:
>
> * Under what conditions do we consider adding a new language to the
> official channels? What process do we follow?
> * What guarantees do we offer in respect to these clients? Is adding
> a new client the same type of commitment as for the core API? In other
> words, do we commit to maintaining such clients "forever" or do we
> separate the "official" and "contrib" clients, with the latter being
> governed by the ASF, but not guaranteed to be maintained in the future?
> * Do we follow the same release schedule as for the core project, or
> rather release each client separately, after the main release is
> completed?
>
> Also, an elephant in the room is the future of the current API in
> Spark 4 and onwards. As useful as connect is, it is not exactly a
> replacement for many existing deployments. Furthermore, it doesn't
> make extending Spark much easier and the current ecosystem is,
> subjectively speaking, a bit brittle.
>
> -- Best regards, Maciej
>
> On 5/26

Re: [CONNECT] New Clients for Go and Rust

2023-05-26 Thread Maciej
It might be a good idea to have a discussion about how new connect 
clients fit into the overall process we have. In particular:


 * Under what conditions do we consider adding a new language to the
   official channels?  What process do we follow?
 * What guarantees do we offer in respect to these clients? Is adding a
   new client the same type of commitment as for the core API? In other
   words, do we commit to maintaining such clients "forever" or do we
   separate the "official" and "contrib" clients, with the latter being
   governed by the ASF, but not guaranteed to be maintained in the future?
 * Do we follow the same release schedule as for the core project, or
   rather release each client separately, after the main release is
   completed?

Also, an elephant in the room is the future of the current API in Spark 
4 and onwards. As useful as connect is, it is not exactly a replacement 
for many existing deployments. Furthermore, it doesn't make extending 
Spark much easier and the current ecosystem is, subjectively speaking, a 
bit brittle.


--
Best regards,
Maciej


On 5/26/23 07:26, Martin Grund wrote:
Thanks everyone for your feedback! I will work on figuring out what it 
takes to get started with a repo for the go client.


On Thu 25. May 2023 at 21:51 Chao Sun  wrote:

+1 on separate repo too

On Thu, May 25, 2023 at 12:43 PM Dongjoon Hyun
 wrote:
>
> +1 for starting on a separate repo.
>
> Dongjoon.
>
> On Thu, May 25, 2023 at 9:53 AM yangjie01 wrote:
>>
>> +1 on start this with a separate repo.
>>
>> Which new clients can be placed in the main repo should be
discussed after they are mature enough,
>>
>>
>>
>> Yang Jie
>>
>>
>>
>> From: Denny Lee
>> Date: Wednesday, May 24, 2023, 21:31
>> To: Hyukjin Kwon
>> Cc: Maciej, "dev@spark.apache.org"
>> Subject: Re: [CONNECT] New Clients for Go and Rust
>>
>>
>>
>> +1 on separate repo allowing different APIs to run at different
speeds and ensuring they get community support.
>>
>>
>>
>> On Wed, May 24, 2023 at 00:37 Hyukjin Kwon
 wrote:
>>
>> I think we can just start this with a separate repo.
>> I am fine with the second option too but in this case we would
have to triage which language to add into the main repo.
>>
>>
>>
>> On Fri, 19 May 2023 at 22:28, Maciej 
wrote:
>>
>> Hi,
>>
>>
>>
>> Personally, I'm strongly against the second option and have
some preference towards the third one (or maybe a mix of the first
one and the third one).
>>
>>
>>
>> The project is already pretty large as-is and, with an
extremely conservative approach towards removal of APIs, it only
tends to grow over time. Making it even larger is not going to
make things more maintainable and is likely to create an entry
barrier for new contributors (that's similar to Jia's arguments).
>>
>>
>>
>> Moreover, we've seen quite a few different language clients
over the years and all but one or two survived while none is
particularly active, as far as I'm aware.  Taking responsibility
for more clients, without being sure that we have resources to
maintain them and there is enough community around them to make
such effort worthwhile, doesn't seem like a good idea.
>>
>>
>>
>> --
>>
>> Best regards,
>>
>> Maciej Szymkiewicz
>>
>>
>>
>> Web: https://zero323.net
>>
>> PGP: A30CEF0C31A501EC
>>
>>
>>
>>
>>
>> On 5/19/23 14:57, Jia Fan wrote:
>>
>> Hi,
>>
>>
>>
>> Thanks for contribution!
>>
>> I prefer (1). There are some reason:
>>
>>
>>
>> 1. Different repository can maintain independent versions,
different release times, and faster bug fix releases.
>>
>>
>>
>> 2. Different languages have different build tools. Putting them
in one repository will make the main repository more and more
complicated, and it will become extremely difficult to perform a
complete build in the main repository.
>>
>>
>>
>> 3. Different repository will make CI configuration and execute
easier, and the PR and commit lists will be clearer.
>>
>>
>>
>> 4. Other r

Re: [DISCUSS] Add SQL functions into Scala, Python and R API

2023-05-26 Thread Maciej
Weren't some of these functions provided only for compatibility and
intentionally left out of the language APIs?


--
Best regards,
Maciej

On 5/25/23 23:21, Hyukjin Kwon wrote:
I don't think it'd be a release blocker .. I think we can implement 
them across multiple releases.


On Fri, May 26, 2023 at 1:01 AM Dongjoon Hyun 
 wrote:


Thank you for the proposal.

I'm wondering if we are going to consider them as release blockers
or not.

In general, I don't think those SQL functions should be available
in all languages as release blockers.
(Especially in R or new Spark Connect languages like Go and Rust).

If they are not release blockers, we may allow some existing or
future community PRs only before feature freeze (= branch cut).

Thanks,
Dongjoon.


On Wed, May 24, 2023 at 7:09 PM Jia Fan  wrote:

+1
It is important that different APIs can be used to call the
same function

Ryan Berti wrote on Thursday, May 25, 2023 at 01:48:

During my recent experience developing functions, I found
that identifying locations (sql + connect
functions.scala + functions.py, FunctionRegistry, +
whatever is required for R) and standards for adding
function signatures was not straight forward (should you
use optional args or overload functions? which col/lit
helpers should be used when?). Are there docs describing
all of the locations + standards for defining a function?
If not, that'd be great to have too.

Ryan Berti

Senior Data Engineer  |  Ads DE

M 7023217573

5808 W Sunset Blvd  |  Los Angeles, CA 90028




On Wed, May 24, 2023 at 12:44 AM Enrico Minack
 wrote:

+1

Functions available in SQL (more general in one API)
should be available in all APIs. I am very much in
favor of this.

Enrico


On 24.05.23 at 09:41, Hyukjin Kwon wrote:


Hi all,

I would like to discuss adding all SQL functions into
Scala, Python and R API.
We have SQL functions that do not exist in Scala,
Python and R around 175.
For example, we don’t have
`pyspark.sql.functions.percentile` but you can invoke
it as a SQL function, e.g., `SELECT percentile(...)`.

The reason why we do not have all functions in the
first place is that we want to
only add commonly used functions, see also
https://github.com/apache/spark/pull/21318 (which I
agreed at that time)

However, this has been raised multiple times over
years, from the OSS community, dev mailing list,
JIRAs, stackoverflow, etc.
Seems it’s confusing about which function is
available or not.

Yes, we have a workaround. We can call all
expressions by `expr("...")` or `call_udf("...",
Columns ...)`
But still it seems that it’s not very user-friendly
because they expect them available under the
functions namespace.
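As a quick illustration of that workaround, a minimal sketch (the app name and column names are placeholders):

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("percentile-demo").getOrCreate()
    df = spark.range(1, 101).withColumnRenamed("id", "value")

    # Via plain SQL:
    df.createOrReplaceTempView("t")
    spark.sql("SELECT percentile(value, 0.5) AS median FROM t").show()

    # Via expr() from the DataFrame API, since there is no
    # pyspark.sql.functions.percentile wrapper at the time of this thread:
    df.select(F.expr("percentile(value, 0.5)").alias("median")).show()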

Therefore, I would like to propose adding all
expressions into all languages so that Spark is
simpler and less confusing, e.g., which API is in
functions or not.

Any thoughts?









Re: [CONNECT] New Clients for Go and Rust

2023-05-19 Thread Maciej

Hi,

Personally, I'm strongly against the second option and have some 
preference towards the third one (or maybe a mix of the first one and 
the third one).


The project is already pretty large as-is and, with an extremely 
conservative approach towards removal of APIs, it only tends to grow 
over time. Making it even larger is not going to make things more 
maintainable and is likely to create an entry barrier for new 
contributors (that's similar to Jia's arguments).


Moreover, we've seen quite a few different language clients over the 
years and all but one or two survived while none is particularly active, 
as far as I'm aware.  Taking responsibility for more clients, without 
being sure that we have resources to maintain them and there is enough 
community around them to make such effort worthwhile, doesn't seem like 
a good idea.


--
Best regards,
Maciej Szymkiewicz

Web: https://zero323.net
PGP: A30CEF0C31A501EC



On 5/19/23 14:57, Jia Fan wrote:

Hi,

Thanks for contribution!
I prefer (1). There are some reason:

1. Different repository can maintain independent versions, different 
release times, and faster bug fix releases.


2. Different languages have different build tools. Putting them in one 
repository will make the main repository more and more complicated, 
and it will become extremely difficult to perform a complete build in 
the main repository.


3. Different repository will make CI configuration and execute easier, 
and the PR and commit lists will be clearer.


4. Other repository also have different client to governed, like 
clickhouse. It use different repository for jdbc, odbc, c++. Please refer:

https://github.com/ClickHouse/clickhouse-java
https://github.com/ClickHouse/clickhouse-odbc
https://github.com/ClickHouse/clickhouse-cpp

PS: I'm looking forward to the javascript connect client!

Thanks Regards
Jia Fan

Martin Grund wrote on Friday, May 19, 2023 at 20:03:

Hi folks,

When Bo (thanks for the time and contribution) started the work on
https://github.com/apache/spark/pull/41036 he started the Go
client directly in the Spark repository. In the meantime, I was
approached by other engineers who are willing to contribute to
working on a Rust client for Spark Connect.

Now one of the key questions is where should these connectors live
and how we manage expectations most effectively.

At the high level, there are two approaches:

(1) "3rd party" (non-JVM / Python) clients should live in separate
repositories owned and governed by the Apache Spark community.

(2) All clients should live in the main Apache Spark repository in
the `connector/connect/client` directory.

(3) Non-native (Python, JVM) Spark Connect clients should not be
part of the Apache Spark repository and governance rules.

Before we iron out how exactly, we mark these clients as
experimental and how we align their release process etc with
Spark, my suggestion would be to get a consensus on this first
question.

Personally, I'm fine with (1) and (2) with a preference for (2).

Would love to get feedback from other members of the community!

Thanks
Martin









Re: Slack for Spark Community: Merging various threads

2023-04-08 Thread Maciej
@Bjørn Matrix (represented by Element in the linked summary; also, since
last year, Rocket.Chat uses Matrix under the hood) is already used
for user support and related discussions in a number of large projects,
since Gitter migrated there. And it is not like we need Slack or its
replacement in the first place. Some of the Slack features are useful
for us, but it's not exactly the best tool for user support IMHO.


@Dongjoon There are probably two more things we should discuss:

 * What are data privacy obligations while keeping a communication
   channel, advertised as official, outside the ASF?  Does it put it
   out of scope for the ASF legal and data privacy teams?

   If I recall correctly, Slack requires at least some of the
   formalities to be handled by the primary owner and as far as I am
   aware the project is not a legal person. Not sure how linen.dev or
   another retention tool fits into all of that, but it's unrealistic
   to expect it makes things easier.

   This might sound hypothetical, but we've already seen users leaking
   sensitive information on the mail list and requesting erasure
   (which, luckily for us, is not technically possible).

 * How are we going to handle moderation, if we assume number of users
   comparable to Delta Lake Slack and open registration? At minimum we
   have to ensure that the ASF Code of Conduct is respected. An
   official channel or not, failure to do that reflects badly on the
   project, the ASF and all of us.

--
Maciej



On 4/7/23 21:02, Bjørn Jørgensen wrote:
Yes, I have done some search for slack alternatives 
<https://itsfoss.com/open-source-slack-alternative/>
I feel that we should do some search, to find if there can be a 
better solution than slack.
For what I have found, there are two that can be an alternative for 
slack.


Rocket.Chat <https://www.rocket.chat/>

and

Zulip Chat <https://zulip.com>
Zulip Cloud Standard is free for open-source projects 
<https://zulip.com/for/open-source/>

Witch means we get

  * Unlimited search history
  * File storage up to 10 GB per user
  * Message retention policies
<https://sparkzulip.zulipchat.com/help/message-retention-policy>
  * Brand Zulip with your logo
  * Priority commercial support
  * Funds the Zulip open source project


Rust is using zulip <https://forge.rust-lang.org/platforms/zulip.html>

We can import chats from slack 
<https://sparkzulip.zulipchat.com/help/import-from-slack>
We can use zulip for events <https://zulip.com/for/events/> With 
multi-use invite links <https://zulip.com/help/invite-new-users>, 
there’s no need to create individual Zulip invitations.  This means 
that PMC doesn't have to send a link to every user.



  CODE BLOCKS

Discuss code with ease using Markdown code blocks, syntax 
highlighting, and code playgrounds 
<https://zulip.com/help/code-blocks#code-playgrounds>.







On Fri, Apr 7, 2023 at 18:54, Holden Karau wrote:

I think there was some concern around how to make any sync channel
show up in logs / index / search results?

On Fri, Apr 7, 2023 at 9:41 AM Dongjoon Hyun
 wrote:

Thank you, All.

I'm very satisfied with the focused and right questions for
the real issues by removing irrelevant claims. :)

Let me collect your relevant comments simply.


# Category 1: Invitation Hurdle

> The key question here is that do PMC members have the
bandwidth of inviting everyone in user@ and dev@?

> Extending this to inviting everyone on @user (over >4k 
subscribers according to the previous thread) might be a stretch,

> we should have an official project Slack with an easy
invitation process.


# Category 2: Controllability

> Additionally, there is no indication that the-asf.slack.com
<http://the-asf.slack.com> is intended for general support.

> I would also lean towards a standalone workspace, where we
have more control over organizing the channels,


# Category 3: Policy Suggestion

> *Developer* discussions should still happen on email, JIRA
and GitHub and be async-friendly (72-hour rule) to fit the
ASF’s development model.


Are there any other questions?


Dongjoon.


-- 
Twitter: https://twitter.com/holdenkarau

Books (Learning Spark, High Performance Spark, etc.):
https://amzn.to/2MaRAG9 <https://amzn.to/2MaRAG9>
YouTube Live Streams: https://www.youtube.com/user/holdenkarau



--
Bjørn Jørgensen
Vestre Aspehaug 4, 6010 Ålesund
Norge

+47 480 94 297







Re: Slack for Spark Community: Merging various threads

2023-04-06 Thread Maciej
Additionally, there is no indication that the-asf.slack.com is intended
for general support. In particular, it states the following:


> The Apache Software Foundation has a workspace on Slack 
<https://the-asf.slack.com/> to provide channels on which people working 
on the same ASF project, or in the same area of the Foundation, can 
discuss issues, solve problems, and build community in real-time.


and then

> Other contributors and interested parties (observers, former members, 
software evaluators, members of the media, those without an @apache.org 
address) who want to participate in channels in the ASF workspace can 
use a *guest* account.


Extending this to inviting everyone on @user (over >4k subscribers
according to the previous thread) might be a stretch, especially without
knowing the details of the agreement between the ASF and Slack
Technologies.


--
Best regards,
Maciej Szymkiewicz

Web: https://zero323.net
PGP: A30CEF0C31A501EC


On 4/6/23 17:13, Denny Lee wrote:
Thanks Dongjoon, but I don't think this is misleading insofar that 
this is not a /self-service process/ but an invite process which 
admittedly I did not state explicitly in my previous thread.  And 
thanks for the invite to the-ASF Slack - I just joined :)


Saying this, I do completely agree with your two assertions:

  * /Shall we narrow-down our focus on comparing the ASF Slack vs
another 3rd-party Slack because all of us agree that this is
important? /
  o Yes, I do agree that is an important aspect, all else being equal.

  * /I'm wondering what ASF misses here if Apache Spark PMC invites
all remaining subscribers of `user@spark` and `dev@spark` mailing
lists./
  o The key question here is whether PMC members have the
bandwidth to invite everyone in user@ and dev@. There is a
lot of overhead in maintaining this, so my key concern
is whether we have the number of volunteers to manage this. Note,
I'm willing to help with this process as well; it was just more
of a matter that there are a lot of folks to approve.
  o A reason why we may want to consider Spark's own Slack is
because we can potentially create different channels within
Slack to more easily group messages (e.g. different threads
for troubleshooting, RDDs, streaming, etc.).  Again, we'd need
someone to manage this so that way we don't have an out of
control number of channels.

WDYT?



On Wed, Apr 5, 2023 at 10:50 PM Dongjoon Hyun 
 wrote:


Thank you so much, Denny.
Yes, let me comment on a few things.

>  - While there is an ASF Slack
<https://infra.apache.org/slack.html>, it
>    requires an @apache.org <http://apache.org> email address

1. This sounds a little misleading because we can see `guest`
accounts in the same link. People can be invited by "Invite people
to ASF" link. I invited you, Denny, and attached the screenshots.

>   using linen.dev <http://linen.dev> as its Slack archive (so we
can surpass the 90 days limit)

2. The official Foundation-supported Slack workspace preserves all
messages.
    (the-asf.slack.com <http://the-asf.slack.com>)

> Why: Allows for the community to have the option to communicate
with each
> other using Slack; a pretty popular async communication.

3. ASF foundation not only allows but also provides the option to
communicate with each other using Slack as of today.

Given the above (1) and (3), I don't think we asked the right
questions during most of the parts.

1. Shall we narrow-down our focus on comparing the ASF Slack vs
another 3rd-party Slack because all of us agree that this is
important?
2. I'm wondering what ASF misses here if Apache Spark PMC invites
all remaining subscribers of `user@spark` and `dev@spark` mailing
lists.

Thanks,
Dongjoon.

invitation.png
invited.png

On Wed, Apr 5, 2023 at 7:23 PM Denny Lee 
wrote:

There have been a number of threads discussing creating a
Slack for the Spark community that I'd like to try to help
reconcile.

Topic: Slack for Spark

Why: Allows for the community to have the option to
communicate with each other using Slack; a pretty popular
async communication.

Discussion points:

  * There are other ASF projects that use Slack including
Druid <https://druid.apache.org/community/>, Parquet
<https://parquet.apache.org/community/>, Iceberg
<https://iceberg.apache.org/community/>, and Hudi
<https://hudi.apache.org/community/get-involved/>
  * Flink <https://flink.apache.org/community/> is also using
Slack and using linen.dev <http://linen.dev> as its Slack
archive (so we can s

Re: Slack for Spark Community: Merging various threads

2023-04-06 Thread Maciej
as its Slack
archive (so we can surpass the 90 days limit) which is
also Google searchable (Delta Lake
<https://www.linen.dev/s/delta-lake/> is also using this
service as well)
  * While there is an ASF Slack
<https://infra.apache.org/slack.html>, it requires
an @apache.org <http://apache.org> email address to use
which is quite limiting which is why these (and many
other) OSS projects are using the free-tier Slack
  * It does require managing Slack properly as Slack free
edition limits you to approx 100 invites. One of the ways
to resolve this is to create a bit.ly <http://bit.ly> link
so we can manage the invites without regularly updating
the website with the new invite link.

Are there any other points of discussion that we should add
here?  I'm glad to work with whomever to help manage the
various aspects of Slack (code of conduct, linen.dev
<http://linen.dev> and search/archive process, invite
management, etc.).

HTH!
Denny



--
Best regards,
Maciej Szymkiewicz

Web:https://zero323.net
PGP: A30CEF0C31A501EC



OpenPGP_signature
Description: OpenPGP digital signature


Re: [DISCUSS] Make release cadence predictable

2023-02-15 Thread Maciej

Hi,

Sorry for a silly question, but do we know what exactly caused these 
delays? Are these avoidable?


It is not a systematic observation, but my general impression is that we 
rarely delay for the sake of individual features, unless there is some soft 
consensus about their importance. Arguably, these could be postponed, 
assuming we can adhere to the schedule.


And then, we're left with large, multi-task features. A lot can be done 
with proper timing and design, but in our current process there is no 
way to guarantee that each of these can be delivered within a given time 
window. How are we going to handle these? Delivering half-baked things 
is hardly a satisfying solution, and a more rigid schedule can only increase 
pressure on maintainers. Do we plan to introduce something like feature 
branches for these, to isolate the upcoming release in case of delays?


On 2/14/23 19:53, Dongjoon Hyun wrote:

+1 for Hyukjin and Sean's opinion.

Thank you for initiating this discussion.

If we have a fixed-predefined regular 6-month, I believe we can 
persuade the incomplete features to wait for next releases more easily.


In addition, I want to add the first RC1 date requirement because RC1 
always did a great job for us.


I guess `branch-cut + 1M (no later than 1month)` could be the 
reasonable deadline.


Thanks,
Dongjoon.


On Tue, Feb 14, 2023 at 6:33 AM Sean Owen  wrote:

I'm fine with shifting to a stricter cadence-based schedule.
Sometimes, it'll mean some significant change misses a release
rather than delays it. If people are OK with that discipline, sure.
A hard 6-month cycle would mean the minor releases are more
frequent and have less change in them. That's probably OK. We
could also decide to choose a longer cadence like 9 months, but I
don't know if that's better.
I assume maintenance releases would still be as-needed, and major
releases would also work differently - probably no 4.0 until next
year at the earliest.

On Tue, Feb 14, 2023 at 3:01 AM Hyukjin Kwon 
wrote:

Hi all,

*TL;DR*: Branch cut for every 6 months (January and July).

I would like to discuss/propose to make our release cadence
predictable. In our documentation, we mention as follows:

In general, feature (“minor”) releases occur about every 6
months. Hence,
Spark 2.3.0 would generally be released about 6 months
after 2.2.0.

However, the reality is slightly different. Here is the time
it took for the recent releases:

  * Spark 3.3.0 took 8 months
  * Spark 3.2.0 took 7 months
  * Spark 3.1 took 9 months

Here are problems caused by such delay:

  * The whole related schedules are affected in all downstream
projects, vendors, etc.
  * It makes the release date unpredictable to the end users.
  * Developers as well as the release managers have to rush
because of the delay, which prevents us from focusing on
having a proper regression-free release.

My proposal is to branch cut every 6 months (January and July
that avoids the public holidays / vacation period in general)
so the release can happen twice
every year regardless of the actual release date.
I believe it both makes the release cadence predictable, and
relaxes the burden about making releases.

WDYT?



--
Best regards,
Maciej Szymkiewicz

Web:https://zero323.net
PGP: A30CEF0C31A501EC



OpenPGP_signature
Description: OpenPGP digital signature


Re: How can I get the same spark context in two different python processes

2022-12-13 Thread Maciej

Hi,

Unfortunately, I don't have a working example I could share at hand, but 
the flow will be roughly like this


- Retrieve an existing Python ClientServer  (gateway) from the SparkContext
- Get its gateway_parameters (some are constant for PySpark, but you'll 
need at least port and auth_token)

- Pass these to a new process and use them to initialize a new ClientServer
- From ClientServer jvm retrieve bindings for JVM SparkContext
- Use JVM binding and gateway to initialize Python SparkContext in your 
process.
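
In (heavily simplified, untested) pseudo-Python it would look more or less 
like this ‒ these are private APIs, and the exact parameters may differ 
between Spark and Py4j versions, so treat it only as a sketch:

from py4j.clientserver import ClientServer, JavaParameters, PythonParameters
from pyspark import SparkContext

# In the parent process, where the SparkContext already exists:
params = SparkContext._gateway.gateway_parameters
conn_info = (params.port, params.auth_token)  # hand these to the child process

# In the child process:
port, token = conn_info
gateway = ClientServer(
    java_parameters=JavaParameters(
        port=port, auth_token=token, auto_convert=True),
    python_parameters=PythonParameters(port=0, eager_load=False))
jvm = gateway.jvm
# Retrieve the running JVM SparkContext and wrap it in a JavaSparkContext
jsc = jvm.org.apache.spark.api.java.JavaSparkContext(
    jvm.org.apache.spark.SparkContext.getOrCreate())
# Initialize the Python-side SparkContext against the borrowed gateway
sc = SparkContext(gateway=gateway, jsc=jsc)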



Just to reiterate ‒ it is not something that we support (or Py4j for 
that matter) so don't do it unless you fully understand the implications 
(including, but not limited to, risk of leaking the token). Use this 
approach at your own risk.



On 12/13/22 03:52, Kevin Su wrote:

Maciej, Thanks for the reply.
Could you share an example to achieve it?

Maciej mailto:mszymkiew...@gmail.com>> 於 2022 
年12月12日 週一 下午4:41寫道:


Technically speaking, it is possible in stock distribution (can't speak
for Databricks) and not super hard to do (just check out how we
initialize sessions), but definitely not something that we test or
support, especially in a scenario you described.

If you want to achieve concurrent execution, multithreading is normally
more than sufficient and avoids problems with the context.



On 12/13/22 00:40, Kevin Su wrote:
 > I ran my spark job by using databricks job with a single python
script.
 > IIUC, the databricks platform will create a spark context for this
 > python script.
 > However, I create a new subprocess in this script and run some spark
 > code in this subprocess, but this subprocess can't find the
 > context created by databricks.
 > Not sure if there is any api I can use to get the default context.
 >
 > bo yang mailto:bobyan...@gmail.com>
<mailto:bobyan...@gmail.com <mailto:bobyan...@gmail.com>>> 於 2022年
12月
 > 12日 週一 下午3:27寫道:
 >
 >     In theory, maybe a Jupyter notebook or something similar could
 >     achieve this? e.g. running some Jypyter kernel inside Spark
driver,
 >     then another Python process could connect to that kernel.
 >
 >     But in the end, this is like Spark Connect :)
 >
 >
 >     On Mon, Dec 12, 2022 at 2:55 PM Kevin Su mailto:pings...@gmail.com>
 >     <mailto:pings...@gmail.com <mailto:pings...@gmail.com>>> wrote:
 >
 >         Also, is there any way to workaround this issue without
 >         using Spark connect?
 >
 >         Kevin Su mailto:pings...@gmail.com>
<mailto:pings...@gmail.com <mailto:pings...@gmail.com>>> 於
 >         2022年12月12日 週一 下午2:52寫道:
 >
 >             nvm, I found the ticket.
 >             Also, is there any way to workaround this issue without
 >             using Spark connect?
 >
 >             Kevin Su mailto:pings...@gmail.com> <mailto:pings...@gmail.com
<mailto:pings...@gmail.com>>> 於
 >             2022年12月12日 週一 下午2:42寫道:
 >
 >                 Thanks for the quick response? Do we have any PR
or Jira
 >                 ticket for it?
 >
 >                 Reynold Xin mailto:r...@databricks.com>
 >                 <mailto:r...@databricks.com
<mailto:r...@databricks.com>>> 於 2022年12月12日 週一 下
 >                 午2:39寫道:
 >
 >                     Spark Connect :)
 >
 >                     (It’s work in progress)
 >
 >
 >                     On Mon, Dec 12 2022 at 2:29 PM, Kevin Su
 >                     mailto:pings...@gmail.com> <mailto:pings...@gmail.com
<mailto:pings...@gmail.com>>> wrote:
 >
 >                         Hey there, How can I get the same spark
context
 >                         in two different python processes?
 >                         Let’s say I create a context in Process
A, and
 >                         then I want to use python subprocess B to get
 >                         the spark context created by Process A.
How can
 >                         I achieve that?
 >
 >                         I've
 >       
  tried pyspark.sql.SparkSession.builder.appName("spark").getOrCreate(), but it will create a new spark context.

 >

-- 
Best regards,

Maciej Szymkiewicz

Web: https://zero323.net <https://zero323.net>
PGP: A30CEF0C31A501EC



--
Best regards,
Maciej Szymkiewicz

Web: https://zero323.net
PGP: A30CEF0C31A501EC



OpenPGP_signature
Description: OpenPGP digital signature


Re: How can I get the same spark context in two different python processes

2022-12-12 Thread Maciej
Technically speaking, it is possible in stock distribution (can't speak 
for Databricks) and not super hard to do (just check out how we 
initialize sessions), but definitely not something that we test or 
support, especially in a scenario you described.


If you want to achieve concurrent execution, multithreading is normally 
more than sufficient and avoids problems with the context.
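
For example (sketch only ‒ `spark` is the already initialized session and 
the table names are made up):

from concurrent.futures import ThreadPoolExecutor

def count_table(name):
    # All threads share the same driver-side SparkContext / SparkSession,
    # so jobs are submitted concurrently against a single context.
    return spark.table(name).count()

with ThreadPoolExecutor(max_workers=4) as pool:
    counts = list(pool.map(count_table, ["t1", "t2", "t3"]))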




On 12/13/22 00:40, Kevin Su wrote:

I ran my spark job by using databricks job with a single python script.
IIUC, the databricks platform will create a spark context for this 
python script.
However, I create a new subprocess in this script and run some spark 
code in this subprocess, but this subprocess can't find the 
context created by databricks.

Not sure if there is any api I can use to get the default context.

bo yang mailto:bobyan...@gmail.com>> 於 2022年12月 
12日 週一 下午3:27寫道:


In theory, maybe a Jupyter notebook or something similar could
achieve this? e.g. running some Jypyter kernel inside Spark driver,
then another Python process could connect to that kernel.

But in the end, this is like Spark Connect :)


On Mon, Dec 12, 2022 at 2:55 PM Kevin Su mailto:pings...@gmail.com>> wrote:

Also, is there any way to workaround this issue without
using Spark connect?

Kevin Su mailto:pings...@gmail.com>> 於
2022年12月12日 週一 下午2:52寫道:

nvm, I found the ticket.
Also, is there any way to workaround this issue without
using Spark connect?

Kevin Su mailto:pings...@gmail.com>> 於
2022年12月12日 週一 下午2:42寫道:

Thanks for the quick response? Do we have any PR or Jira
ticket for it?

Reynold Xin mailto:r...@databricks.com>> 於 2022年12月12日 週一 下
午2:39寫道:

Spark Connect :)

(It’s work in progress)


On Mon, Dec 12 2022 at 2:29 PM, Kevin Su
mailto:pings...@gmail.com>> wrote:

Hey there, How can I get the same spark context
in two different python processes?
Let’s say I create a context in Process A, and
then I want to use python subprocess B to get
the spark context created by Process A. How can
I achieve that?

I've
tried 
pyspark.sql.SparkSession.builder.appName("spark").getOrCreate(), but it will 
create a new spark context.



--
Best regards,
Maciej Szymkiewicz

Web: https://zero323.net
PGP: A30CEF0C31A501EC



OpenPGP_signature
Description: OpenPGP digital signature


Re: Syndicate Apache Spark Twitter to Mastodon?

2022-12-11 Thread Maciej

Thanks for proposing this Holden.

On 12/1/22 17:09, Holden Karau wrote:
The main negatives that I can think of is an additional account for the 
PMC to maintain so if we as a community don’t have many people on 
Mastodon yet it might not be worth it. Would need probably about ~20 
minutes of setup work to make the sync (probably most of it is finding 
someone with the Twitter credentials to enable to sync). The other 
tricky one is picking a server (there is no default ASF server that I 
know of).


Airflow uses fosstodon (@airf...@fosstodon.org) but that's the only ASF 
project on Mastodon I'm aware of. It might be a good starting point for 
us as well, and it is not hard to permanently move an account to a new 
server in the future.


It might be worthwhile to reach out to the ASF and see if there is 
enough interest and some resources to spare to set up a dedicated instance.




On Thu, Dec 1, 2022 at 8:03 AM Russell Spitzer 
mailto:russell.spit...@gmail.com>> wrote:


Since this is just syndication I don't think arguments on the
benefits of Twitter vs Mastodon are that important, it's really just
what are the costs of additionally posting to Mastodon. I'm assuming
those costs are basically 0 since this can be done by a bot? So I
don't think there is any strong reason not to do so.



On Nov 30, 2022, at 5:51 PM, Dmitry mailto:frostb...@gmail.com>> wrote:

My personal opinion, one of the most features of Twiiter that it
is not federated and is good platform for annonces and so on. So
it means "it would be good to reach our users where they are"
means stay in twitter(most companies who use Spark/Databricks are
in Twitter)
For Federated  features, I think Slack would be a better platform,
a lot of Apache Big data projects have slack for federated features

чт, 1 дек. 2022 г., 02:33 Holden Karau mailto:hol...@pigscanfly.ca>>:

I agree that there is probably a majority still on twitter,
but it would be a syndication (e.g. we'd keep both).

As to the # of devs it's hard to say since:
1) It's a federated service
2) Figuring out if an account is a dev or not is hard

But, for example,

There seems to be roughly an aggregate 6 million users (
https://observablehq.com/@simonw/mastodon-users-and-statuses-over-time 
<https://observablehq.com/@simonw/mastodon-users-and-statuses-over-time> ), 
which seems to be about only ~1% of Twitters size.

Nova's (large K8s focused I believe) has ~29k, tech.lgbt has
~6k, The BSD mastodon has ~1k ( https://bsd.network/about
<https://bsd.network/about> )

It's hard to say, but I've noticed a larger number of my tech
affiliated friends moving to Mastodon (personally I now do both).

On Wed, Nov 30, 2022 at 3:17 PM Dmitry mailto:frostb...@gmail.com>> wrote:

Hello,
Does any long-term statistics about number of developers
who moved to mastodon and activity use exists?

I believe the most devs are still using Twitter.


чт, 1 дек. 2022 г., 01:35 Holden Karau
mailto:hol...@pigscanfly.ca>>:

Do we want to start syndicating Apache Spark Twitter
to a Mastodon instance. It seems like a lot of
software dev folks are moving over there and it would
be good to reach our users where they are.

Any objections / concerns? Any thoughts on which
server we should pick if we do this?
-- 
Twitter: https://twitter.com/holdenkarau

<https://twitter.com/holdenkarau>
Books (Learning Spark, High Performance Spark, etc.):
https://amzn.to/2MaRAG9 <https://amzn.to/2MaRAG9>
YouTube Live Streams:
https://www.youtube.com/user/holdenkarau
<https://www.youtube.com/user/holdenkarau>



-- 
Twitter: https://twitter.com/holdenkarau

<https://twitter.com/holdenkarau>
Books (Learning Spark, High Performance Spark, etc.):
https://amzn.to/2MaRAG9 <https://amzn.to/2MaRAG9>
YouTube Live Streams: https://www.youtube.com/user/holdenkarau
<https://www.youtube.com/user/holdenkarau>



--
Twitter: https://twitter.com/holdenkarau <https://twitter.com/holdenkarau>
Books (Learning Spark, High Performance Spark, etc.): 
https://amzn.to/2MaRAG9 <https://amzn.to/2MaRAG9>
YouTube Live Streams: https://www.youtube.com/user/holdenkarau 
<https://www.youtube.com/user/holdenkarau>


--
Best regards,
Maciej Szymkiewicz

Web: https://zero323.net
PGP: A30CEF0C31A501EC



OpenPGP_signature
Description: OpenPGP digital signature


Re: [VOTE][SPIP] Better Spark UI scalability and Driver stability for large applications

2022-11-16 Thread Maciej

+1

On 11/16/22 13:19, Yuming Wang wrote:

+1, non-binding

On Wed, Nov 16, 2022 at 8:12 PM Yang,Jie(INF) <mailto:yangji...@baidu.com>> wrote:


+1, non-binding

__ __

Yang Jie

__ __

*发件人**: *Mridul Muralidharan mailto:mri...@gmail.com>>
*日期**: *2022年11月16日星期三17:35
*收件人**: *Kent Yao mailto:y...@apache.org>>
*抄送**: *Gengliang Wang mailto:ltn...@gmail.com>>, dev mailto:dev@spark.apache.org>>
*主题**: *Re: [VOTE][SPIP] Better Spark UI scalability and Driver
stability for large applications

__ __

__ __

+1

__ __

Would be great to see history server performance improvements and
lower resource utilization at driver !

__ __

Regards,

Mridul 

__ __

On Wed, Nov 16, 2022 at 2:38 AM Kent Yao mailto:y...@apache.org>> wrote:

+1, non-binding

Gengliang Wang mailto:ltn...@gmail.com>> 于
2022年11月16日周三16:36写道:
>
> Hi all,
>
> I’d like to start a vote for SPIP: "Better Spark UI scalability and Driver 
stability for large applications"
>
> The goal of the SPIP is to improve the Driver's stability by 
supporting storing Spark's UI data on RocksDB. Furthermore, to fasten the read and 
write operations on RocksDB, it introduces a new Protobuf serializer.
>
> Please also refer to the following:
>
> Previous discussion in the dev mailing list: [DISCUSS] SPIP: Better 
Spark UI scalability and Driver stability for large applications
> Design Doc: Better Spark UI scalability and Driver stability for 
large applications
> JIRA: SPARK-41053
>
>
> Please vote on the SPIP for the next 72 hours:
>
> [ ] +1: Accept the proposal as an official SPIP
> [ ] +0
> [ ] -1: I don’t think this is a good idea because …
>
> Kind Regards,
> Gengliang

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
<mailto:dev-unsubscr...@spark.apache.org>



--
Best regards,
Maciej Szymkiewicz

Web: https://zero323.net
PGP: A30CEF0C31A501EC



OpenPGP_signature
Description: OpenPGP digital signature


Re: Is it possible to specify explicitly map() key/value types?

2022-08-27 Thread Maciej

Hi Alex,

You can cast the initial value to the desired type

val mergeExpr = expr("aggregate(data, cast(map() as map<string, string>), (acc, i) -> map_concat(acc, i))")


On 8/27/22 13:06, Alexandros Biratsis wrote:

Hello folks,

I would like to ask Spark devs if it is possible to explicitly define 
the key/value types for a map (Spark 3.3.0) as shown below:


import org.apache.spark.sql.functions.{expr, collect_list}

val df = Seq(
  (1, Map("k1" -> "v1", "k2" -> "v3")),
  (1, Map("k3" -> "v3")),
  (2, Map("k4" -> "v4")),
  (2, Map("k6" -> "v6", "k5" -> "v5"))
).toDF("id", "data")

val mergeExpr = expr("aggregate(data, map(), (acc, i) -> map_concat(acc, i))")

df.groupBy("id").agg(collect_list("data").as("data"))
  .select($"id", mergeExpr.as("merged_data"))
  .show(false)


The above code throws the next error:

AnalysisException: cannot resolve 'aggregate(`data`, map(),
lambdafunction(map_concat(namedlambdavariable(),
namedlambdavariable()), namedlambdavariable(),
namedlambdavariable()), lambdafunction(namedlambdavariable(),
namedlambdavariable()))' due to data type mismatch: argument 3
requires map type, however,
'lambdafunction(map_concat(namedlambdavariable(),
namedlambdavariable()), namedlambdavariable(),
namedlambdavariable())' is of map type.; Project
[id#110, aggregate(data#119, map(),
lambdafunction(map_concat(cast(lambda acc#122 as
map), lambda i#123), lambda acc#122, lambda i#123,
false), lambdafunction(lambda id#124, lambda id#124, false)) AS
aggregate(data, map(),
lambdafunction(map_concat(namedlambdavariable(),
namedlambdavariable()), namedlambdavariable(),
namedlambdavariable()), lambdafunction(namedlambdavariable(),
namedlambdavariable()))#125] +- Aggregate [id#110], [id#110,
collect_list(data#111, 0, 0) AS data#119] +- Project [_1#105 AS
id#110, _2#106 AS data#111] +- LocalRelation [_1#105, _2#106]


It seems that map() is initialised as map<void, void> when 
map<string, string> is expected. I believe that the behaviour has changed 
since 2.4.5 where map() was initialised as map<string, string>, and the 
previous example was working.


Is it possible to create a map by specifying the key-value type explicitly?

So far, I came up with a workaround using map('', '') to initialise the 
map for string key-value and using map_filter() to exclude/remove the 
redundant map('', '') key-value item:


val mergeExpr = expr("map_filter(aggregate(data, map('', ''), (acc, i) -> map_concat(acc, i)), (k, v) -> k != '')")


Thank you for your help

Greetings,
Alex






--
Best regards,
Maciej Szymkiewicz

Web: https://zero323.net
PGP: A30CEF0C31A501EC



OpenPGP_signature
Description: OpenPGP digital signature


Re: [DISCUSS] [Spark SQL, PySpark] Combining StructTypes into a new StructType

2022-08-14 Thread Maciej
I have mixed feelings about this proposal. Merging or diffing schemas is 
a common operation, but specific requirements differ from case to case, 
especially when complex nested data is used.


Even if we put ordering of the fields aside, data type equality 
semantics (for StructField in particular) are likely to result in an 
implementation which is either confusing or has limited applicability.


Additionally, Scala StructType is already a Seq[StructField] and as such 
provides set-like operations (contains, diff, intersect, union) as well 
as implementations of ++ / :+ / +: so we cannot do much here, without 
breaking the existing API.
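
Just to illustrate the point with the existing PySpark API (nothing new 
here ‒ which variant is "correct" depends entirely on the use case):

from pyspark.sql.types import StructType, StructField, StringType, IntegerType

a = StructType([StructField("field_a", StringType())])
b = StructType([StructField("field_a", StringType()),
                StructField("field_b", IntegerType())])

# Plain concatenation, as in the original example ‒ keeps the duplicate
concatenated = StructType(a.fields + b.fields)

# "Union" by field name ‒ drops fields already present in a
merged = StructType(
    a.fields + [f for f in b.fields if f.name not in a.fieldNames()])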


On 8/14/22 11:30, Alexandros Biratsis wrote:

Hello Rui and Tim,

Indeed this sound a good idea and quite useful. To make it more formal 
the list of a StructType could be treated as a Scala/Python set by 
providing(inheriting?) the common sets' functionality i.e add, remove, 
concat, intersect, diff etc. The set like functionality could be part of 
StructType class for both languages.


The Scala set collection 
https://www.scala-lang.org/api/2.13.x/scala/collection/immutable/Set.html <https://www.scala-lang.org/api/2.13.x/scala/collection/immutable/Set.html>


Best,
Alex

On Wed, Aug 10, 2022, 08:14 Rui Wang <mailto:amaliu...@apache.org>> wrote:


Thanks for the idea!

I am thinking that the usage of "combined = StructType( a.fields +
b.fields)" is still good because
1) it is not horrible to merge a and b in this way.
2) itself clarifies the intention which is merge two struct's fields
to construct a new struct
3) you also have room to apply more complicated operations on fields
merging. For example remove duplicate files with the same name or
use a.fields but remove some fields if they are in b.

overloading "+" could be
1. it's ambiguous on what this plus is doing.
2. If you define + is a concatenation on the fields, then it's
limited to only do the concatenation. How about other operations
like extract fields from a based on b? Maybe overloading "-"? In
this case the item list will grow.

-Rui

On Tue, Aug 9, 2022 at 1:10 PM Tim mailto:bosse...@posteo.de>> wrote:

Hi all,

this is my first message to the Spark mailing list, so please
bear with
me if I don't fully meet your communication standards.
I just wanted to discuss one aspect that I've stumbled across
several
times over the past few weeks.
When working with Spark, I often run into the problem of having
to merge
two (or more) existing StructTypes into a new one to define a
schema.
Usually this looks similar (in Python) to the following simplified
example:

          a = StructType([StuctField("field_a", StringType())])
          b = StructType([StructField("field_b", IntegerType())])

          combined = StructType( a.fields + b.fields)

My idea, which I would like to discuss, is to shorten the above
example
in Python as follows by supporting Python's add operator for
StructTypes:

          combined = a + b


What do you think of this idea? Are there any reasons why this
is not
yet part of StructType's functionality?
If you support this idea, I could create a first PR for further and
deeper discussion.

Best
Tim

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
    <mailto:dev-unsubscr...@spark.apache.org>



--
Best regards,
Maciej Szymkiewicz

Web: https://zero323.net
PGP: A30CEF0C31A501EC



OpenPGP_signature
Description: OpenPGP digital signature


Re: Welcome Xinrong Meng as a Spark committer

2022-08-10 Thread Maciej

Congratulations Xinrong!

On 8/10/22 07:00, Rui Wang wrote:

Congrats Xinrong!


-Rui

On Tue, Aug 9, 2022 at 8:57 PM Xingbo Jiang <mailto:jiangxb1...@gmail.com>> wrote:


Congratulations!

Yuanjian Li mailto:xyliyuanj...@gmail.com>>
于2022年8月9日 周二20:31写道:

Congratulations, Xinrong!

XiDuo You mailto:ulyssesyo...@gmail.com>>于2022年8月9日 周二19:18写道:

Congratulations!

Haejoon Lee mailto:haejoon@databricks.com>.invalid> 于2022年8月10日
周三 09:30写道:
 >
 > Congrats, Xinrong!!
 >
 > On Tue, Aug 9, 2022 at 5:12 PM Hyukjin Kwon
mailto:gurwls...@gmail.com>> wrote:
 >>
 >> Hi all,
 >>
 >> The Spark PMC recently added Xinrong Meng as a committer
on the project. Xinrong is the major contributor of PySpark
especially Pandas API on Spark. She has guided a lot of new
contributors enthusiastically. Please join me in welcoming
Xinrong!
 >>


-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
    <mailto:dev-unsubscr...@spark.apache.org>



--
Best regards,
Maciej Szymkiewicz

Web: https://zero323.net
PGP: A30CEF0C31A501EC



OpenPGP_signature
Description: OpenPGP digital signature


Re: Welcoming three new PMC members

2022-08-10 Thread Maciej

Congratulations!

On 8/10/22 08:14, Yi Wu wrote:

Congrats everyone!



On Wed, Aug 10, 2022 at 11:33 AM Yuanjian Li <mailto:xyliyuanj...@gmail.com>> wrote:


Congrats everyone!

L. C. Hsieh mailto:vii...@gmail.com>>于2022年8月9
日 周二19:01写道:

Congrats!

On Tue, Aug 9, 2022 at 5:38 PM Chao Sun mailto:sunc...@apache.org>> wrote:
 >
 > Congrats everyone!
 >
 > On Tue, Aug 9, 2022 at 5:36 PM Dongjoon Hyun
mailto:dongjoon.h...@gmail.com>> wrote:
 > >
 > > Congrat to all!
 > >
 > > Dongjoon.
 > >
 > > On Tue, Aug 9, 2022 at 5:13 PM Takuya UESHIN
mailto:ues...@happy-camper.st>> wrote:
 > > >
 > > > Congratulations!
 > > >
 > > > On Tue, Aug 9, 2022 at 4:57 PM Hyukjin Kwon
mailto:gurwls...@gmail.com>> wrote:
 > > >>
 > > >> Congrats everybody!
 > > >>
 > > >> On Wed, 10 Aug 2022 at 05:50, Mridul Muralidharan
mailto:mri...@gmail.com>> wrote:
 > > >>>
 > > >>>
 > > >>> Congratulations !
 > > >>> Great to have you join the PMC !!
 > > >>>
 > > >>> Regards,
 > > >>> Mridul
 > > >>>
 > > >>> On Tue, Aug 9, 2022 at 11:57 AM vaquar khan
mailto:vaquar.k...@gmail.com>> wrote:
 > > >>>>
 > > >>>> Congratulations
 > > >>>>
 > > >>>> On Tue, Aug 9, 2022, 11:40 AM Xiao Li
mailto:gatorsm...@gmail.com>> wrote:
 > > >>>>>
 > > >>>>> Hi all,
 > > >>>>>
 > > >>>>> The Spark PMC recently voted to add three new PMC
members. Join me in welcoming them to their new roles!
 > > >>>>>
 > > >>>>> New PMC members: Huaxin Gao, Gengliang Wang and Maxim
Gekk
 > > >>>>>
 > > >>>>> The Spark PMC
 > > >
 > > >
 > > >
 > > > --
 > > > Takuya UESHIN
 > > >
 > >
 > >
-----
 > > To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
<mailto:dev-unsubscr...@spark.apache.org>
 > >
 >
 >
-
 > To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
<mailto:dev-unsubscr...@spark.apache.org>
 >

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
<mailto:dev-unsubscr...@spark.apache.org>



--
Best regards,
Maciej Szymkiewicz

Web: https://zero323.net
PGP: A30CEF0C31A501EC



OpenPGP_signature
Description: OpenPGP digital signature


Re: Introducing "Pandas API on Spark" component in JIRA, and use "PS" PR title component

2022-05-17 Thread Maciej
Sounds good!

+1

On 5/17/22 06:08, Yikun Jiang wrote:
> It's a pretty good idea, +1.
> 
> To be clear in Github:
> 
> - For each PR Title: [SPARK-XXX][PYTHON][PS] The Pandas on spark pr title
> (*still keep [PYTHON]* and [PS] new added)
> 
> - For PR label: new added: `PANDAS API ON Spark`, still keep: `PYTHON`,
> `CORE`
> (*still keep `PYTHON`, `CORE`* and `PANDAS API ON SPARK` new added)
> https://github.com/apache/spark/pull/36574
> <https://github.com/apache/spark/pull/36574>
> 
> Right?
> 
> Regards,
> Yikun
> 
> 
> On Tue, May 17, 2022 at 11:26 AM Hyukjin Kwon  <mailto:gurwls...@gmail.com>> wrote:
> 
> Hi all,
> 
> What about we introduce a component in JIRA "Pandas API on Spark",
> and use "PS"  (pandas-on-Spark) in PR titles? We already use "ps" in
> many places when we: import pyspark.pandas as ps.
> This is similar to "Structured Streaming" in JIRA, and "SS" in PR title.
> 
> I think it'd be easier to track the changes here with that.
> Currently it's a bit difficult to identify it from pure PySpark changes.
> 


-- 
Best regards,
Maciej Szymkiewicz

Web: https://zero323.net
PGP: A30CEF0C31A501EC


OpenPGP_signature
Description: OpenPGP digital signature


Re: Apache Spark 3.3 Release

2022-04-29 Thread Maciej
[SPARK-34863][SQL] Support complex
> types for Parquet vectorized reader
> >> > >> #35848 [SPARK-38548][SQL] New SQL function:
> try_sum
> >> > >>
> >> > >> Do you mean we should include them, or
> exclude them from 3.3?
> >> > >>
> >> > >> Thanks,
> >> > >> Chao
> >> > >>
> >> > >> On Tue, Mar 15, 2022 at 9:56 AM Dongjoon
> Hyun  <mailto:dongjoon.h...@gmail.com>> wrote:
> >> > >> >
> >> > >> > The following was tested and merged a few
> minutes ago. So, we can remove it from the list.
> >> > >> >
> >> > >> > #35819 [SPARK-38524][SPARK-38553][K8S]
> Bump Volcano to v1.5.1
> >> > >> >
> >> > >> > Thanks,
> >> > >> > Dongjoon.
> >> > >> >
> >> > >> > On Tue, Mar 15, 2022 at 9:48 AM Xiao Li
> mailto:gatorsm...@gmail.com>>
> wrote:
> >> > >> >>
> >> > >> >> Let me clarify my above suggestion. Maybe
> we can wait 3 more days to collect the list of
> actively developed PRs that we want to merge to 3.3
> after the branch cut?
>     >> > >> >>
> >> > >> >> Please do not rush to merge the PRs that
> are not fully reviewed. We can cut the branch this
> Friday and continue merging the PRs that have been
> discussed in this thread. Does that make sense?
> >> > >> >>
> >> > >> >> Xiao
> >> > >> >>
> >> > >> >>
> >> > >> >>
> >> > >> >> Holden Karau  <mailto:hol...@pigscanfly.ca>> 于2022年3月15日周二
> 09:10写道:
> >> > >> >>>
> >> > >> >>> May I suggest we push out one week
> (22nd) just to give everyone a bit of breathing
> space? Rushed software development more often
> results in bugs.
> >> > >> >>>
> >> > >> >>> On Tue, Mar 15, 2022 at 6:23 AM Yikun
> Jiang  <mailto:yikunk...@gmail.com>> wrote:
> >> > >> >>>>
> >> > >> >>>> > To make our release time more
> predictable, let us collect the PRs and wait three
> more days before the branch cut?
> >> > >> >>>>
> >> > >> >>>> For SPIP: Support Customized Kubernetes
> Schedulers:
> >> > >> >>>> #35819 [SPARK-38524][SPARK-38553][K8S]
> Bump Volcano to v1.5.1
> >> > >> >>>>
> >> > >> >>>> Three more days are OK for this from my
> view.
> >> > >> >>>>
> >> > >> >>>> Regards,
> >> > >> >>>> Yikun
> >> > >> >>>
> >> > >> >>> --
> >> > >> >>> Twitter: https://twitter.com/holdenkarau
> <https://twitter.com/holdenkarau>
> >> > >> >>> Books (Learning Spark, High Performance
> Spark, etc.): https://amzn.to/2MaRAG9
> <https://amzn.to/2MaRAG9>
> >> > >> >>> YouTube Live Streams:
> https://www.youtube.com/user/holdenkarau
> <https://www.youtube.com/user/holdenkarau>
> >
> >
> >
> > --
> > Twitter: https://twitter.com/holdenkarau
> <https://twitter.com/holdenkarau>
> > Books (Learning Spark, High Performance Spark,
> etc.): https://amzn.to/2MaRAG9 <https://amzn.to/2MaRAG9>
> > YouTube Live Streams:
> https://www.youtube.com/user/holdenkarau
> <https://www.youtube.com/user/holdenkarau>
> 
> 
> -
> To unsubscribe e-mail:
> dev-unsubscr...@spark.apache.org
> <mailto:dev-unsubscr...@spark.apache.org>
> 


-- 
Best regards,
Maciej Szymkiewicz

Web: https://zero323.net
PGP: A30CEF0C31A501EC


OpenPGP_signature
Description: OpenPGP digital signature


Re: Apache Spark 3.3 Release

2022-03-06 Thread Maciej
Ideally, we should complete these

- [SPARK-37093] Inline type hints python/pyspark/streaming
- [SPARK-37395] Inline type hint files for files in python/pyspark/ml
- [SPARK-37396] Inline type hint files for files in python/pyspark/mllib

All tasks have either a PR in progress or someone working on one, so
the limiting factor is our ability to review these.

On 3/3/22 19:44, Maxim Gekk wrote:
> Hello All,
> 
> I would like to bring on the table the theme about the new Spark release
> 3.3. According to the public schedule at
> https://spark.apache.org/versioning-policy.html
> <https://spark.apache.org/versioning-policy.html>, we planned to start
> the code freeze and release branch cut on March 15th, 2022. Since this
> date is coming soon, I would like to take your attention on the topic
> and gather objections that you might have.
> 
> Bellow is the list of ongoing and active SPIPs:
> 
> Spark SQL:
> - [SPARK-31357] DataSourceV2: Catalog API for view metadata
> - [SPARK-35801] Row-level operations in Data Source V2
> - [SPARK-37166] Storage Partitioned Join
> 
> Spark Core:
> - [SPARK-20624] Add better handling for node shutdown
> - [SPARK-25299] Use remote storage for persisting shuffle data
> 
> PySpark:
> - [SPARK-26413] RDD Arrow Support in Spark Core and PySpark
> 
> Kubernetes:
> - [SPARK-36057] Support Customized Kubernetes Schedulers
> 
> Probably, we should finish if there are any remaining works for Spark
> 3.3, and switch to QA mode, cut a branch and keep everything on track. I
> would like to volunteer to help drive this process.
> 
> Best regards,
> Max Gekk


-- 
Best regards,
Maciej Szymkiewicz

Web: https://zero323.net
PGP: A30CEF0C31A501EC


OpenPGP_signature
Description: OpenPGP digital signature


Re: [How To] run test suites for specific module

2022-01-24 Thread Maciej
Hi,

Please check the relevant section of the developer tools docs:

https://spark.apache.org/developer-tools.html#running-individual-tests
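
In short ‒ -Dtest targets the Java tests (Surefire), while Scala suites are 
selected with -DwildcardSuites (ScalaTest plugin), which is why -Dtest alone 
didn't narrow things down. For a suite in sql/core it should look roughly 
like this (please double-check against the page above):

build/mvn test -Dtest=none -DwildcardSuites=org.apache.spark.sql.XXXSuite

or with sbt:

build/sbt "sql/testOnly *XXXSuite"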

On 1/25/22 00:44, Fangjia Shen wrote:
> Hello all,
> 
> How do you run Spark's test suites when you want to test the correctness
> of your code? Is there a way to run a specific test suite for Spark? For
> example, running test suite XXXSuite alone, instead of every class under
> the test/ directories.
> 
> Here's some background info about what I want to do: I'm a graduate
> student trying to study Spark's design and find ways to improve Spark's
> performance by doing Software/Hardware co-design. I'm relatively new to
> Maven and so far struggling to find to a way to properly run Spark's own
> test suites.
> 
> Let's say I did some modifications to a XXXExec node which belongs to
> the org.apache.spark.sql package. I want to see if my design passes the
> test cases. What should I do?
> 
> 
> What command should I use:
> 
>  */build/mvn test *  or  */dev/run-tests*  ?
> 
> And where should I run that command:
> 
>     **  or  ** ? - where  is where
> the modified scala file is located, e.g. "/sql/core/".
> 
> 
> I tried adding -Dtest=XXXSuite to *mvn test *but still get to run tens
> of thousands of tests. This is taking way too much time and unbearable
> if I'm just modifying a few file in a specific module.
> 
> I would really appreciate any suggestion or comment.
> 
> 
> Best regards,
> 
> Fangjia Shen
> 
> Purdue University
> 
> 
> 


-- 
Best regards,
Maciej Szymkiewicz

Web: https://zero323.net
PGP: A30CEF0C31A501EC


OpenPGP_signature
Description: OpenPGP digital signature


Re: [VOTE] Release Spark 3.2.1 (RC2)

2022-01-21 Thread Maciej
   PySpark you can set up a virtual env and
> install the current RC and see if
> anything important breaks, in the
> Java/Scala you can add the staging
> repository to your projects resolvers
> and test with the RC (make sure to clean
> up the artifact cache before/after so
> you don't end up building with a out of
> date RC going forward).
> ===
> What should happen to JIRA tickets still
> targeting 3.2.1?
> ===
> The current list of open tickets
> targeted at 3.2.1 can be found at:
> https://issues.apache.org/jira/projects/SPARK
> 
> <https://issues.apache.org/jira/projects/SPARK>and
> search for "Target Version/s" = 3.2.1
> Committers should look at those and
> triage. Extremely important bug fixes,
> documentation, and API tweaks that
> impact compatibility should be worked on
> immediately. Everything else please
> retarget to an appropriate release.
> == But my bug isn't
> fixed? == In order to
> make timely releases, we will typically
> not hold the release unless the bug in
> question is a regression from the
> previous release. That being said, if
> there is something which is a regression
> that has not been correctly targeted
> please ping me or a committer to help
> target the issue.
> 
> 
> 
> -- 
> Bjørn Jørgensen
> Vestre Aspehaug 4, 6010 Ålesund
> Norge
> 
> +47 480 94 297
> 
> 
> 
> -- 
> Bjørn Jørgensen
> Vestre Aspehaug 4, 6010 Ålesund
> Norge
> 
> +47 480 94 297
> 
> 
> 
> -- 
> Bjørn Jørgensen
> Vestre Aspehaug 4, 6010 Ålesund
> Norge
> 
> +47 480 94 297
> 


-- 
Best regards,
Maciej Szymkiewicz

Web: https://zero323.net
PGP: A30CEF0C31A501EC


OpenPGP_signature
Description: OpenPGP digital signature


Re: PySpark Dynamic DataFrame for easier inheritance

2021-12-29 Thread Maciej
On 12/29/21 16:18, Pablo Alcain wrote:
> Hey Maciej! Thanks for your answer and the comments :) 
> 
> On Wed, Dec 29, 2021 at 3:06 PM Maciej  <mailto:mszymkiew...@gmail.com>> wrote:
> 
> This seems like a lot of trouble for not so common use case that has
> viable alternatives. Once you assume that class is intended for
> inheritance (which, arguably we neither do or imply a the moment) you're
> even more restricted that we are right now, according to the project
> policy and need for keeping things synchronized across all languages.
> 
> By "this" you mean the modification of the DataFrame, the implementation
> of a new pyspark class (DynamicDataFrame in this case) or the approach
> in general?

I mean promoting DataFrame as extensible in general. It risks getting
us stuck with a specific API, even more than we are right now, with
little reward at the end.

Additionally:

- As far as I am aware, nothing suggests that it is a widely requested
feature (corresponding SO questions didn't get much traffic over the
years and I don't think we have any preceding JIRA tickets).
- It can be addressed outside the project (within user codebase or as a
standalone package) with minimal or no overhead.

That being said ‒ if we're going to rewrite Python DataFrame methods to
return instance type, I strongly believe that the existing methods
should be marked as final.

>  
> 
> 
> On Scala side, I would rather expect to see type classes than direct
> inheritance so this might be a dead feature from the start.
> 
> As of Python (sorry if I missed something in the preceding discussion),
> quite natural approach would be to wrap DataFrame instance in your
> business class and delegate calls to the wrapped object. A very naive
> implementation could look like this
> 
> from functools import wraps
> from pyspark.sql import DataFrame
> 
> class BusinessModel:
>     @classmethod
>     def delegate(cls, a):
>         def _(*args, **kwargs):
>             result = a(*args, **kwargs)
>             if isinstance(result, DataFrame):
>                 return  cls(result)
>             else:
>                 return result
> 
>         if callable(a):
>             return wraps(a)(_)
>         else:
>             return a
> 
>     def __init__(self, df):
>         self._df = df
> 
>     def __getattr__(self, name):
>         return BusinessModel.delegate(getattr(self._df, name))
> 
>     def with_price(self, price=42):
>         return self.selectExpr("*", f"{price} as price")
> 
> 
> 
> Yes, effectively the solution is very similar to this one. I believe
> that the advantage of doing it without hijacking with the decorator the
> delegation is that you can still maintain static typing.

You can maintain type checker compatibility (it is easier with stubs,
but you can do it with inline hints as well, if I recall correctly) here
as well.

> On the other
> hand (and this is probably a minor issue), when following this approach
> with the `isinstance` checking for the casting you might end up casting
> the `.summary()` and `.describe()` methods that probably you want still
> to keep as "pure" DataFrames. If you see it from this perspective, then
> "DynamicDataFrame" would be the boilerplate code that allows you to
> decide more granularly what methods you want to delegate.

You can do it with `__getattr__` as well. There are probably some edge
cases (especially when accessing columns with `.`), but it should still
be manageable.


Just to be clear ‒ I am not insisting that this is somehow a superior
solution (there are things that cannot be done through delegation).

> 
> (BusinessModel(spark.createDataFrame([(1, "DEC")], ("id", "month")))
>     .select("id")
>     .with_price(0.0)
>     .select("price")
>     .show())
> 
> 
> but it can be easily adjusted to handle more complex uses cases,
> including inheritance.
> 
> 
> 
> On 12/29/21 12:54, Pablo Alcain wrote:
> > Hey everyone! I'm re-sending this e-mail, now with a PR proposal
> > (https://github.com/apache/spark/pull/35045
> <https://github.com/apache/spark/pull/35045>
> > <https://github.com/apache/spark/pull/35045
> <https://github.com/apache/spark/pull/35045>> if you want to take a look
> > at the code with a couple of examples). The proposed change includes
> > only a new class that would extend only the Python API without
> doing any
> > change to the underlying scala code. The benefit would be that the new
> > code only e

Re: PySpark Dynamic DataFrame for easier inheritance

2021-12-29 Thread Maciej
aFrame`, inherit
> from `DataFrame` and implement the methods there. This one solves
> all the issues, but with a caveat: the chainable methods cast the
> result explicitly to `DataFrame` (see
> 
> https://github.com/apache/spark/blob/master/python/pyspark/sql/dataframe.py#L1910
> 
> <https://github.com/apache/spark/blob/master/python/pyspark/sql/dataframe.py#L1910>
> e g). Therefore, everytime you use one of the parent's methods you'd
> have to re-cast to `MyBusinessDataFrame`, making the code cumbersome.
> 
> In view of these pitfalls we decided to go for a slightly different
> approach, inspired by #3: We created a class called
> `DynamicDataFrame` that overrides the explicit call to `DataFrame`
> as done in PySpark but instead casted dynamically to
> `self.__class__` (see
> 
> https://gist.github.com/pabloalcain/de79938507ad2d823a866238b3c8a66e#file-dynamic_dataframe_minimal-py-L21
> 
> <https://gist.github.com/pabloalcain/de79938507ad2d823a866238b3c8a66e#file-dynamic_dataframe_minimal-py-L21>
> e g). This allows the fluent methods to always keep the same class,
> making chainability as smooth as it is with pyspark dataframes.
> 
> As an example implementation, here's a link to a gist
> (https://gist.github.com/pabloalcain/de79938507ad2d823a866238b3c8a66e 
> <https://gist.github.com/pabloalcain/de79938507ad2d823a866238b3c8a66e>)
> that implemented dynamically `withColumn` and `select` methods and
> the expected output.
> 
> I'm sharing this here in case you feel like this approach can be
> useful for anyone else. In our case it greatly sped up the
> development of abstraction layers and allowed us to write cleaner
> code. One of the advantages is that it would simply be a "plugin"
> over pyspark, that does not modify anyhow already existing code or
> application interfaces.
> 
> If you think that this can be helpful, I can write a PR as a more
> refined proof of concept.
> 
> Thanks!
> 
> Pablo
> 


-- 
Best regards,
Maciej Szymkiewicz

Web: https://zero323.net
PGP: A30CEF0C31A501EC


OpenPGP_signature
Description: OpenPGP digital signature


[R] SparkR on conda-forge

2021-12-19 Thread Maciej
Hi everyone,

FYI ‒ thanks to the good folks from conda-forge, we now have these:

  * https://github.com/conda-forge/r-sparkr-feedstock
  * https://anaconda.org/conda-forge/r-sparkr
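
Installation should be as simple as, for example:

conda install -c conda-forge r-sparkr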

-- 
Best regards,
Maciej Szymkiewicz

Web: https://zero323.net
PGP: A30CEF0C31A501EC


OpenPGP_signature
Description: OpenPGP digital signature


Re: [MISC] Should we add .github/FUNDING.yml

2021-12-15 Thread Maciej
Makes sense. Thanks!

On 12/15/21 21:36, Jungtaek Lim wrote:
> If ASF wants to do it, INFRA could probably deal with it for entire
> projects, like ASF code of conduct being exposed to the right side of
> the all ASF github repos recently.
>
> On Wed, Dec 15, 2021 at 11:49 PM Sean Owen  wrote:
>
> It might imply that this is a way to fund Spark alone, and it
> isn't. Probably no big deal either way but maybe not worth it. It
> won't be a mystery how to find and fund the ASF for the few orgs
> that want to, as compared to a small project
>
> On Wed, Dec 15, 2021, 8:34 AM Maciej  wrote:
>
> Hi All,
>
> Just wondering ‒ would it make sense to add
> .github/FUNDING.yml with custom link pointing to one (or both)
> of these:
>
>   * https://www.apache.org/foundation/sponsorship.html
>   * https://www.apache.org/foundation/contributing.html
>
>
> -- 
> Best regards,
> Maciej Szymkiewicz
>
>     Web: https://zero323.net
> PGP: A30CEF0C31A501EC
>

-- 
Best regards,
Maciej Szymkiewicz

Web: https://zero323.net
PGP: A30CEF0C31A501EC


OpenPGP_signature
Description: OpenPGP digital signature


[MISC] Should we add .github/FUNDING.yml

2021-12-15 Thread Maciej
Hi All,

Just wondering ‒ would it make sense to add .github/FUNDING.yml with
custom link pointing to one (or both) of these:

  * https://www.apache.org/foundation/sponsorship.html
  * https://www.apache.org/foundation/contributing.html


-- 
Best regards,
Maciej Szymkiewicz

Web: https://zero323.net
PGP: A30CEF0C31A501EC


OpenPGP_signature
Description: OpenPGP digital signature


Nabble archive is down

2021-08-17 Thread Maciej
Hi everyone,

It seems like Nabble is downsizing and nX.nabble.com servers, including
the one with the Spark user and dev lists, are already down. Do we plan to
ask them to preserve the content (I haven't seen any related requests on
their support forum), or should we update the website links to point to the
ASF archives?

-- 
Best regards,
Maciej Szymkiewicz

Web: https://zero323.net
Keybase: https://keybase.io/zero323
Gigs: https://www.codementor.io/@zero323
PGP: A30CEF0C31A501EC




OpenPGP_signature
Description: OpenPGP digital signature


Re: Time to start publishing Spark Docker Images?

2021-08-17 Thread Maciej
You're right, but with the native dependencies (this is the case for the
packages I've mentioned before) we have to bundle complete environments.
It is doable, but if you do that, you're actually better off with a base
image. I don't insist it is something we have to address right now, just
something to keep in mind.

On 8/17/21 10:05 AM, Mich Talebzadeh wrote:
> Of course with PySpark, there is the option of putting your
> packages in gz format and send them at spark-submit time
>
> --conf "spark.yarn.dist.archives"=pyspark_venv.tar.gz#environment \
>
> However, in the Kubernetes cluster that file is going to be fairly
> massive  and will take time to unzip and share. The interpreter will
> be what it comes with the docker image!
>
>
>
>
>
>
>   ** view my Linkedin profile
> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>
>  
>
> *Disclaimer:* Use it at your own risk.Any and all responsibility for
> any loss, damage or destruction of data or any other property which
> may arise from relying on this email's technical content is explicitly
> disclaimed. The author will in no case be liable for any monetary
> damages arising from such loss, damage or destruction.
>
>  
>
>
>
> On Mon, 16 Aug 2021 at 18:46, Maciej  <mailto:mszymkiew...@gmail.com>> wrote:
>
> I have a few concerns regarding PySpark and SparkR images.
>
> First of all, how do we plan to handle interpreter versions?
> Ideally, we should provide images for all supported variants, but
> based on the preceding discussion and the proposed naming
> convention, I assume it is not going to happen. If that's the
> case, it would be great if we could fix interpreter versions based
> on some support criteria (lowest supported, lowest non-deprecated,
> highest supported at the time of release, etc.)
>
> Currently, we use the following:
>
>   * for R use buster-cran35 Debian repositories which install R
> 3.6 (provided version already changed in the past and broke
> image build ‒ SPARK-28606).
>   * for Python we depend on the system provided python3 packages,
> which currently provides Python 3.7.
>
> which don't guarantee stability over time and might be hard to
> synchronize with our support matrix.
>
> Secondly, omitting libraries which are required for the full
> functionality and performance, specifically
>
>   * Numpy, Pandas and Arrow for PySpark
>   * Arrow for SparkR
>
> is likely to severely limit usability of the images (out of these,
> Arrow is probably the hardest to manage, especially when you
> already depend on system packages to provide R or Python interpreter).
>
>
> On 8/14/21 12:43 AM, Mich Talebzadeh wrote:
>> Hi,
>>
>> We can cater for multiple types (spark, spark-py and spark-r) and
>> spark versions (assuming they are downloaded and available).
>> The challenge is that these docker images built are snapshots.
>> They cannot be amended later and if you change anything by going
>> inside docker, as soon as you are logged out whatever you did is
>> reversed.
>>
>> For example, I want to add tensorflow to my docker image. These
>> are my images
>>
>> REPOSITORY                                TAG           IMAGE ID 
>>      CREATED         SIZE
>> eu.gcr.io/axial-glow-224522/spark-py
>> <http://eu.gcr.io/axial-glow-224522/spark-py>      java8_3.1.1 
>>  cfbb0e69f204   5 days ago      2.37GB
>> eu.gcr.io/axial-glow-224522/spark
>> <http://eu.gcr.io/axial-glow-224522/spark>         3.1.1       
>>  8d1bf8e7e47d   5 days ago      805MB
>>
>> using image ID I try to log in as root to the image
>>
>> *docker run -u0 -it cfbb0e69f204 bash*
>>
>> root@b542b0f1483d:/opt/spark/work-dir# pip install keras
>> Collecting keras
>>   Downloading keras-2.6.0-py2.py3-none-any.whl (1.3 MB)
>>      || 1.3 MB 1.1 MB/s
>> Installing collected packages: keras
>> Successfully installed keras-2.6.0
>> WARNING: Running pip as the 'root' user can result in broken
>> permissions and conflicting behaviour with the system package
>> manager. It is recommended to use a virtual environment instead:
>> https://pip.pypa.io/warnings/venv <https://pip.pypa.io/warnings/venv>
>> root@b542b0f1483d:/opt/spark/work-dir# pip list
>> Package       Version
>> - ---
>> asn1crypto    0.24.0
>> cryptography  2.6

Re: Time to start publishing Spark Docker Images?

2021-08-16 Thread Maciej
ub.docker.com/r/rayproject/ray
> 
> <https://eur02.safelinks.protection.outlook.com/?url=https%3A%2F%2Fhub.docker.com%2Fr%2Frayproject%2Fray=04%7C01%7CMeikel.Bode%40bertelsmann.de%7Cd97d97be540246aa975308d95e260c99%7C1ca8bd943c974fc68955bad266b43f0b%7C0%7C0%7C637644339790709619%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000=%2F%2BPp69I10cyEeSTp6POoNZObOpkkzcZfB35vcdkR8P8%3D=0>
> https://hub.docker.com/u/daskdev
> 
> <https://eur02.safelinks.protection.outlook.com/?url=https%3A%2F%2Fhub.docker.com%2Fu%2Fdaskdev=04%7C01%7CMeikel.Bode%40bertelsmann.de%7Cd97d97be540246aa975308d95e260c99%7C1ca8bd943c974fc68955bad266b43f0b%7C0%7C0%7C637644339790709619%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000=jrQU9WbtFLM1T71SVaZwa0U57F8GcBSFHmXiauQtou0%3D=0>)
> and ASF projects
> (https://hub.docker.com/u/apache
> 
> <https://eur02.safelinks.protection.outlook.com/?url=https%3A%2F%2Fhub.docker.com%2Fu%2Fapache=04%7C01%7CMeikel.Bode%40bertelsmann.de%7Cd97d97be540246aa975308d95e260c99%7C1ca8bd943c974fc68955bad266b43f0b%7C0%7C0%7C637644339790719573%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000=yD8NWSYhhL6%2BDb3D%2BfD%2F8ynKAL4Wp8BKDMHV0n7jHHM%3D=0>)
> now publish their images to dockerhub.
>
>  
>
> We've already got the docker image
> tooling in place, I think we'd
> need to ask the ASF to grant
> permissions to the PMC to publish
> containers and update the release
> steps but I think this could be
> useful for folks.
>
>  
>
> Cheers,
>
>  
>
> Holden
>
>  
>
> -- 
>
> Twitter: https://twitter.com/holdenkarau
> 
> <https://eur02.safelinks.protection.outlook.com/?url=https%3A%2F%2Ftwitter.com%2Fholdenkarau=04%7C01%7CMeikel.Bode%40bertelsmann.de%7Cd97d97be540246aa975308d95e260c99%7C1ca8bd943c974fc68955bad266b43f0b%7C0%7C0%7C637644339790719573%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000=4qhg1CzKNiiRZkbvzKMp7WL4BoYLzPZ%2FOpFwHu8KNmg%3D=0>
>
> Books (Learning Spark, High
> Performance Spark,
> etc.): https://amzn.to/2MaRAG9 
> 
> <https://eur02.safelinks.protection.outlook.com/?url=https%3A%2F%2Famzn.to%2F2MaRAG9=04%7C01%7CMeikel.Bode%40bertelsmann.de%7Cd97d97be540246aa975308d95e260c99%7C1ca8bd943c974fc68955bad266b43f0b%7C0%7C0%7C637644339790719573%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000=5UCR1Qn0fLovLAdTFnJBnLYF3e2NRnL8wEYPhCfLf2A%3D=0>
>
> YouTube Live
> Streams: 
> https://www.youtube.com/user/holdenkarau
> 
> <https://eur02.safelinks.protection.outlook.com/?url=https%3A%2F%2Fwww.youtube.com%2Fuser%2Fholdenkarau=04%7C01%7CMeikel.Bode%40bertelsmann.de%7Cd97d97be540246aa975308d95e260c99%7C1ca8bd943c974fc68955bad266b43f0b%7C0%7C0%7C637644339790729540%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000=LbsZdvDNTAc804N2dknen%2BoJavleIsh5vwpNaj7xIio%3D=0>
>
> 
> -
> To unsubscribe e-mail:
> dev-unsubscr...@spark.apache.org
> <mailto:dev-unsubscr...@spark.apache.org>
>
> -- 
>
> John Zhuge
>
>
>  
>
> -- 
>
> Twitter: https://twitter.com/holdenkarau
> 
> <https://eur02.safelinks.protection.outlook.com/?url=https%3A%2F%2Ftwitter.com%2Fholdenkarau=04%7C01%7CMeikel.Bode%40bertelsmann.de%7Cd97d97be540246aa975308d95e260c99

Re: [VOTE] SPIP: Support pandas API layer on PySpark

2021-03-26 Thread Maciej
+1 (nonbinding)

On 3/26/21 3:52 PM, Hyukjin Kwon wrote:
>
> Hi all,
>
> I’d like to start a vote for SPIP: Support pandas API layer on PySpark.
>
> The proposal is to embrace Koalas in PySpark to have the pandas API
> layer on PySpark.
>
>
> Please also refer to:
>
>   * Previous discussion in dev mailing list: [DISCUSS] Support pandas
> API layer on PySpark
> 
> <http://apache-spark-developers-list.1001551.n3.nabble.com/DISCUSS-Support-pandas-API-layer-on-PySpark-td30945.html>.
>   * JIRA: SPARK-34849 <https://issues.apache.org/jira/browse/SPARK-34849>
>   * Koalas internals documentation:
> 
> https://docs.google.com/document/d/1tk24aq6FV5Wu2bX_Ym606doLFnrZsh4FdUd52FqojZU/edit
> 
> <https://docs.google.com/document/d/1tk24aq6FV5Wu2bX_Ym606doLFnrZsh4FdUd52FqojZU/edit>
>
>
> Please vote on the SPIP for the next 72 hours:
>
> [ ] +1: Accept the proposal as an official SPIP
> [ ] +0
> [ ] -1: I don’t think this is a good idea because …
>
>

-- 
Best regards,
Maciej Szymkiewicz

Web: https://zero323.net
Keybase: https://keybase.io/zero323
Gigs: https://www.codementor.io/@zero323
PGP: A30CEF0C31A501EC



OpenPGP_signature
Description: OpenPGP digital signature


Re: [DISCUSS] Support pandas API layer on PySpark

2021-03-15 Thread Maciej
this is a win-win strategy for the growth of both
> pandas and PySpark.
>
>
> In fact, there are already similar tries such as Dask
> <https://dask.org/>and Modin
> <https://modin.readthedocs.io/en/latest/>(other than
> Koalas <https://github.com/databricks/koalas>). They
> are all growing fast and successfully, and I find that
> people compare it to PySpark from time to time, for
> example, see Beyond Pandas: Spark, Dask, Vaex and
> other big data technologies battling head to head
> 
> <https://towardsdatascience.com/beyond-pandas-spark-dask-vaex-and-other-big-data-technologies-battling-head-to-head-a453a1f8cc13>.
>
>  
>
>  *
>
> There are many important features missing that are
> very common in data science. One of the most
>     important features is plotting and drawing a
> chart. Almost every data scientist plots and draws
> a chart to understand their data quickly and
> visually in their daily work but this is missing
> in PySpark. Please see one example in pandas:
>
>
>  
>
> I do recommend taking a quick look for blog posts and
> talks made for pandas on Spark:
> 
> https://koalas.readthedocs.io/en/latest/getting_started/videos_blogs.html
> 
> <https://koalas.readthedocs.io/en/latest/getting_started/videos_blogs.html>.
> They explain why we need this far better.
>
>

-- 
Best regards,
Maciej Szymkiewicz

Web: https://zero323.net
Keybase: https://keybase.io/zero323
Gigs: https://www.codementor.io/@zero323
PGP: A30CEF0C31A501EC



OpenPGP_signature
Description: OpenPGP digital signature


Re: [Spark SQL]: SQL, Python, Scala and R API Consistency

2021-01-30 Thread Maciej
Just thinking out loud ‒ if there is community need for providing
language bindings for less popular SQL functions, could these live
outside main project or even outside the ASF?  As long as expressions
are already implemented, bindings are trivial after all.

It could also allow usage of a more scalable hierarchy (let's say with
modules / packages per function family).

On 1/29/21 5:01 AM, Hyukjin Kwon wrote:
> FYI exposing methods with Column signature only is already documented
> on the top of functions.scala, and I believe that has been the current
> dev direction if I am not mistaken.
>
> Another point is that we should rather expose commonly used
> expressions. It's best if it considers language-specific context. Many
> of the expressions are for SQL compliance. Many data science Python
> libraries don't support such features, as an example.
>
>
>
> On Fri, 29 Jan 2021, 12:04 Matthew Powers,
> mailto:matthewkevinpow...@gmail.com>>
> wrote:
>
> Thanks for the thoughtful responses.  I now understand why adding
> all the functions across all the APIs isn't the default.
>
> To Nick's point, relying on heuristics to gauge user interest, in
> addition to personal experience, is a good idea.  The
> regexp_extract_all SO thread has 16,000 views
> 
> <https://stackoverflow.com/questions/47981699/extract-words-from-a-string-column-in-spark-dataframe/47989473>,
> so I say we set the threshold to 10k, haha, just kidding!  Like
> Sean mentioned, we don't want to add niche functions.  Now we just
> need a way to figure out what's niche!
>
> To Reynolds point on overloading Scala functions, I think we
> should start trying to limit the number of overloaded functions. 
> Some functions have the columnName and column object function
> signatures.  e.g. approx_count_distinct(columnName: String, rsd:
> Double) and approx_count_distinct(e: Column, rsd: Double).  We can
> just expose the approx_count_distinct(e: Column, rsd: Double)
> variety going forward (not suggesting any backwards incompatible
> changes, just saying we don't need the columnName-type functions
> for new stuff).
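
For illustration, a minimal, untested PySpark sketch of the point above: on
the Python side a single function already covers what needs several
overloads in Scala.

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.range(100).toDF("x")

# One Python function handles both call styles that need separate
# Scala overloads (column name vs Column object):
df.select(F.approx_count_distinct("x", 0.05)).show()
df.select(F.approx_count_distinct(F.col("x"), 0.05)).show()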
>
> Other functions have one signature with the second object as a
> Scala object and another signature with the second object as a
> column object, e.g. date_add(start: Column, days: Column) and
> date_add(start: Column, days: Int).  We can just expose the
> date_add(start: Column, days: Column) variety cause it's general
> purpose.  Let me know if you think that avoiding Scala function
> overloading will help Reynold.
>
> Let's brainstorm Nick's idea of creating a framework that'd test
> Scala / Python / SQL / R implementations in one-fell-swoop.  Seems
> like that'd be a great way to reduce the maintenance burden. 
> Reynold's regexp_extract code from 5 years ago is largely still
> intact - getting the job done right the first time is another
> great way to avoid maintenance!
>
> On Thu, Jan 28, 2021 at 6:38 PM Reynold Xin  <mailto:r...@databricks.com>> wrote:
>
> There's another thing that's not mentioned … it's primarily a
> problem for Scala. Due to static typing, we need a very large
> number of function overloads for the Scala version of each
> function, whereas in SQL/Python they are just one. There's a
> limit on how many functions we can add, and it also makes it
> difficult to browse through the docs when there are a lot of
> functions.
>
>
>
> On Thu, Jan 28, 2021 at 1:09 PM, Maciej
> mailto:mszymkiew...@gmail.com>> wrote:
>
> Just my two cents on R side.
>
> On 1/28/21 10:00 PM, Nicholas Chammas wrote:
>> On Thu, Jan 28, 2021 at 3:40 PM Sean Owen
>> mailto:sro...@gmail.com>> wrote:
>>
>> It isn't that regexp_extract_all (for example) is
>> useless outside SQL, just, where do you draw the
>> line? Supporting 10s of random SQL functions across 3
>> other languages has a cost, which has to be weighed
>> against benefit, which we can never measure well
>> except anecdotally: one or two people say "I want
>> this" in a sea of hundreds of thousands of users.
>>
>>
>> +1 to this, but I will add that Jira and Stack Overflow
>> activity can sometimes give good signals about API gaps
>> that are frustrating users. If there is an SO question
>> with 30K views about how to do something that shoul

Re: [Spark SQL]: SQL, Python, Scala and R API Consistency

2021-01-28 Thread Maciej
Just my two cents on R side.

On 1/28/21 10:00 PM, Nicholas Chammas wrote:
> On Thu, Jan 28, 2021 at 3:40 PM Sean Owen  <mailto:sro...@gmail.com>> wrote:
>
> It isn't that regexp_extract_all (for example) is useless outside
> SQL, just, where do you draw the line? Supporting 10s of random
> SQL functions across 3 other languages has a cost, which has to be
> weighed against benefit, which we can never measure well except
> anecdotally: one or two people say "I want this" in a sea of
> hundreds of thousands of users.
>
>
> +1 to this, but I will add that Jira and Stack Overflow activity can
> sometimes give good signals about API gaps that are frustrating users.
> If there is an SO question with 30K views about how to do something
> that should have been easier, then that's an important signal about
> the API.
>
> For this specific case, I think there is a fine argument
> that regexp_extract_all should be added simply for consistency
> with regexp_extract. I can also see the argument
> that regexp_extract was a step too far, but, what's public is now
> a public API.
>
>
> I think in this case a few references to where/how people are having
> to work around missing a direct function for regexp_extract_all could
> help guide the decision. But that itself means we are making these
> decisions on a case-by-case basis.
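
To illustrate the kind of workaround being referred to, an untested PySpark
sketch that reaches the SQL function through expr, since there was no
regexp_extract_all in pyspark.sql.functions at the time:

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("100-200, 300-400",)], ["s"])

# No dedicated Python binding, so the SQL function is reached through an
# expression string instead of a hypothetical F.regexp_extract_all(...):
df.select(
    F.expr(r"regexp_extract_all(s, '(\\d+)-(\\d+)', 1)").alias("starts")
).show()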
>
> From a user perspective, it's definitely conceptually simpler to have
> SQL functions be consistent and available across all APIs.
>
> Perhaps if we had a way to lower the maintenance burden of keeping
> functions in sync across SQL/Scala/Python/R, it would be easier for
> everyone to agree to just have all the functions be included across
> the board all the time.

Python aligns quite well with Scala so that might be fine, but R is a
bit trickier. In particular, the lack of proper namespaces makes it rather
risky to have packages that export hundreds of functions. sparklyr
handles this neatly with NSE, but I don't think we're going to go that way.

>
> Would, for example, some sort of automatic testing mechanism for SQL
> functions help here? Something that uses a common function testing
> specification to automatically test SQL, Scala, Python, and R
> functions, without requiring maintainers to write tests for each
> language's version of the functions. Would that address the
> maintenance burden?

With R we don't really test most of the functions beyond the simple
"callability". One the complex ones, that require some non-trivial
transformations of arguments, are fully tested.

-- 
Best regards,
Maciej Szymkiewicz

Web: https://zero323.net
Keybase: https://keybase.io/zero323
Gigs: https://www.codementor.io/@zero323
PGP: A30CEF0C31A501EC



OpenPGP_signature
Description: OpenPGP digital signature


Re: Broken rlang installation on AppVeyor

2020-10-09 Thread Maciej
Not a problem. Seems like there is more to this than just this discrepancy,
though.

Here is draft PR https://github.com/apache/spark/pull/29991

This probably deserves a JIRA ticket, right?

On 10/9/20 1:48 PM, Hyukjin Kwon wrote:
> Thanks for reporting this. I think we should change to "x64". Can you
> open a PR to change?
>
> 2020년 10월 9일 (금) 오전 4:36, Maciej  <mailto:mszymkiew...@gmail.com>>님이 작성:
>
> Hi Everyone,
>
> I've been digging into AppVeyor test failures for
> https://github.com/apache/spark/pull/29978
>
>
> I see the following error
>
> [00:01:48] trying URL
> 'https://cloud.r-project.org/src/contrib/rlang_0.4.8.tar.gz'
> [00:01:48] Content type 'application/x-gzip' length 847517 bytes
> (827 KB)
> [00:01:48] ==
> [00:01:48] downloaded 827 KB
> [00:01:48] 
> [00:01:48] Warning in strptime(xx, f, tz = tz) :
> [00:01:48]   unable to identify current timezone 'C':
> [00:01:48] please set environment variable 'TZ'
> [00:01:49] * installing *source* package 'rlang' ...
> [00:01:49] ** package 'rlang' successfully unpacked and MD5 sums
> checked
> [00:01:49] ** using staged installation
> [00:01:49] ** libs
> [00:01:49] 
> [00:01:49] *** arch - i386
> [00:01:49] C:/Rtools40/mingw64/bin/gcc  -I"C:/R/include" -DNDEBUG
> -I./lib/ -O2 -Wall  -std=gnu99 -mfpmath=sse -msse2
> -mstackrealign -c capture.c -o capture.o
> [00:01:49] C:/Rtools40/mingw64/bin/gcc  -I"C:/R/include" -DNDEBUG
> -I./lib/ -O2 -Wall  -std=gnu99 -mfpmath=sse -msse2
> -mstackrealign -c export.c -o export.o
> [00:01:49] C:/Rtools40/mingw64/bin/gcc  -I"C:/R/include" -DNDEBUG
> -I./lib/ -O2 -Wall  -std=gnu99 -mfpmath=sse -msse2
> -mstackrealign -c internal.c -o internal.o
> [00:01:50] In file included from ./lib/rlang.h:74,
> [00:01:50]  from internal/arg.c:1,
> [00:01:50]  from internal.c:1:
> [00:01:50] internal/eval-tidy.c: In function 'rlang_tilde_eval':
> [00:01:50] ./lib/env.h:33:10: warning: 'top' may be used uninitialized
> in this function [-Wmaybe-uninitialized]
> [00:01:50]    return ENCLOS(env);
> [00:01:50]   ^~~
> [00:01:50] In file included from internal.c:8:
> [00:01:50] internal/eval-tidy.c:406:9: note: 'top' was declared here
> [00:01:50]    sexp* top;
> [00:01:50]  ^~~
> [00:01:50] C:/Rtools40/mingw64/bin/gcc  -I"C:/R/include" -DNDEBUG
> -I./lib/ -O2 -Wall  -std=gnu99 -mfpmath=sse -msse2
> -mstackrealign -c lib.c -o lib.o
> [00:01:51] C:/Rtools40/mingw64/bin/gcc  -I"C:/R/include" -DNDEBUG
> -I./lib/ -O2 -Wall  -std=gnu99 -mfpmath=sse -msse2
> -mstackrealign -c version.c -o version.o
> [00:01:52] C:/Rtools40/mingw64/bin/gcc -shared -s -static-libgcc -o
> rlang.dll tmp.def capture.o export.o internal.o lib.o version.o
> -LC:/R/bin/i386 -lR
> [00:01:52]
> 
> c:/Rtools40/mingw64/bin/../lib/gcc/x86_64-w64-mingw32/8.3.0/../../../../x86_64-w64-mingw32/bin/ld.exe:
> skipping incompatible C:/R/bin/i386/R.dll when searching for -lR
> [00:01:52]
> 
> c:/Rtools40/mingw64/bin/../lib/gcc/x86_64-w64-mingw32/8.3.0/../../../../x86_64-w64-mingw32/bin/ld.exe:
> skipping incompatible C:/R/bin/i386/R.dll when searching for -lR
> [00:01:52]
> 
> c:/Rtools40/mingw64/bin/../lib/gcc/x86_64-w64-mingw32/8.3.0/../../../../x86_64-w64-mingw32/bin/ld.exe:
> cannot find -lR
> [00:01:52] collect2.exe: error: ld returned 1 exit status
> [00:01:52] no DLL was created
> [00:01:52] ERROR: compilation failed for package 'rlang'
> [00:01:52] * removing 'C:/RLibrary/rlang'
> [00:01:52] 
> [00:01:52] The downloaded source packages are in
> [00:01:52]    
> 'C:\Users\appveyor\AppData\Local\Temp\1\Rtmp8qrryA\downloaded_packages'
> [00:01:52] Warning message:
> [00:01:52] In install.packages(c("knitr", "rmarkdown", "testthat",
> "e1071",  :
> [00:01:52]   installation of package 'rlang' had non-zero exit status
>
>
> This seems to be triggered by some changes between rlang 0.4.7 and
> 0.4.8
> (previous run with 0.4.7
> 
> https://ci.appveyor.com/project/ApacheSoftwareFoundation/spark/builds/35630069),
> but is there any reason why we seem to default to i386
> 
> (https://github.com/apache/spark/blob/c5f6af9f17498bb0ec393c16616f2d99e5d3ee3d/dev/appveyor-install-dependencies.ps1#L22)
> for R installation, while RTools are hard coded

Broken rlang installation on AppVeyor

2020-10-08 Thread Maciej
Hi Everyone,

I've been digging into AppVeyor test failures for
https://github.com/apache/spark/pull/29978


I see the following error

[00:01:48] trying URL
'https://cloud.r-project.org/src/contrib/rlang_0.4.8.tar.gz'
[00:01:48] Content type 'application/x-gzip' length 847517 bytes (827 KB)
[00:01:48] ==
[00:01:48] downloaded 827 KB
[00:01:48] 
[00:01:48] Warning in strptime(xx, f, tz = tz) :
[00:01:48]   unable to identify current timezone 'C':
[00:01:48] please set environment variable 'TZ'
[00:01:49] * installing *source* package 'rlang' ...
[00:01:49] ** package 'rlang' successfully unpacked and MD5 sums checked
[00:01:49] ** using staged installation
[00:01:49] ** libs
[00:01:49] 
[00:01:49] *** arch - i386
[00:01:49] C:/Rtools40/mingw64/bin/gcc  -I"C:/R/include" -DNDEBUG
-I./lib/ -O2 -Wall  -std=gnu99 -mfpmath=sse -msse2
-mstackrealign -c capture.c -o capture.o
[00:01:49] C:/Rtools40/mingw64/bin/gcc  -I"C:/R/include" -DNDEBUG
-I./lib/ -O2 -Wall  -std=gnu99 -mfpmath=sse -msse2
-mstackrealign -c export.c -o export.o
[00:01:49] C:/Rtools40/mingw64/bin/gcc  -I"C:/R/include" -DNDEBUG
-I./lib/ -O2 -Wall  -std=gnu99 -mfpmath=sse -msse2
-mstackrealign -c internal.c -o internal.o
[00:01:50] In file included from ./lib/rlang.h:74,
[00:01:50]  from internal/arg.c:1,
[00:01:50]  from internal.c:1:
[00:01:50] internal/eval-tidy.c: In function 'rlang_tilde_eval':
[00:01:50] ./lib/env.h:33:10: warning: 'top' may be used uninitialized
in this function [-Wmaybe-uninitialized]
[00:01:50]    return ENCLOS(env);
[00:01:50]   ^~~
[00:01:50] In file included from internal.c:8:
[00:01:50] internal/eval-tidy.c:406:9: note: 'top' was declared here
[00:01:50]    sexp* top;
[00:01:50]  ^~~
[00:01:50] C:/Rtools40/mingw64/bin/gcc  -I"C:/R/include" -DNDEBUG
-I./lib/ -O2 -Wall  -std=gnu99 -mfpmath=sse -msse2
-mstackrealign -c lib.c -o lib.o
[00:01:51] C:/Rtools40/mingw64/bin/gcc  -I"C:/R/include" -DNDEBUG
-I./lib/ -O2 -Wall  -std=gnu99 -mfpmath=sse -msse2
-mstackrealign -c version.c -o version.o
[00:01:52] C:/Rtools40/mingw64/bin/gcc -shared -s -static-libgcc -o
rlang.dll tmp.def capture.o export.o internal.o lib.o version.o
-LC:/R/bin/i386 -lR
[00:01:52]
c:/Rtools40/mingw64/bin/../lib/gcc/x86_64-w64-mingw32/8.3.0/../../../../x86_64-w64-mingw32/bin/ld.exe:
skipping incompatible C:/R/bin/i386/R.dll when searching for -lR
[00:01:52]
c:/Rtools40/mingw64/bin/../lib/gcc/x86_64-w64-mingw32/8.3.0/../../../../x86_64-w64-mingw32/bin/ld.exe:
skipping incompatible C:/R/bin/i386/R.dll when searching for -lR
[00:01:52]
c:/Rtools40/mingw64/bin/../lib/gcc/x86_64-w64-mingw32/8.3.0/../../../../x86_64-w64-mingw32/bin/ld.exe:
cannot find -lR
[00:01:52] collect2.exe: error: ld returned 1 exit status
[00:01:52] no DLL was created
[00:01:52] ERROR: compilation failed for package 'rlang'
[00:01:52] * removing 'C:/RLibrary/rlang'
[00:01:52] 
[00:01:52] The downloaded source packages are in
[00:01:52]    
'C:\Users\appveyor\AppData\Local\Temp\1\Rtmp8qrryA\downloaded_packages'
[00:01:52] Warning message:
[00:01:52] In install.packages(c("knitr", "rmarkdown", "testthat",
"e1071",  :
[00:01:52]   installation of package 'rlang' had non-zero exit status


This seems to be triggered by some changes between rlang 0.4.7 and 0.4.8
(previous run with 0.4.7
https://ci.appveyor.com/project/ApacheSoftwareFoundation/spark/builds/35630069),
but is there any reason why we seem to default to i386
(https://github.com/apache/spark/blob/c5f6af9f17498bb0ec393c16616f2d99e5d3ee3d/dev/appveyor-install-dependencies.ps1#L22)
for R installation, while RTools are hard coded to x86_64 
(https://github.com/apache/spark/blob/c5f6af9f17498bb0ec393c16616f2d99e5d3ee3d/dev/appveyor-install-dependencies.ps1#L53)?


-- 
Best regards,
Maciej Szymkiewicz

Web: https://zero323.net
Keybase: https://keybase.io/zero323
Gigs: https://www.codementor.io/@zero323
PGP: A30CEF0C31A501EC




signature.asc
Description: OpenPGP digital signature


[DISCUSS][R] Adding magrittr as a dependency for SparkR

2020-09-30 Thread Maciej
Hi Everyone,

I'd like to start a discussion about possibility of adding magrittr
(https://magrittr.tidyverse.org/) as an explicit dependency for SparkR.
For those not familiar with the package, it provides a number small
utilities where the most important one is %>% function, similar to
pipe-forward (|>) in F# or thread-first macro (->) in Clojure. In other
words, it allows us to replace:

df <- createDataFrame(iris)

df_filtered <- filter(df, df$Sepal_Width > df$Petal_Length)

df_projected <- select(df_filtered, min(df$Sepal_Width - df$Petal_Length))

or


df_projected <- select(

  filter(createDataFrame(iris), column("Sepal_Width") >
column("Petal_Length")),

  min(column("Sepal_Width") - column("Petal_Length"))

)

with

df_projected <- createDataFrame(iris) %>% 
  filter(.$Sepal_Width > .$Petal_Length) %>%
  select(min(.$Sepal_Width - .$Petal_Length))

It is widely used (see reverse dependency section
https://cran.r-project.org/web/packages/magrittr/index.html), stable and
pretty much a core element of idiomatic R code these days.

Why we might want to add it:

  * Improve readability of SparkR examples which, subjectively speaking,
can look a bit archaic.
  * Reduce verbosity of SparkR codebase.


Possible risks:

  * It is an additional dependency for the CI pipeline.

A: magrittr is already a transitive dependency for SparkR tests (it
is required by testthat), its API is extremely stable and itself
requires no dependencies.
  * It is an additional dependency for SparkR installations.

A: Given widespread usage (over 1200 reverse imports, including some
of the most popular packages), it is probably already part of any but
the most minimal R installations.

While it's just anecdotal evidence, most of the SparkR applications
I've seen out there, already use magrittr.


Non-goals:

  * Supporting non-standard evaluation.


Thanks in advance for your input.

-- 
Best regards,
Maciej Szymkiewicz

Web: https://zero323.net
Keybase: https://keybase.io/zero323
Gigs: https://www.codementor.io/@zero323
PGP: A30CEF0C31A501EC



signature.asc
Description: OpenPGP digital signature


Re: [PySpark] Revisiting PySpark type annotations

2020-08-27 Thread Maciej
On my side, I'll try to identify any possible problems by the end of the
week or so (on a somewhat crude inspection there is nothing unexpected or
particularly hard to resolve, but sometimes problems occur when you try
to refine things) and I'll post an update. Maybe we could take it from
there?

In general I'd expect that there will be some work to be done in the
following areas

  * Providing at least basic contributor guidelines ‒ as already
discussed, annotations can serve different purposes and take
different approaches to certain issues. I expect this might evolve
over time, but it might be nice to have something to start with.
  * CI pipeline. Codebase MyPy tests are usually not sufficient ‒ right
now I test against patched examples and use some additional
data-driven cases, but once combined, we can explore other options
(doctests are a nice lead).

In the short term there are also some upstream changes that haven't been
reflected in stubs master...


On 8/27/20 10:24 PM, Driesprong, Fokko wrote:
> . Any action points that we can define and that I can help on? I'm
> fine with taking the route that Hyukjin suggests :)
>
-- 
Best regards,
Maciej Szymkiewicz

Web: https://zero323.net
Keybase: https://keybase.io/zero323
Gigs: https://www.codementor.io/@zero323
PGP: A30CEF0C31A501EC



signature.asc
Description: OpenPGP digital signature


Re: [PySpark] Revisiting PySpark type annotations

2020-08-27 Thread Maciej
Oh, this is probably because of how annotations are handled.

In general stubs take preference over inline annotations and are
considered the only source of type hints, unless the package is marked as
partially typed (https://www.python.org/dev/peps/pep-0561/#id21). Even
then, however, it is all-or-nothing on a per-module basis.
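
A minimal, hypothetical illustration (module and function names are made
up): with a stub present, anything missing from it is invisible to the type
checker, which is exactly the kind of error shown in the quoted lint output
below.

# foo.py - hypothetical module
def add(a, b):
    return a + b

def _helper(x):          # present at runtime, but not declared in the stub
    return x * 2

# foo.pyi - the accompanying stub
def add(a: int, b: int) -> int: ...
# "_helper" is missing here, so mypy reports
# "Module 'foo' has no attribute '_helper'" at any call site.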

Nonetheless the ecosystem is still somewhat unstable, so different tools
might choose different approaches.

On 8/27/20 10:24 PM, Driesprong, Fokko wrote:
> Looking at it a second time, I think it is only mypy that's complaining:
>
> MacBook-Pro-van-Fokko:spark fokkodriesprong$ git diff
>
> *diff --git a/python/pyspark/accumulators.pyi
> b/python/pyspark/accumulators.pyi*
>
> *index f60de25704..6eafe46a46 100644*
>
> *--- a/python/pyspark/accumulators.pyi*
>
> *+++ b/python/pyspark/accumulators.pyi*
>
> @@ -30,7 +30,7 @@U = TypeVar("U", bound=SupportsIAdd)
>
>  
>
>  import socketserver as SocketServer
>
>  
>
> -_accumulatorRegistry: Dict = {}
>
> +# _accumulatorRegistry: Dict = {}
>
>  
>
>  class Accumulator(Generic[T]):
>
>      aid: int
>
>
> MacBook-Pro-van-Fokko:spark fokkodriesprong$ ./dev/lint-python 
>
> starting python compilation test...
>
> python compilation succeeded.
>
>
> starting pycodestyle test...
>
> pycodestyle checks passed.
>
>
> starting flake8 test...
>
> flake8 checks passed.
>
>
> starting mypy test...
>
> mypy checks failed:
>
> python/pyspark/worker.py:34: error: Module 'pyspark.accumulators' has
> no attribute '_accumulatorRegistry'
>
> Found 1 error in 1 file (checked 185 source files)
>
> 1
>
>
> Sorry for the noise, just my excitement to see this happen. Any action
> points that we can define and that I can help on? I'm fine with taking
> the route that Hyukjin suggests :)
>
> Cheers, Fokko
>
> Op do 27 aug. 2020 om 18:45 schreef Maciej  <mailto:mszymkiew...@gmail.com>>:
>
> Well, technically speaking annotations and the actual code are not the same
> thing. Many parts of the Spark API might require heavy overloads to
> either capture relationships between arguments (for example in the
> case of ML) or to capture at least rudimentary relationships
> between inputs and outputs (i.e. udfs).
>
> Just saying...
>
>
>
> On 8/27/20 6:09 PM, Driesprong, Fokko wrote:
>> Also, it is very cumbersome to add everything to the pyi file. In
>> practice, this means copying the method definition from the py
>> file and paste it into the pyi file. This hurts my developers'
>> heart, as it violates the DRY principle. 
>
>
>>
>> I see many big projects using regular annotations: 
>> -
>> Pandas: 
>> https://github.com/pandas-dev/pandas/blob/master/pandas/io/parquet.py#L51
>
> That's probably not a good example, unless something changed
> significantly lately. The last time I participated in the
> discussion Pandas didn't type check and had no clear timeline for
> advertising annotations.
>
>
> -- 
> Best regards,
> Maciej Szymkiewicz
>
> Web: https://zero323.net
> Keybase: https://keybase.io/zero323
> Gigs: https://www.codementor.io/@zero323
> PGP: A30CEF0C31A501EC
>

-- 
Best regards,
Maciej Szymkiewicz

Web: https://zero323.net
Keybase: https://keybase.io/zero323
Gigs: https://www.codementor.io/@zero323
PGP: A30CEF0C31A501EC



signature.asc
Description: OpenPGP digital signature


Re: [PySpark] Revisiting PySpark type annotations

2020-08-27 Thread Maciej
Well, technically speaking annotations and the actual code are not the same thing.
Many parts of the Spark API might require heavy overloads to either capture
relationships between arguments (for example in the case of ML) or to
capture at least rudimentary relationships between inputs and outputs
(i.e. udfs).

Just saying...



On 8/27/20 6:09 PM, Driesprong, Fokko wrote:
> Also, it is very cumbersome to add everything to the pyi file. In
> practice, this means copying the method definition from the py file
> and paste it into the pyi file. This hurts my developers' heart, as it
> violates the DRY principle. 


>
> I see many big projects using regular annotations: 
> -
> Pandas: 
> https://github.com/pandas-dev/pandas/blob/master/pandas/io/parquet.py#L51

That's probably not a good example, unless something changed
significantly lately. The last time I participated in the discussion
Pandas didn't type check and had no clear timeline for advertising
annotations.


-- 
Best regards,
Maciej Szymkiewicz

Web: https://zero323.net
Keybase: https://keybase.io/zero323
Gigs: https://www.codementor.io/@zero323
PGP: A30CEF0C31A501EC



signature.asc
Description: OpenPGP digital signature


Re: [PySpark] Revisiting PySpark type annotations

2020-08-27 Thread Maciej
That doesn't sound right. Would it be a problem for you to provide a
reproducible example?

On 8/27/20 6:09 PM, Driesprong, Fokko wrote:
> Today I've updated [SPARK-17333][PYSPARK] Enable mypy on the
> repository <https://github.com/apache/spark/pull/29180/> and while
> doing so I've noticed that all the methods that aren't in the pyi file
> are *unable to be called from other python files*. I was unaware of
> this effect of the pyi files. As soon as you create the files, all the
> methods are shielded from external access. Feels like going back to
> cpp :'(
>

-- 
Best regards,
Maciej Szymkiewicz

Web: https://zero323.net
Keybase: https://keybase.io/zero323
Gigs: https://www.codementor.io/@zero323
PGP: A30CEF0C31A501EC



signature.asc
Description: OpenPGP digital signature


Re: [PySpark] Revisiting PySpark type annotations

2020-08-04 Thread Maciej Szymkiewicz
Indeed, though the possible advantage is that in theory, you can have
different release cycle than for the main repo (I am not sure if that's
feasible in practice or if that was the intention).

I guess all depends on how we envision the future of annotations
(including, but not limited to, how conservative we want to be in the
future). Which is probably something that should be discussed here.

On 8/4/20 11:06 PM, Felix Cheung wrote:
> So IMO maintaining outside in a separate repo is going to be harder.
> That was why I asked.
>
>
>  
> ----
> *From:* Maciej Szymkiewicz 
> *Sent:* Tuesday, August 4, 2020 12:59 PM
> *To:* Sean Owen
> *Cc:* Felix Cheung; Hyukjin Kwon; Driesprong, Fokko; Holden Karau;
> Spark Dev List
> *Subject:* Re: [PySpark] Revisiting PySpark type annotations
>  
>
> On 8/4/20 9:35 PM, Sean Owen wrote
> > Yes, but the general argument you make here is: if you tie this
> > project to the main project, it will _have_ to be maintained by
> > everyone. That's good, but also exactly I think the downside we want
> > to avoid at this stage (I thought?) I understand for some
> > undertakings, it's just not feasible to start outside the main
> > project, but is there no proof of concept even possible before taking
> > this step -- which more or less implies it's going to be owned and
> > merged and have to be maintained in the main project.
>
>
> I think we have a bit different understanding here ‒ I believe we have
> reached a conclusion that maintaining annotations within the project is
> OK, we only differ when it comes to specific form it should take.
>
> As of the POC ‒ we have stubs, which have been maintained for over three years
> now and cover versions between 2.3 (though these are fairly limited) to,
> with some lag, current master.  There is some evidence they are used in
> the wild
> (https://github.com/zero323/pyspark-stubs/network/dependents?package_id=UGFja2FnZS02MzU1MTc4Mg%3D%3D),
> there are a few contributors
> (https://github.com/zero323/pyspark-stubs/graphs/contributors) and at
> least some use cases (https://stackoverflow.com/q/40163106/). So,
> subjectively speaking, it seems we're already beyond POC.
>
> -- 
> Best regards,
> Maciej Szymkiewicz
>
> Web: https://zero323.net
> Keybase: https://keybase.io/zero323
> Gigs: https://www.codementor.io/@zero323
> PGP: A30CEF0C31A501EC
>
>
-- 
Best regards,
Maciej Szymkiewicz

Web: https://zero323.net
Keybase: https://keybase.io/zero323
Gigs: https://www.codementor.io/@zero323
PGP: A30CEF0C31A501EC



signature.asc
Description: OpenPGP digital signature


Re: [PySpark] Revisiting PySpark type annotations

2020-08-04 Thread Maciej Szymkiewicz

On 8/4/20 9:35 PM, Sean Owen wrote
> Yes, but the general argument you make here is: if you tie this
> project to the main project, it will _have_ to be maintained by
> everyone. That's good, but also exactly I think the downside we want
> to avoid at this stage (I thought?) I understand for some
> undertakings, it's just not feasible to start outside the main
> project, but is there no proof of concept even possible before taking
> this step -- which more or less implies it's going to be owned and
> merged and have to be maintained in the main project.


I think we have a bit different understanding here ‒ I believe we have
reached a conclusion that maintaining annotations within the project is
OK, we only differ when it comes to specific form it should take.

As of the POC ‒ we have stubs, which have been maintained for over three years
now and cover versions between 2.3 (though these are fairly limited) to,
with some lag, current master.  There is some evidence they are used in
the wild
(https://github.com/zero323/pyspark-stubs/network/dependents?package_id=UGFja2FnZS02MzU1MTc4Mg%3D%3D),
there are a few contributors
(https://github.com/zero323/pyspark-stubs/graphs/contributors) and at
least some use cases (https://stackoverflow.com/q/40163106/). So,
subjectively speaking, it seems we're already beyond POC.

-- 
Best regards,
Maciej Szymkiewicz

Web: https://zero323.net
Keybase: https://keybase.io/zero323
Gigs: https://www.codementor.io/@zero323
PGP: A30CEF0C31A501EC




signature.asc
Description: OpenPGP digital signature


Re: [PySpark] Revisiting PySpark type annotations

2020-08-04 Thread Maciej Szymkiewicz
*First of all why ASF ownership? *

For a project of this size, maintaining high-quality annotations (it is not
hard to use stubgen or MonkeyType, but the resulting annotations are rather
simplistic) independently of the actual codebase is far from
trivial. For starters, changes which are mostly transparent to the final
user (like pyspark.ml changes in 3.0 / 3.1) might require significant
changes in the annotations. Additionally some signature changes are
rather hard to track and such separation can easily lead to divergence.

Additionally, annotations are as much about describing facts, as showing
intended usage (the simplest use case is documenting argument
dependencies). This makes the process of annotation rather subjective and
requires a good understanding of the author's intention.

Finally, annotation-friendly signatures require conscious decisions (see
for example https://github.com/python/mypy/issues/5621).

Overall, ASF ownership is probably the best way to ensure long-term
sustainability and quality of annotations.

*Now, why separate repo?*

Based on the discussion so far it is clear that there is no consensus
about using inline annotations. There are three other options:

  * Stub files packaged alongside actual code.
  * Separate project within root, packaged separately.
  * Separate repository, packaged separately.

As already pointed out here and in the comments to
https://github.com/apache/spark/pull/29180, annotations are still
somewhat unstable. The ecosystem evolves quickly, with new features, some
having the potential to fundamentally change the way we annotate code.

Therefore, it might be beneficial to maintain a subproject (for lack of
a better word) that can evolve faster than the code that is annotated.

While I have no strong opinion about this part, it is definitely a
relatively unobtrusive way of bringing code and annotations closer
together.

On 8/4/20 7:44 PM, Sean Owen wrote:

> Maybe more specifically, why an ASF repo?
>
> On Tue, Aug 4, 2020 at 11:45 AM Felix Cheung  
> wrote:
>> What would be the reason for separate git repo?
>>
>> 
>> From: Hyukjin Kwon 
>> Sent: Monday, August 3, 2020 1:58:55 AM
>> To: Maciej Szymkiewicz 
>> Cc: Driesprong, Fokko ; Holden Karau 
>> ; Spark Dev List 
>> Subject: Re: [PySpark] Revisiting PySpark type annotations
>>
>> Okay, seems like we can create a separate repo as apache/spark? e.g.) 
>> https://issues.apache.org/jira/browse/INFRA-20470
>> We can also think about porting the files as are.
>> I will try to have a short sync with the author Maciej, and share what we 
>> discussed offline.
>>
-- 
Best regards,
Maciej Szymkiewicz

Web: https://zero323.net
Keybase: https://keybase.io/zero323
Gigs: https://www.codementor.io/@zero323
PGP: A30CEF0C31A501EC



signature.asc
Description: OpenPGP digital signature


Re: [PySpark] Revisiting PySpark type annotations

2020-07-22 Thread Maciej Szymkiewicz
W dniu środa, 22 lipca 2020 Driesprong, Fokko 
napisał(a):

> That's probably one-time overhead so it is not a big issue.  In my
> opinion, a bigger one is possible complexity. Annotations tend to introduce
> a lot of cyclic dependencies in the Spark codebase. This can be addressed, but
> it doesn't look great.
>
>
> This is not true (anymore). With Python 3.6 you can add string annotations
> -> 'DenseVector', and in the future with Python 3.7 this is fixed by having
> postponed evaluation: https://www.python.org/dev/peps/pep-0563/
>

As far as I recall, the linked PEP addresses backreferences, not cyclic
dependencies, which weren't a big issue in the first place.

What I mean is actual cyclic stuff - for example pyspark.context
depends on pyspark.rdd and the other way around. These dependencies are not
explicit at the moment.
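
A rough sketch (with a made-up module layout mirroring the pyspark.context
<-> pyspark.rdd cycle) of how such a cycle is usually made explicit once
annotations move inline: a typing-only import combined with a string
annotation.

# context.py - hypothetical
from typing import TYPE_CHECKING

if TYPE_CHECKING:        # seen only by the type checker, so no runtime cycle
    from rdd import RDD

class SparkContext:
    def parallelize(self, data: list) -> "RDD":   # string annotation
        from rdd import RDD                       # local import avoids the cycle at runtime
        return RDD(self, data)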



> Merging stubs into project structure from the other hand has almost no
> overhead.
>
>
> This feels awkward to me, this is like having the docstring in a separate
> file. In my opinion you want to have the signatures and the functions
> together for transparency and maintainability.
>
>
I guess that's a matter of preference. From a maintainability perspective
it is actually much easier to have separate objects.

For example, there are different types of objects that are required for
meaningful checking, which don't really exist in real code (protocols,
aliases, code-generated signatures for complex overloads), as well as
some monkey-patched entities.
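
For instance, a sketch of a stub-only helper with no runtime counterpart;
SupportsIAdd here echoes the bound used in the accumulators.pyi fragment
quoted elsewhere in this archive:

# _typing.pyi - sketch; exists only for the type checker
from typing import Protocol, TypeVar

class SupportsIAdd(Protocol):
    def __iadd__(self, other): ...

T = TypeVar("T", bound=SupportsIAdd)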

Additionally it is often easier to see inconsistencies when typing is
separate.

However, I am not implying that this should be a persistent state.

In general I see two non-breaking paths here.

 - Merge pyspark-stubs as a separate subproject within the main Spark repo,
keep it in sync there with a common CI pipeline, and transfer ownership of
the PyPI package to the ASF.
 - Move stubs directly into python/pyspark and then apply individual stubs
to modules of choice.

Of course, the first proposal could be an initial step for the latter one.


>
> I think DBT is a very nice project where they use annotations very well:
> https://github.com/fishtown-analytics/dbt/blob/dev/marian-
> anderson/core/dbt/graph/queue.py
>
> Also, they left out the types in the docstring, since they are available
> in the annotations itself.
>
>

> In practice, the biggest advantage is actually support for completion, not
> type checking (which works in simple cases).
>
>
> Agreed.
>
> Would you be interested in writing up the Outreachy proposal for work on
> this?
>
>
> I would be, and also happy to mentor. But, I think we first need to agree
> as a Spark community if we want to add the annotations to the code, and in
> which extend.
>





> At some point (in general when things are heavy in generics, which is the
> case here), annotations become somewhat painful to write.
>
>
> That's true, but that might also be a pointer that it is time to refactor
> the function/code :)
>

That might be the case, but it is more often a matter of capturing useful
properties, combined with the requirement to keep things in sync with the
Scala counterparts.



> For now, I tend to think adding type hints to the codes make it difficult
> to backport or revert and more difficult to discuss about typing only
> especially considering typing is arguably premature yet.
>
>
> This feels a bit weird to me, since you want to keep this in sync right?
> Do you provide different stubs for different versions of Python? I had to
> look up the literals: https://www.python.org/dev/peps/pep-0586/
>

I think it is more about portability between Spark versions

>
>
> Cheers, Fokko
>

> Op wo 22 jul. 2020 om 09:40 schreef Maciej Szymkiewicz <
> mszymkiew...@gmail.com>:
>
>>
>> On 7/22/20 3:45 AM, Hyukjin Kwon wrote:
>> > For now, I tend to think adding type hints to the codes make it
>> > difficult to backport or revert and
>> > more difficult to discuss about typing only especially considering
>> > typing is arguably premature yet.
>>
>> About being premature ‒ since typing ecosystem evolves much faster than
>> Spark it might be preferable to keep annotations as a separate project
>> (preferably under AST / Spark umbrella). It allows for faster iterations
>> and supporting new features (for example Literals proved to be very
>> useful), without waiting for the next Spark release.
>>
>> --
>> Best regards,
>> Maciej Szymkiewicz
>>
>> Web: https://zero323.net
>> Keybase: https://keybase.io/zero323
>> Gigs: https://www.codementor.io/@zero323
>> PGP: A30CEF0C31A501EC
>>
>>
>>

-- 

Best regards,
Maciej Szymkiewicz


Re: [PySpark] Revisiting PySpark type annotations

2020-07-22 Thread Maciej Szymkiewicz

On 7/22/20 3:45 AM, Hyukjin Kwon wrote:
> For now, I tend to think adding type hints to the codes make it
> difficult to backport or revert and
> more difficult to discuss about typing only especially considering
> typing is arguably premature yet.

About being premature ‒ since typing ecosystem evolves much faster than
Spark it might be preferable to keep annotations as a separate project
(preferably under AST / Spark umbrella). It allows for faster iterations
and supporting new features (for example Literals proved to be very
useful), without waiting for the next Spark release.

-- 
Best regards,
Maciej Szymkiewicz

Web: https://zero323.net
Keybase: https://keybase.io/zero323
Gigs: https://www.codementor.io/@zero323
PGP: A30CEF0C31A501EC




signature.asc
Description: OpenPGP digital signature


Re: [PySpark] Revisiting PySpark type annotations

2020-07-22 Thread Maciej Szymkiewicz

On 7/21/20 9:40 PM, Holden Karau wrote:
> Yeah I think this could be a great project now that we're only Python
> 3.5+. One potential is making this an Outreachy project to get more
> folks from different backgrounds involved in Spark.

I am honestly not sure if that's really the case.

At the moment I maintain an almost complete set of annotations for the
project. These could be ported in a single step with relatively little
effort.

As for further maintenance ‒ this will have to be done along with the
codebase changes to keep things in sync, so if outreach means
low-hanging fruit, it is unlikely to serve this purpose.

Additionally, there are at least two considerations:

  * At some point (in general when things are heavy in generics, which
is the case here), annotations become somewhat painful to write.
  * In an ideal case API design has to be linked (to a reasonable extent)
with annotations design ‒ not every signature can be annotated in a
meaningful way, which is already a problem with some chunks of Spark
code.

>
> On Tue, Jul 21, 2020 at 12:33 PM Driesprong, Fokko
>  wrote:
>
> Since we've recently dropped support for Python <=3.5
> <https://github.com/apache/spark/pull/28957>, I think it would be
> nice to add support for type annotations. Having this in the main
> repository allows us to do type checking using MyPy
> <http://mypy-lang.org/> in the CI itself.
>
> This is now handled by the Stub
> file: https://www.python.org/dev/peps/pep-0484/#stub-files However
> I think it is nicer to integrate the types with the code itself to
> keep everything in sync, and make it easier for the people who
> work on the codebase itself. A first step would be to move the
> stubs into the codebase. First step would be to cover the public
> API which is the most important one. Having the types with the
> code itself makes it much easier to understand. For example, if
> you can supply a str or column
> here: 
> https://github.com/apache/spark/pull/29122/files#diff-f5295f69bfbdbf6e161aed54057ea36dR2486
>
> One of the implications would be that future PR's on Python should
> cover annotations on the public API's. Curious what the rest of
> the community thinks.
>
> Cheers, Fokko
>
>
>
>
>
>
>
>
>
> Op di 21 jul. 2020 om 20:04 schreef zero323
> mailto:mszymkiew...@gmail.com>>:
>
> Given a discussion related to  SPARK-32320 PR
> <https://github.com/apache/spark/pull/29122>   I'd like to
> resurrect this
> thread. Is there any interest in migrating annotations to the main
> repository?
>
>
>
> --
> Sent from:
> http://apache-spark-developers-list.1001551.n3.nabble.com/
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
> <mailto:dev-unsubscr...@spark.apache.org>
>
>
>
> -- 
> Twitter: https://twitter.com/holdenkarau
> Books (Learning Spark, High Performance Spark,
> etc.): https://amzn.to/2MaRAG9  <https://amzn.to/2MaRAG9>
> YouTube Live Streams: https://www.youtube.com/user/holdenkarau

-- 
Best regards,
Maciej Szymkiewicz

Web: https://zero323.net
Keybase: https://keybase.io/zero323
Gigs: https://www.codementor.io/@zero323
PGP: A30CEF0C31A501EC



signature.asc
Description: OpenPGP digital signature


Re: [PySpark] Revisiting PySpark type annotations

2020-07-22 Thread Maciej Szymkiewicz
 it easier for
> the people who work on the codebase itself. A first step
> would be to move the stubs into the codebase. First step
> would be to cover the public API which is the most
> important one. Having the types with the code itself makes
> it much easier to understand. For example, if you can
> supply a str or column
> here: 
> https://github.com/apache/spark/pull/29122/files#diff-f5295f69bfbdbf6e161aed54057ea36dR2486
>
> One of the implications would be that future PR's on
> Python should cover annotations on the public API's.
> Curious what the rest of the community thinks.
>
> Cheers, Fokko
>
>
>
>
>
>
>
>
>
> Op di 21 jul. 2020 om 20:04 schreef zero323
> mailto:mszymkiew...@gmail.com>>:
>
> Given a discussion related to  SPARK-32320 PR
> <https://github.com/apache/spark/pull/29122>   I'd
> like to resurrect this
> thread. Is there any interest in migrating annotations
> to the main
> repository?
>
>
>
> --
> Sent from:
> http://apache-spark-developers-list.1001551.n3.nabble.com/
>
> 
> -
> To unsubscribe e-mail:
>     dev-unsubscr...@spark.apache.org
> <mailto:dev-unsubscr...@spark.apache.org>
>
>
>
> -- 
> Twitter: https://twitter.com/holdenkarau
> Books (Learning Spark, High Performance Spark,
> etc.): https://amzn.to/2MaRAG9  <https://amzn.to/2MaRAG9>
> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>
-- 
Best regards,
Maciej Szymkiewicz

Web: https://zero323.net
Keybase: https://keybase.io/zero323
Gigs: https://www.codementor.io/@zero323
PGP: A30CEF0C31A501EC



signature.asc
Description: OpenPGP digital signature


Re: Scala vs PySpark Inconsistency: SQLContext/SparkSession access from DataFrame/DataSet

2020-03-18 Thread Maciej Szymkiewicz
Hi Ben,

Please note that `_sc` is not a SQLContext. It is a SparkContext, which
is used primarily for internal calls.

SQLContext is exposed through `sql_ctx`
(https://github.com/apache/spark/blob/8bfaa62f2fcc942dd99a63b20366167277bce2a1/python/pyspark/sql/dataframe.py#L80)
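
For illustration, an untested sketch against the PySpark API of that era:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.range(10)

sql_ctx = df.sql_ctx   # SQLContext, the counterpart of Dataset.sqlContext in Scala
sc = df._sc            # SparkContext; the underscore marks it as internal by convention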

On 3/17/20 5:53 PM, Ben Roling wrote:
> I tried this on the users mailing list but didn't get traction.  It's
> probably more appropriate here anyway.
>
> I've noticed that DataSet.sqlContext is public in Scala but the
> equivalent (DataFrame._sc) in PySpark is named as if it should be
> treated as private.
>
> Is this intentional?  If so, what's the rationale?  If not, then it
> feels like a bug and DataFrame should have some form of public access
> back to the context/session.  I'm happy to log the bug but thought I
> would ask here first.  Thanks!

-- 
Best regards,
Maciej Szymkiewicz

Web: https://zero323.net
Keybase: https://keybase.io/zero323
Gigs: https://www.codementor.io/@zero323
PGP: C095AA7F33E6123A




signature.asc
Description: OpenPGP digital signature


Re: Apache Spark Docker image repository

2020-02-06 Thread Maciej Szymkiewicz

On 2/6/20 2:53 AM, Jiaxin Shan wrote:
> I will vote for this. It's pretty helpful to have managed Spark
> images. Currently, user have to download Spark binaries and build
> their own. 
> With this supported, user journey will be simplified and we only need
> to build an application image on top of base image provided by community. 
>
> Do we have different OS or architecture support? If not, there will be
> Java, R, Python total 3 container images for every release.

Well, technically speaking there are 3 non-deprecated Python versions (4
if you count PyPy), 3 non-deprecated R versions, luckily only one
non-deprecated Scala version and possible variations of JDK. Latest and
greatest are not necessarily the most popular and useful.

That's on top of native dependencies like BLAS (possibly in different
flavors and accounting for netlib-java break in development), libparquet
and libarrow.

Not all of these must be generated, but complexity grows pretty fast,
especially when native dependencies are involved. It gets worse if you
actually want to support Spark builds and tests ‒ for example to build
and fully test SparkR builds you need half of the universe including
some awkward LaTex style patches and such
(https://github.com/zero323/sparkr-build-sandbox).

And even without that, images tend to grow pretty large.

A few years back, Elias <https://github.com/eliasah> and I experimented
with the idea of generating different sets of Dockerfiles ‒
https://github.com/spark-in-a-box/spark-in-a-box ‒ though the intended use
cases were rather different (mostly quick setup of testbeds). The
project has been inactive for a while, with some private patches to fit
this or that use case.

>
> On Wed, Feb 5, 2020 at 2:56 PM Sean Owen  <mailto:sro...@gmail.com>> wrote:
>
> What would the images have - just the image for a worker?
> We wouldn't want to publish N permutations of Python, R, OS, Java,
> etc.
> But if we don't then we make one or a few choices of that combo, and
> then I wonder how many people find the image useful.
> If the goal is just to support Spark testing, that seems fine and
> tractable, but does it need to be 'public' as in advertised as a
> convenience binary? vs just some image that's hosted somewhere for the
> benefit of project infra.
>
> On Wed, Feb 5, 2020 at 12:16 PM Dongjoon Hyun
> mailto:dongjoon.h...@gmail.com>> wrote:
> >
> > Hi, All.
> >
> > From 2020, shall we have an official Docker image repository as
> an additional distribution channel?
> >
> > I'm considering the following images.
> >
> >     - Public binary release (no snapshot image)
> >     - Public non-Spark base image (OS + R + Python)
> >       (This can be used in GitHub Action Jobs and Jenkins K8s
> Integration Tests to speed up jobs and to have more stabler
> environments)
> >
> > Bests,
> > Dongjoon.
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
> <mailto:dev-unsubscr...@spark.apache.org>
>
>
>
> -- 
> Best Regards!
> Jiaxin Shan
> Tel:  412-230-7670
> Address: 470 2nd Ave S, Kirkland, WA
>
-- 
Best regards,
Maciej Szymkiewicz

Web: https://zero323.net
Keybase: https://keybase.io/zero323
Gigs: https://www.codementor.io/@zero323
PGP: C095AA7F33E6123A



signature.asc
Description: OpenPGP digital signature


Re: [DISCUSS] PostgreSQL dialect

2019-11-26 Thread Maciej Szymkiewicz
ere is no standard)
>
> We should still add PostgreSQL features that Spark doesn't have, or
> Spark's behavior violates SQL standard. But for others, let's just
> update the answer files of PostgreSQL tests.
>
> Any comments are welcome!
>
> Thanks,
> Wenchen

-- 
Best regards,
Maciej



Re: [DISCUSS] Deprecate Python < 3.6 in Spark 3.0

2019-10-30 Thread Maciej Szymkiewicz
Could we upgrade to PyPy3.6 v7.2.0?

On 10/30/19 9:45 PM, Shane Knapp wrote:
> one quick thing:  we currently test against python2.7, 3.6 *and*
> pypy2.5.1 (python2.7).
>
> what are our plans for pypy?
>
>
> On Wed, Oct 30, 2019 at 12:26 PM Dongjoon Hyun
> mailto:dongjoon.h...@gmail.com>> wrote:
>
> Thank you all. I made a PR for that.
>
> https://github.com/apache/spark/pull/26326
>
> On Tue, Oct 29, 2019 at 5:45 AM Takeshi Yamamuro
> mailto:linguin@gmail.com>> wrote:
>
> +1, too.
>
> On Tue, Oct 29, 2019 at 4:16 PM Holden Karau
> mailto:hol...@pigscanfly.ca>> wrote:
>
> +1 to deprecating but not yet removing support for 3.6
>
> On Tue, Oct 29, 2019 at 3:47 AM Shane Knapp
> mailto:skn...@berkeley.edu>> wrote:
>
> +1 to testing the absolute minimum number of python
> variants as possible.  ;)
>
> On Mon, Oct 28, 2019 at 7:46 PM Hyukjin Kwon
> mailto:gurwls...@gmail.com>> wrote:
>
> +1 from me as well.
>
> 2019년 10월 29일 (화) 오전 5:34, Xiangrui Meng
>  <mailto:m...@databricks.com>>님이 작성:
>
> +1. And we should start testing 3.7 and maybe
> 3.8 in Jenkins.
>
> On Thu, Oct 24, 2019 at 9:34 AM Dongjoon Hyun
>  <mailto:dongjoon.h...@gmail.com>> wrote:
>
> Thank you for starting the thread.
>
> In addition to that, we currently are
> testing Python 3.6 only in Apache Spark
> Jenkins environment.
>
> Given that Python 3.8 is already out and
> Apache Spark 3.0.0 RC1 will start next January
> (https://spark.apache.org/versioning-policy.html),
> I'm +1 for the deprecation (Python < 3.6)
> at Apache Spark 3.0.0.
>
> It's just a deprecation to prepare the
> next-step development cycle.
> Bests,
> Dongjoon.
>
>
> On Thu, Oct 24, 2019 at 1:10 AM Maciej
> Szymkiewicz  <mailto:mszymkiew...@gmail.com>> wrote:
>
> Hi everyone,
>
> While deprecation of Python 2 in 3.0.0
> has been announced
> 
> <https://spark.apache.org/news/plan-for-dropping-python-2-support.html>,
> there is no clear statement about
> specific continuing support of
> different Python 3 version.
>
> Specifically:
>
>   * Python 3.4 has been retired this year.
>   * Python 3.5 is already in the
> "security fixes only" mode and
> should be retired in the middle of
> 2020.
>
> Continued support of these two blocks
> adoption of many new Python features
>     (PEP 468)  and it is hard to justify
> beyond 2020.
>
> Should these two be deprecated in
> 3.0.0 as well?
>
> -- 
> Best regards,
> Maciej
>
>
>
> -- 
> Shane Knapp
> UC Berkeley EECS Research / RISELab Staff Technical Lead
> https://rise.cs.berkeley.edu
>
> -- 
> Twitter: https://twitter.com/holdenkarau
> Books (Learning Spark, High Performance Spark,
> etc.): https://amzn.to/2MaRAG9  <https://amzn.to/2MaRAG9>
> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>
>
>
> -- 
> ---
> Takeshi Yamamuro
>
>
>
> -- 
> Shane Knapp
> UC Berkeley EECS Research / RISELab Staff Technical Lead
> https://rise.cs.berkeley.edu

-- 
Best regards,
Maciej



[DISCUSS] Deprecate Python < 3.6 in Spark 3.0

2019-10-24 Thread Maciej Szymkiewicz
Hi everyone,

While deprecation of Python 2 in 3.0.0 has been announced
<https://spark.apache.org/news/plan-for-dropping-python-2-support.html>,
there is no clear statement about specific continuing support of
different Python 3 version.

Specifically:

  * Python 3.4 has been retired this year.
  * Python 3.5 is already in the "security fixes only" mode and should
be retired in the middle of 2020.

Continued support of these two versions blocks the adoption of many new
Python features (PEP 468), and it is hard to justify beyond 2020.

Should these two be deprecated in 3.0.0 as well?

-- 
Best regards,
Maciej



Is SPARK-9961 is still relevant?

2019-10-05 Thread Maciej Szymkiewicz
Hi everyone,

I just encountered SPARK-9961
<https://issues.apache.org/jira/browse/SPARK-9961> which seems to be
largely outdated today.

In the latest releases the majority of models compute different evaluation
metrics, exposed later through corresponding summaries. At the same time,
such a defaultEvaluator has little potential of being integrated with
Spark ML tuning tools, which depend on input and output columns, not the
input estimator.

Assuming that the answer is negative, and this ticket can be closed,
should the DeveloperApi annotations be removed? If I understand this ticket
correctly, the planned defaultEvaluator was the primary reason to use such
an annotation there.

-- 
Best regards,
Maciej



Re: Introduce FORMAT clause to CAST with SQL:2016 datetime patterns

2019-03-20 Thread Maciej Szymkiewicz
One concern here is the introduction of a second formatting convention.

This can not only cause confusion among users, but also result in some
hard-to-spot bugs when the wrong format, with a different meaning, is used.
This is already a problem for Python and R users, with week-year and
months / minutes mixups popping up from time to time.
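
A small, untested PySpark sketch of the week-year flavour of that mixup,
assuming the SimpleDateFormat-style pattern semantics in use at the time of
this thread (Spark 2.x, or Spark 3.x with spark.sql.legacy.timeParserPolicy=LEGACY):

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("2018-12-31",)], ["d"]) \
    .select(F.col("d").cast("date").alias("d"))

# "yyyy" is the calendar year, "YYYY" the week-based year; the mixup is
# silent for most of the year and only shows up around the new year:
df.select(
    F.date_format("d", "yyyy-MM-dd").alias("calendar_year"),  # 2018-12-31
    F.date_format("d", "YYYY-MM-dd").alias("week_year"),      # 2019-12-31
).show()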

On Wed, 20 Mar 2019 at 10:53, Gabor Kaszab  wrote:

> Hey Hive and Spark communities,
> [dev@impala in cc]
>
> I'm working on an Impala improvement to introduce the FORMAT clause within
> CAST() operator and to implement ISO SQL:2016 datetime pattern support for
> this new FORMAT clause:
> https://issues.apache.org/jira/browse/IMPALA-4018
>
> One example of the new format:
> SELECT(CAST("2018-01-02 09:15" as timestamp FORMAT "-MM-DD HH12:MI"));
>
> I have put together a document for my proposal of how to do this in Impala
> and what patterns we plan to support to cover the SQL standard and what
> additional patterns we propose to support on top of the standard's
> recommendation.
>
> https://docs.google.com/document/d/1V7k6-lrPGW7_uhqM-FhKl3QsxwCRy69v2KIxPsGjc1k/
>
> The reason I share this with the Hive and Spark communities because I feel
> it would be nice that these systems were in line with the Impala
> implementation. So I'd like to involve these communities to the planning
> phase of this task so that everyone can share their opinion about whether
> this make sense in the proposed form.
> Eventually I feel that each of these systems should have the SQL:2016
> datetime format and I think it would be nice to have it with a newly
> introduced CAST(..FORMAT..) clause.
>
> I would like to ask members from both Hive and Spark to take a look at my
> proposal and share their opinion from their own component's perspective. If
> we get on the same page I'll eventually open Jiras to cover this
> improvement for each mentioned systems.
>
> Cheers,
> Gabor
>
>
>
>

-- 

Regards,
Maciej


Re: Feature request: split dataset based on condition

2019-02-03 Thread Maciej Szymkiewicz
If the goal is to split the output, then `DataFrameWriter.partitionBy`
should do what you need, and no additional methods are required. If not, you
can also check Silex's muxPartitions implementation (see
https://stackoverflow.com/a/37956034), but its applications are rather
limited due to high resource usage.
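
A minimal sketch of the options mentioned in this thread (paths and column
names are made up):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.read.parquet("/data/events")                     # hypothetical input

# 1. Split the *output* in one pass: one directory per value of "category".
df.write.partitionBy("category").parquet("/data/events_by_category")

# 2. Split into in-memory datasets: cache the parent and filter N times.
df.cache()
small = df.filter(df["amount"] < 100)
large = df.filter(df["amount"] >= 100)

# 3. For random subsets there is also randomSplit.
train, test = df.randomSplit([0.8, 0.2], seed=42)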

On Sun, 3 Feb 2019 at 15:41, Sean Owen  wrote:

> I don't think Spark supports this model, where N inputs depending on
> parent are computed once at the same time. You can of course cache the
> parent and filter N times and do the same amount of work. One problem is,
> where would the N inputs live? they'd have to be stored if not used
> immediately, and presumably in any use case, only one of them would be used
> immediately. If you have a job that needs to split records of a parent into
> N subsets, and then all N subsets are used, you can do that -- you are just
> transforming the parent to one child that has rows with those N splits of
> each input row, and then consume that. See randomSplit() for maybe the best
> case, where it still produce N Datasets but can do so efficiently because
> it's just a random sample.
>
> On Sun, Feb 3, 2019 at 12:20 AM Moein Hosseini  wrote:
>
>> I don't consider it as method to apply filtering multiple time, instead
>> use it as semi-action not just transformation. Let's think that we have
>> something like map-partition which accept multiple lambda that each one
>> collect their ROW for their dataset (or something like it). Is it possible?
>>
>> On Sat, Feb 2, 2019 at 5:59 PM Sean Owen  wrote:
>>
>>> I think the problem is that can't produce multiple Datasets from one
>>> source in one operation - consider that reproducing one of them would mean
>>> reproducing all of them. You can write a method that would do the filtering
>>> multiple times but it wouldn't be faster. What do you have in mind that's
>>> different?
>>>
>>> On Sat, Feb 2, 2019 at 12:19 AM Moein Hosseini 
>>> wrote:
>>>
>>>> I've seen many application need to split dataset to multiple datasets
>>>> based on some conditions. As there is no method to do it in one place,
>>>> developers use *filter *method multiple times. I think it can be
>>>> useful to have method to split dataset based on condition in one iteration,
>>>> something like *partition* method of scala (of-course scala partition
>>>> just split list into two list, but something more general can be more
>>>> useful).
>>>> If you think it can be helpful, I can create Jira issue and work on it
>>>> to send PR.
>>>>
>>>> Best Regards
>>>> Moein
>>>>
>>>> --
>>>>
>>>> Moein Hosseini
>>>> Data Engineer
>>>> mobile: +98 912 468 1859 <+98+912+468+1859>
>>>> site: www.moein.xyz
>>>> email: moein...@gmail.com
>>>> [image: linkedin] <https://www.linkedin.com/in/moeinhm>
>>>> [image: twitter] <https://twitter.com/moein7tl>
>>>>
>>>>
>>
>> --
>>
>> Moein Hosseini
>> Data Engineer
>> mobile: +98 912 468 1859 <+98+912+468+1859>
>> site: www.moein.xyz
>> email: moein...@gmail.com
>> [image: linkedin] <https://www.linkedin.com/in/moeinhm>
>> [image: twitter] <https://twitter.com/moein7tl>
>>
>>

-- 

Regards,
Maciej


[PySpark] Revisiting PySpark type annotations

2019-01-25 Thread Maciej Szymkiewicz
Hello everyone,

I'd like to revisit the topic of adding PySpark type annotations in 3.0. It
has been discussed before (
http://apache-spark-developers-list.1001551.n3.nabble.com/Python-friendly-API-for-Spark-3-0-td25016.html
and
http://apache-spark-developers-list.1001551.n3.nabble.com/PYTHON-PySpark-typing-hints-td21560.html)
and is tracked by SPARK-17333 (
https://issues.apache.org/jira/browse/SPARK-17333). Is there any consensus
here?

In the spirit of full disclosure, I am trying to decide if, and if so to
what extent, to migrate my stub package (
https://github.com/zero323/pyspark-stubs) to 3.0 and beyond. Maintaining
such a package is relatively time consuming (not being an active PySpark
user anymore, it is the lowest priority for me at the moment), and if there
are any official plans to make it obsolete, that would be valuable
information for me.

If there are no plans to add native annotations to PySpark, I'd like to use
this opportunity to ask PySpark committers to drop by and open an issue (
https://github.com/zero323/pyspark-stubs/issues) when new methods are
introduced or there are changes in the existing API (PRs are of course
welcome as well). Thanks in advance.

-- 
Best,
Maciej


Re: Documentation of boolean column operators missing?

2018-10-23 Thread Maciej Szymkiewicz
Even if these were documented, Sphinx doesn't include dunder methods by
default (with the exception of __init__). There is the :special-members:
option, which could be passed to, for example, autoclass.
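
Roughly along these lines, in conf.py (a sketch only; autodoc_default_options
needs a reasonably recent Sphinx, otherwise :special-members: can be set on
the directive itself):

# conf.py (sketch): ask autodoc to document selected dunder methods.
autodoc_default_options = {
    "members": True,
    "special-members": "__and__, __or__, __invert__, __rand__, __ror__",
}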

On Tue, 23 Oct 2018 at 21:32, Sean Owen  wrote:

> (& and | are both logical and bitwise operators in Java and Scala, FWIW)
>
> I don't see them in the python docs; they are defined in column.py but
> they don't turn up in the docs. Then again, they're not documented:
>
> ...
> __and__ = _bin_op('and')
> __or__ = _bin_op('or')
> __invert__ = _func_op('not')
> __rand__ = _bin_op("and")
> __ror__ = _bin_op("or")
> ...
>
> I don't know if there's a good reason for it, but go ahead and doc
> them if they can be.
> While I suspect their meaning is obvious once it's clear they aren't
> the bitwise operators, that part isn't obvious/ While it matches
> Java/Scala/Scala-Spark syntax, and that's probably most important, it
> isn't typical for python.
>
> The comments say that it is not possible to overload 'and' and 'or',
> which would have been more natural.
>
> On Tue, Oct 23, 2018 at 2:20 PM Nicholas Chammas
>  wrote:
> >
> > Also, to clarify something for folks who don't work with PySpark: The
> boolean column operators in PySpark are completely different from those in
> Scala, and non-obvious to boot (since they overload Python's _bitwise_
> operators). So their apparent absence from the docs is surprising.
> >
> > On Tue, Oct 23, 2018 at 3:02 PM Nicholas Chammas <
> nicholas.cham...@gmail.com> wrote:
> >>
> >> So it appears then that the equivalent operators for PySpark are
> completely missing from the docs, right? That’s surprising. And if there
> are column function equivalents for |, &, and ~, then I can’t find those
> either for PySpark. Indeed, I don’t think such a thing is possible in
> PySpark. (e.g. (col('age') > 0).and(...))
> >>
> >> I can file a ticket about this, but I’m just making sure I’m not
> missing something obvious.
> >>
> >>
> >> On Tue, Oct 23, 2018 at 2:50 PM Sean Owen  wrote:
> >>>
> >>> Those should all be Column functions, really, and I see them at
> http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.Column
> >>>
> >>> On Tue, Oct 23, 2018, 12:27 PM Nicholas Chammas <
> nicholas.cham...@gmail.com> wrote:
> 
>  I can’t seem to find any documentation of the &, |, and ~ operators
> for PySpark DataFrame columns. I assume that should be in our docs
> somewhere.
> 
>  Was it always missing? Am I just missing something obvious?
> 
>  Nick
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


Re: Python friendly API for Spark 3.0

2018-09-15 Thread Maciej Szymkiewicz
For the reference I raised question of Python 2 support before -
http://apache-spark-developers-list.1001551.n3.nabble.com/Future-of-the-Python-2-support-td20094.html



On Sat, 15 Sep 2018 at 15:14, Alexander Shorin  wrote:

> What's the release due for Apache Spark 3.0? Will it be tomorrow or
> somewhere at the middle of 2019 year?
>
> I think we shouldn't care much about Python 2.x today, since quite
> soon it support turns into pumpkin. For today's projects I hope nobody
> takes into account support of 2.7 unless there is some legacy still to
> carry on, but do we want to take that baggage into Apache Spark 3.x
> era? The next time you may drop it would be only 4.0 release because
> of breaking change.
>
> --
> ,,,^..^,,,
> On Sat, Sep 15, 2018 at 2:21 PM Maciej Szymkiewicz
>  wrote:
> >
> > There is no need to ditch Python 2. There are basically two options
> >
> > Use stub files and limit yourself to support only Python 3 support.
> Python 3 users benefit from type hints, Python 2 users don't, but no core
> functionality is affected. This is the approach I've used with
> https://github.com/zero323/pyspark-stubs/.
> > Use comment based inline syntax or stub files and don't use backward
> incompatible features (primarily typing module -
> https://docs.python.org/3/library/typing.html). Both Python 2 and 3 is
> supported, but more advanced components are not. Small win for Python 2
> users, moderate loss for Python 3 users.
> >
> >
> >
> > On Sat, 15 Sep 2018 at 02:38, Nicholas Chammas <
> nicholas.cham...@gmail.com> wrote:
> >>
> >> Do we need to ditch Python 2 support to provide type hints? I don’t
> think so.
> >>
> >> Python lets you specify typing stubs that provide the same benefit
> without forcing Python 3.
> >>
> >> 2018년 9월 14일 (금) 오후 8:01, Holden Karau 님이 작성:
> >>>
> >>>
> >>>
> >>> On Fri, Sep 14, 2018, 3:26 PM Erik Erlandson 
> wrote:
> >>>>
> >>>> To be clear, is this about "python-friendly API" or "friendly python
> API" ?
> >>>
> >>> Well what would you consider to be different between those two
> statements? I think it would be good to be a bit more explicit, but I don't
> think we should necessarily limit ourselves.
> >>>>
> >>>>
> >>>> On the python side, it might be nice to take advantage of static
> typing. Requires python 3.6 but with python 2 going EOL, a spark-3.0 might
> be a good opportunity to jump the python-3-only train.
> >>>
> >>> I think we can make types sort of work without ditching 2 (the types
> only would work in 3 but it would still function in 2). Ditching 2 entirely
> would be a big thing to consider, I honestly hadn't been considering that
> but it could be from just spending so much time maintaining a 2/3 code
> base. I'd suggest reaching out to to user@ before making that kind of
> change.
> >>>>
> >>>>
> >>>> On Fri, Sep 14, 2018 at 12:15 PM, Holden Karau 
> wrote:
> >>>>>
> >>>>> Since we're talking about Spark 3.0 in the near future (and since
> some recent conversation on a proposed change reminded me) I wanted to open
> up the floor and see if folks have any ideas on how we could make a more
> Python friendly API for 3.0? I'm planning on taking some time to look at
> other systems in the solution space and see what we might want to learn
> from them but I'd love to hear what other folks are thinking too.
> >>>>>
> >>>>> --
> >>>>> Twitter: https://twitter.com/holdenkarau
> >>>>> Books (Learning Spark, High Performance Spark, etc.):
> https://amzn.to/2MaRAG9
> >>>>> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
> >>>>
> >>>>
> >
> >
>


Re: Python friendly API for Spark 3.0

2018-09-15 Thread Maciej Szymkiewicz
There is no need to ditch Python 2. There are basically two options (a toy
sketch of both follows below):

   - Use stub files and limit type hint support to Python 3 only. Python 3
   users benefit from type hints, Python 2 users don't, but no core
   functionality is affected. This is the approach I've used with
   https://github.com/zero323/pyspark-stubs/.
   - Use comment-based inline syntax or stub files and don't use backward
   incompatible features (primarily the typing module -
   https://docs.python.org/3/library/typing.html). Both Python 2 and 3 are
   supported, but more advanced components are not. Small win for Python 2
   users, moderate loss for Python 3 users.
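
A toy sketch of both options (a made-up function, not anything from the
PySpark API):

# Option 1: stub file entry (add.pyi) - Python 3 style syntax, checked only
# by tools such as MyPy, no runtime impact:
#
#     def add(x: int, y: int) -> int: ...

# Option 2: comment-based annotation, valid on both Python 2 and 3:
def add(x, y):
    # type: (int, int) -> int
    return x + y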



On Sat, 15 Sep 2018 at 02:38, Nicholas Chammas 
wrote:

> Do we need to ditch Python 2 support to provide type hints? I don’t think
> so.
>
> Python lets you specify typing stubs that provide the same benefit without
> forcing Python 3.
>
> 2018년 9월 14일 (금) 오후 8:01, Holden Karau 님이 작성:
>
>>
>>
>> On Fri, Sep 14, 2018, 3:26 PM Erik Erlandson  wrote:
>>
>>> To be clear, is this about "python-friendly API" or "friendly python
>>> API" ?
>>>
>> Well what would you consider to be different between those two
>> statements? I think it would be good to be a bit more explicit, but I don't
>> think we should necessarily limit ourselves.
>>
>>>
>>> On the python side, it might be nice to take advantage of static typing.
>>> Requires python 3.6 but with python 2 going EOL, a spark-3.0 might be a
>>> good opportunity to jump the python-3-only train.
>>>
>> I think we can make types sort of work without ditching 2 (the types only
>> would work in 3 but it would still function in 2). Ditching 2 entirely
>> would be a big thing to consider, I honestly hadn't been considering that
>> but it could be from just spending so much time maintaining a 2/3 code
>> base. I'd suggest reaching out to to user@ before making that kind of
>> change.
>>
>>>
>>> On Fri, Sep 14, 2018 at 12:15 PM, Holden Karau 
>>> wrote:
>>>
 Since we're talking about Spark 3.0 in the near future (and since some
 recent conversation on a proposed change reminded me) I wanted to open up
 the floor and see if folks have any ideas on how we could make a more
 Python friendly API for 3.0? I'm planning on taking some time to look at
 other systems in the solution space and see what we might want to learn
 from them but I'd love to hear what other folks are thinking too.

 --
 Twitter: https://twitter.com/holdenkarau
 Books (Learning Spark, High Performance Spark, etc.):
 https://amzn.to/2MaRAG9  
 YouTube Live Streams: https://www.youtube.com/user/holdenkarau

>>>
>>>


Re: [DISCUSS] move away from python doctests

2018-08-29 Thread Maciej Szymkiewicz
Hi Imran,

On Wed, 29 Aug 2018 at 22:26, Imran Rashid 
wrote:

> Hi Li,
>
> yes that makes perfect sense.  That more-or-less is the same as my view,
> though I framed it differently.  I guess in that case, I'm really asking:
>
> Can pyspark changes please be accompanied by more unit tests, and not
> assume we're getting coverage from doctests?
>

I don't think such assumptions are made, or at least I haven't seen any
evidence of that.

However, we often assume that particular components are already tested in
the Scala API (SQL, ML), and intentionally don't repeat these tests.
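
To make the distinction concrete, a toy example (not taken from the PySpark
test suite):

def double(x):
    """Return x multiplied by two.

    The doctest below documents the happy path and keeps the docstring honest:

    >>> double(2)
    4
    """
    return 2 * x


import unittest


class DoubleTest(unittest.TestCase):
    # A unit test is the natural place for corner cases and for running the
    # same logic under different configurations.
    def test_corner_cases(self):
        self.assertEqual(double(0), 0)
        self.assertEqual(double(-3), -6)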


>
> Imran
>
> On Wed, Aug 29, 2018 at 2:02 PM Li Jin  wrote:
>
>> Hi Imran,
>>
>> My understanding is that doctests and unittests are orthogonal - doctests
>> are used to make sure docstring examples are correct and are not meant to
>> replace unittests.
>> Functionalities are covered by unit tests to ensure correctness and
>> doctests are used to test the docstring, not the functionalities itself.
>>
>> There are issues with doctests, for example, we cannot test arrow related
>> functions in doctest because of pyarrow is optional dependency, but I think
>> that's a separate issue.
>>
>> Does this make sense?
>>
>> Li
>>
>> On Wed, Aug 29, 2018 at 6:35 PM Imran Rashid 
>> wrote:
>>
>>> Hi,
>>>
>>> I'd like to propose that we move away from such heavy reliance on
>>> doctests in python, and move towards more traditional unit tests.  The main
>>> reason is that its hard to share test code in doc tests.  For example, I
>>> was just looking at
>>>
>>> https://github.com/apache/spark/commit/82c18c240a6913a917df3b55cc5e22649561c4dd
>>>  and wondering if we had any tests for some of the pyspark changes.
>>> SparkSession.createDataFrame has doctests, but those are just run with one
>>> standard spark configuration, which does not enable arrow.  Its hard to
>>> easily reuse that test, just with another spark context with a different
>>> conf.  Similarly I've wondered about reusing test cases but with
>>> local-cluster instead of local mode.  I feel like they also discourage
>>> writing a test which tries to get more exhaustive coverage on corner cases.
>>>
>>> I'm not saying we should stop using doctests -- I see why they're nice.
>>> I just think they should really only be when you want that code snippet in
>>> the doc anyway, so you might as well test it.
>>>
>>> Admittedly, I'm not really a python-developer, so I could be totally
>>> wrong about the right way to author doctests -- pushback welcome!
>>>
>>> Thoughts?
>>>
>>> thanks,
>>> Imran
>>>
>>


Re: Spark DataFrame UNPIVOT feature

2018-08-22 Thread Maciej Szymkiewicz
Given the popularity of related SO questions:


   - https://stackoverflow.com/q/41670103/1560062
   - https://stackoverflow.com/q/42465568/1560062
   - https://stackoverflow.com/q/41670103/1560062

it is probably more a case of "nobody thought about asking" than "it is not
used often".

On Wed, 22 Aug 2018 at 00:07, Reynold Xin  wrote:

> Probably just because it is not used that often and nobody has submitted a
> patch for it. I've used pivot probably on average once a week (primarily in
> spreadsheets), but I've never used unpivot ...
>
>
> On Tue, Aug 21, 2018 at 3:06 PM Ivan Gozali  wrote:
>
>> Hi there,
>>
>> I was looking into why the UNPIVOT feature isn't implemented, given that
>> Spark already has PIVOT implemented natively in the DataFrame/Dataset API.
>>
>> Came across this JIRA  
>> which
>> talks about implementing PIVOT in Spark 1.6, but no mention whatsoever
>> regarding UNPIVOT, even though the JIRA curiously references a blog post
>> that talks about both PIVOT and UNPIVOT :)
>>
>> Is this because UNPIVOT is just simply generating multiple slim tables by
>> selecting each column, and making a union out of all of them?
>>
>> Thank you!
>>
>> --
>> Regards,
>>
>>
>> Ivan Gozali
>> Lecida
>> Email: i...@lecida.com
>>
>


Re: Increase Timeout or optimize Spark UT?

2017-08-24 Thread Maciej Szymkiewicz
It won't be used by PySpark and SparkR, will it?

On 23 August 2017 at 23:40, Michael Armbrust <mich...@databricks.com> wrote:

> I think we already set the number of partitions to 5 in tests
> <https://github.com/apache/spark/blob/6942aeeb0a0095a1ba85a817eb9e0edc410e5624/sql/core/src/test/scala/org/apache/spark/sql/test/TestSQLContext.scala#L60-L61>
> ?
>
> On Tue, Aug 22, 2017 at 3:25 PM, Maciej Szymkiewicz <
> mszymkiew...@gmail.com> wrote:
>
>> Hi,
>>
>> From my experience it is possible to cut quite a lot by reducing
>> spark.sql.shuffle.partitions to some reasonable value (let's say
>> comparable to the number of cores). 200 is a serious overkill for most of
>> the test cases anyway.
>>
>>
>> Best,
>> Maciej
>>
>>
>>
>> On 21 August 2017 at 03:00, Dong Joon Hyun <dh...@hortonworks.com> wrote:
>>
>>> +1 for any efforts to recover Jenkins!
>>>
>>>
>>>
>>> Thank you for the direction.
>>>
>>>
>>>
>>> Bests,
>>>
>>> Dongjoon.
>>>
>>>
>>>
>>> *From: *Reynold Xin <r...@databricks.com>
>>> *Date: *Sunday, August 20, 2017 at 5:53 PM
>>> *To: *Dong Joon Hyun <dh...@hortonworks.com>
>>> *Cc: *"dev@spark.apache.org" <dev@spark.apache.org>
>>> *Subject: *Re: Increase Timeout or optimize Spark UT?
>>>
>>>
>>>
>>> It seems like it's time to look into how to cut down some of the test
>>> runtimes. Test runtimes will slowly go up given the way development
>>> happens. 3 hr is already a very long time for tests to run.
>>>
>>>
>>>
>>>
>>>
>>> On Sun, Aug 20, 2017 at 5:45 PM, Dong Joon Hyun <dh...@hortonworks.com>
>>> wrote:
>>>
>>> Hi, All.
>>>
>>>
>>>
>>> Recently, Apache Spark master branch test (SBT with hadoop-2.7 / 2.6)
>>> has been hitting the build timeout.
>>>
>>>
>>>
>>> Please see the build time trend.
>>>
>>>
>>>
>>> https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Tes
>>> t%20(Dashboard)/job/spark-master-test-sbt-hadoop-2.7/buildTimeTrend
>>>
>>>
>>>
>>> All recent 22 builds fail due to timeout directly/indirectly. The last
>>> success (SBT with Hadoop-2.7) is 15th August.
>>>
>>>
>>>
>>> We may do the followings.
>>>
>>>
>>>
>>>1. Increase Build Timeout (3 hr 30 min)
>>>2. Optimize UTs (Scala/Java/Python/UT)
>>>
>>>
>>>
>>> But, Option 1 will be the immediate solution for now . Could you update
>>> the Jenkins setup?
>>>
>>>
>>>
>>> Bests,
>>>
>>> Dongjoon.
>>>
>>>
>>>
>>
>>
>


-- 

Z poważaniem,
Maciej Szymkiewicz


Re: Increase Timeout or optimize Spark UT?

2017-08-22 Thread Maciej Szymkiewicz
Hi,

From my experience it is possible to cut quite a lot by reducing
spark.sql.shuffle.partitions to some reasonable value (let's say comparable
to the number of cores). 200 is a serious overkill for most of the test
cases anyway.
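
For example, a test session could be built along these lines (the value 4 is
arbitrary, just meant to be in the ballpark of the number of local cores):

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .master("local[4]")
         .config("spark.sql.shuffle.partitions", 4)   # instead of the default 200
         .getOrCreate())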


Best,
Maciej



On 21 August 2017 at 03:00, Dong Joon Hyun <dh...@hortonworks.com> wrote:

> +1 for any efforts to recover Jenkins!
>
>
>
> Thank you for the direction.
>
>
>
> Bests,
>
> Dongjoon.
>
>
>
> *From: *Reynold Xin <r...@databricks.com>
> *Date: *Sunday, August 20, 2017 at 5:53 PM
> *To: *Dong Joon Hyun <dh...@hortonworks.com>
> *Cc: *"dev@spark.apache.org" <dev@spark.apache.org>
> *Subject: *Re: Increase Timeout or optimize Spark UT?
>
>
>
> It seems like it's time to look into how to cut down some of the test
> runtimes. Test runtimes will slowly go up given the way development
> happens. 3 hr is already a very long time for tests to run.
>
>
>
>
>
> On Sun, Aug 20, 2017 at 5:45 PM, Dong Joon Hyun <dh...@hortonworks.com>
> wrote:
>
> Hi, All.
>
>
>
> Recently, Apache Spark master branch test (SBT with hadoop-2.7 / 2.6) has
> been hitting the build timeout.
>
>
>
> Please see the build time trend.
>
>
>
> https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%
> 20Test%20(Dashboard)/job/spark-master-test-sbt-hadoop-2.7/buildTimeTrend
>
>
>
> All recent 22 builds fail due to timeout directly/indirectly. The last
> success (SBT with Hadoop-2.7) is 15th August.
>
>
>
> We may do the followings.
>
>
>
>1. Increase Build Timeout (3 hr 30 min)
>2. Optimize UTs (Scala/Java/Python/UT)
>
>
>
> But, Option 1 will be the immediate solution for now . Could you update
> the Jenkins setup?
>
>
>
> Bests,
>
> Dongjoon.
>
>
>


Re: Possible bug: inconsistent timestamp behavior

2017-08-15 Thread Maciej Szymkiewicz
These two are just not equivalent.

Spark SQL interprets long as seconds when casting between timestamps and
numerics, therefore
lit(148550335L).cast(org.apache.spark.sql.types.TimestampType)
represents 49043-09-23 21:26:400.0. This behavior is intended - see for
example https://issues.apache.org/jira/browse/SPARK-11724

java.sql.Timestamp expects milliseconds as an argument therefore lit(new
java.sql.Timestamp(148550335L)) represents 2017-01-27 08:49:10
.
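
The same point in a short PySpark sketch (this is just an illustration; the
exact values don't matter):

from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()

# Casting a long to timestamp treats the value as *seconds* since the epoch,
# so passing milliseconds there lands tens of thousands of years in the future.
spark.range(1).select(
    F.lit(1485503350).cast("timestamp").alias("seconds_since_epoch"),          # 2017-01-27 ...
    F.lit(1485503350000).cast("timestamp").alias("millis_misread_as_seconds"),  # year ~49043
).show(truncate=False)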

On 15 August 2017 at 13:16, assaf.mendelson <assaf.mendel...@rsa.com> wrote:

> Hi all,
>
> I encountered weird behavior for timestamp. It seems that when using lit
> to add it to column, the timestamp goes from milliseconds representation to
> seconds representation:
>
>
>
>
>
> scala> spark.range(1).withColumn("a", lit(new java.sql.Timestamp(
> 1485503350000L)).cast("long")).show()
>
> +---+----------+
>
> | id|         a|
>
> +---+----------+
>
> |  0|1485503350|
>
> +---+----------+
>
>
>
>
>
> scala> spark.range(1).withColumn("a", lit(1485503350000L).cast(org.
> apache.spark.sql.types.TimestampType).cast(org.apache.spark.sql.types.
> LongType)).show()
>
> +---+-------------+
>
> | id|            a|
>
> +---+-------------+
>
> |  0|1485503350000|
>
> +---+-------------+
>
>
>
>
>
> Is this a bug or am I missing something here?
>
>
>
> Thanks,
>
> Assaf
>
>
>
> --
> View this message in context: Possible bug: inconsistent timestamp
> behavior
> <http://apache-spark-developers-list.1001551.n3.nabble.com/Possible-bug-inconsistent-timestamp-behavior-tp22144.html>
> Sent from the Apache Spark Developers List mailing list archive
> <http://apache-spark-developers-list.1001551.n3.nabble.com/> at
> Nabble.com.
>



-- 

Z poważaniem,
Maciej Szymkiewicz


Re: Speeding up Catalyst engine

2017-07-25 Thread Maciej Bryński
Hi,

I did backport this to 2.2.
First results of tests (join of about 60 tables).
Vanilla Spark: 50 sec
With 20392 - 38 sec
With 20392 and spark.sql.selfJoinAutoResolveAmbiguity=false - 29 sec
Vanilla Spark with spark.sql.selfJoinAutoResolveAmbiguity=false - 34 sec

I didn't measure any difference when
changing spark.sql.constraintPropagation.enabled or any other spark.sql
option.

So I will leave your patch on top of 2.2
Thank you.

M.

2017-07-25 1:39 GMT+02:00 Liang-Chi Hsieh <vii...@gmail.com>:

>
> Hi Maciej,
>
> For backportting https://issues.apache.org/jira/browse/SPARK-20392, you
> can
> see the suggestion from committers on the PR. I think we don't expect it
> will be merged into 2.2.
>
>
>
> Maciej Bryński wrote
> > Hi Everyone,
> > I'm trying to speed up my Spark streaming application and I have
> following
> > problem.
> > I'm using a lot of joins in my app and full catalyst analysis is
> triggered
> > during every join.
> >
> > I found 2 options to speed up.
> >
> > 1) spark.sql.selfJoinAutoResolveAmbiguity  option
> > But looking at code:
> > https://github.com/apache/spark/blob/8cd9cdf17a7a4ad6f2eecd7c4b388c
> a363c20982/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala#L918
> >
> > Shouldn't lines 925-927 be before 920-922 ?
> >
> > 2) https://issues.apache.org/jira/browse/SPARK-20392
> >
> > Is it safe to use it on top of 2.2.0 ?
> >
> > Regards,
> > --
> > Maciek Bryński
>
>
>
>
>
> -
> Liang-Chi Hsieh | @viirya
> Spark Technology Center
> http://www.spark.tc/
> --
> View this message in context: http://apache-spark-
> developers-list.1001551.n3.nabble.com/Speeding-up-
> Catalyst-engine-tp22013p22014.html
> Sent from the Apache Spark Developers List mailing list archive at
> Nabble.com.
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


-- 
Maciek Bryński


Speeding up Catalyst engine

2017-07-24 Thread Maciej Bryński
Hi Everyone,
I'm trying to speed up my Spark streaming application and I have the
following problem.
I'm using a lot of joins in my app and a full Catalyst analysis is triggered
during every join.

I found 2 options to speed up.

1) spark.sql.selfJoinAutoResolveAmbiguity  option
But looking at code:
https://github.com/apache/spark/blob/8cd9cdf17a7a4ad6f2eecd7c4b388ca363c20982/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala#L918

Shouldn't lines 925-927 be before 920-922 ?

2) https://issues.apache.org/jira/browse/SPARK-20392

Is it safe to use it on top of 2.2.0 ?

Regards,
-- 
Maciek Bryński


Re: [ANNOUNCE] Announcing Apache Spark 2.2.0

2017-07-19 Thread Maciej Bryński
Oh yeah, new Spark version, new regression bugs :)

https://issues.apache.org/jira/browse/SPARK-21470

M.

2017-07-17 22:01 GMT+02:00 Sam Elamin :

> Well done!  This is amazing news :) Congrats and really cant wait to
> spread the structured streaming love!
>
> On Mon, Jul 17, 2017 at 5:25 PM, kant kodali  wrote:
>
>> +1
>>
>> On Tue, Jul 11, 2017 at 3:56 PM, Jean Georges Perrin  wrote:
>>
>>> Awesome! Congrats! Can't wait!!
>>>
>>> jg
>>>
>>>
>>> On Jul 11, 2017, at 18:48, Michael Armbrust 
>>> wrote:
>>>
>>> Hi all,
>>>
>>> Apache Spark 2.2.0 is the third release of the Spark 2.x line. This
>>> release removes the experimental tag from Structured Streaming. In
>>> addition, this release focuses on usability, stability, and polish,
>>> resolving over 1100 tickets.
>>>
>>> We'd like to thank our contributors and users for their contributions
>>> and early feedback to this release. This release would not have been
>>> possible without you.
>>>
>>> To download Spark 2.2.0, head over to the download page:
>>> http://spark.apache.org/downloads.html
>>>
>>> To view the release notes: https://spark.apache.or
>>> g/releases/spark-release-2-2-0.html
>>>
>>> *(note: If you see any issues with the release notes, webpage or
>>> published artifacts, please contact me directly off-list) *
>>>
>>> Michael
>>>
>>>
>>
>


-- 
Maciek Bryński


Re: Slowness of Spark Thrift Server

2017-07-17 Thread Maciej Bryński
I did the test on Spark 2.2.0 and the problem still exists.

Any ideas how to fix it?

Regards,
Maciek

2017-07-11 11:52 GMT+02:00 Maciej Bryński <mac...@brynski.pl>:

> Hi,
> I have following issue.
> I'm trying to use Spark as a proxy to Cassandra.
> The problem is the thrift server overhead.
>
> I'm using following query:
> select * from table where primary_key = 123
>
> Job time (from jobs tab) is around 50ms. (and it's similar to query time
> from SQL tab)
> Unfortunately query time from JDBC/ODBC Server is 650 ms.
> Any ideas why ? What could cause such an overhead ?
>
> Regards,
> --
> Maciek Bryński
>



-- 
Maciek Bryński


Slowness of Spark Thrift Server

2017-07-11 Thread Maciej Bryński
Hi,
I have the following issue.
I'm trying to use Spark as a proxy to Cassandra.
The problem is the thrift server overhead.

I'm using the following query:
select * from table where primary_key = 123

Job time (from the Jobs tab) is around 50 ms (and it's similar to the query
time from the SQL tab).
Unfortunately, the query time from the JDBC/ODBC server is 650 ms.
Any ideas why? What could cause such an overhead?

Regards,
-- 
Maciek Bryński


Re: Handling nulls in vector columns is non-trivial

2017-06-21 Thread Maciej Szymkiewicz
Since 2.2 there is Imputer:

https://github.com/apache/spark/blob/branch-2.2/examples/src/main/python/ml/imputer_example.py

which should at least partially address the problem.
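
A short sketch of what that looks like (toy data and column names; Imputer
works on numeric columns, so a vector column would still need to be assembled
afterwards):

from pyspark.ml.feature import Imputer
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [(1.0, float("nan")), (2.0, 3.0), (float("nan"), 5.0)],
    ["a", "b"])

imputer = Imputer(inputCols=["a", "b"], outputCols=["a_imputed", "b_imputed"])
imputer.fit(df).transform(df).show()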

On 06/22/2017 03:03 AM, Franklyn D'souza wrote:
> I just wanted to highlight some of the rough edges around using
> vectors in columns in dataframes. 
>
> If there is a null in a dataframe column containing vectors pyspark ml
> models like logistic regression will completely fail. 
>
> However from what i've read there is no good way to fill in these
> nulls with empty vectors. 
>
> Its not possible to create a literal vector column expressiong and
> coalesce it with the column from pyspark.
>  
> so we're left with writing a python udf which does this coalesce, this
> is really inefficient on large datasets and becomes a bottleneck for
> ml pipelines working with real world data.
>
> I'd like to know how other users are dealing with this and what plans
> there are to extend vector support for dataframes.
>
> Thanks!,
>
> Franklyn

-- 


-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: spark messing up handling of native dependency code?

2017-06-02 Thread Maciej Szymkiewicz
Maybe not related, but in general geotools are not thread safe, so using
them from workers is most likely a gamble.

On 06/03/2017 01:26 AM, Georg Heiler wrote:
> Hi,
>
> There is a weird problem with spark when handling native dependency code:
> I want to use a library (JAI) with spark to parse some spatial raster
> files. Unfortunately, there are some strange issues. JAI only works
> when running via the build tool i.e. `sbt run` when executed in spark.
>
> When executed via spark-submit the error is:
>
> java.lang.IllegalArgumentException: The input argument(s) may not
> be null.
> at
> javax.media.jai.ParameterBlockJAI.getDefaultMode(ParameterBlockJAI.java:136)
> at
> javax.media.jai.ParameterBlockJAI.<init>(ParameterBlockJAI.java:157)
> at
> javax.media.jai.ParameterBlockJAI.<init>(ParameterBlockJAI.java:178)
> at
> org.geotools.process.raster.PolygonExtractionProcess.execute(PolygonExtractionProcess.java:171)
>
> Which looks like some native dependency (I think GEOS is running in
> the background) is not there correctly.
>
> Assuming something is wrong with the class path I tried to run a plain
> java/scala function. but this one works just fine.
>
> Is spark messing with the class paths?
>
> I created a minimal example here:
> https://github.com/geoHeil/jai-packaging-problem
>
>
> Hope someone can shed some light on this problem,
> Regards,
> Georg 



Re: [PYTHON] PySpark typing hints

2017-05-23 Thread Maciej Szymkiewicz


On 05/23/2017 02:45 PM, Mendelson, Assaf wrote:
>
> You are correct,
>
> I actually did not look too deeply into it until now as I noticed you
> mentioned it is compatible with python 3 only and I saw in the github
> that mypy or pytype is required.
>
>  
>
> Because of that I made my suggestions with the thought of python 2.
>
>  
>
> Looking into it more deeply, I am wondering what is not supported? Are
> you talking about limitation for testing?
>

Since type checkers (unlike annotations) are not standardized, this
varies between projects and versions. For MyPy, quite a lot has changed
since I started annotating Spark.

A few months ago I wouldn't even have bothered looking at the list of
issues; today (as mentioned in the other message) we could remove
metaclasses and pass both Python 2 and Python 3 checks.

The other part is the typing module itself, as well as function annotations
(outside docstrings). But this is not a problem with stub files.
>
>  
>
> If I understand correctly then one can use this without any issues for
> pycharm (and other IDEs supporting the type hinting) even when
> developing for python 2.
>

This strictly depends on the type checker. I didn't follow the development,
but I got the impression that a lot changed, for example between PyCharm
2016.3 and 2017.1. I think the important point is that lack of
support doesn't break anything.
>
> In addition, the tests can test the existing pyspark, they just have
> to be run with a compatible packaging (e.g. mypy).
>
> Meaning that porting for python 2 would provide a very small advantage
> over the immediate advantages (IDE usage and testing for most cases).
>
>  
>
> Am I missing something?
>
>  
>
> Thanks,
>
>   Assaf.
>
>  
>
> *From:*Maciej Szymkiewicz [mailto:mszymkiew...@gmail.com]
> *Sent:* Tuesday, May 23, 2017 3:27 PM
> *To:* Mendelson, Assaf
> *Subject:* Re: [PYTHON] PySpark typing hints
>
>  
>
>  
>
>  
>
> On 05/23/2017 01:12 PM, assaf.mendelson wrote:
>
> That said, If we make a decision on the way to handle it then I
> believe it would be a good idea to start even with the bare
> minimum and continue to add to it (and therefore make it so many
> people can contribute). The code I added in github were basically
> the things I needed.
>
> I already have almost full coverage of the API, excluding some exotic
> part of the legacy streaming, so starting with bare minimum is not
> really required.
>
> The advantage of the first is that it is part of the code which means
> it is easier to make it updated. The main issue with this is that
> supporting auto generated code (as is the case in most functions) can
> be a little awkward and actually is a relate to a separate issue as it
> means pycharm marks most of the functions as an error (i.e.
> pyspark.sql.functions.XXX is marked as not there…)
>
>
> Comment-based annotations are not suitable for complex signatures with
> multi-version support.
>
> Also there is no support for overloading, therefore it is not possible
> to capture relationship between arguments, and arguments and return type.
>

-- 
Maciej Szymkiewicz



signature.asc
Description: OpenPGP digital signature


Re: [PYTHON] PySpark typing hints

2017-05-23 Thread Maciej Szymkiewicz
It doesn't break anything at all. You can take the stub files as-is, put
them into the PySpark root, and as long as users are not interested in type
checking, they won't have any runtime impact.

Surprisingly, the current MyPy build (mypy==0.511) reports only one
incompatibility with Python 2 (dynamic metaclasses), which could be
resolved without significant loss of functionality.

On 05/23/2017 12:08 PM, Reynold Xin wrote:
> Seems useful to do. Is there a way to do this so it doesn't break
> Python 2.x?
>
>
> On Sun, May 14, 2017 at 11:44 PM, Maciej Szymkiewicz
> <mszymkiew...@gmail.com <mailto:mszymkiew...@gmail.com>> wrote:
>
> Hi everyone,
>
> For the last few months I've been working on static type
> annotations for PySpark. For those of you, who are not familiar
> with the idea, typing hints have been introduced by PEP 484
> (https://www.python.org/dev/peps/pep-0484/
> <https://www.python.org/dev/peps/pep-0484/>) and further extended
> with PEP 526 (https://www.python.org/dev/peps/pep-0526/
> <https://www.python.org/dev/peps/pep-0526/>) with the main goal of
> providing information required for static analysis. Right now
> there a few tools which support typing hints, including Mypy
> (https://github.com/python/mypy <https://github.com/python/mypy>)
> and PyCharm
> 
> (https://www.jetbrains.com/help/pycharm/2017.1/type-hinting-in-pycharm.html
> 
> <https://www.jetbrains.com/help/pycharm/2017.1/type-hinting-in-pycharm.html>).
>  
> Type hints can be added using function annotations
> (https://www.python.org/dev/peps/pep-3107/
> <https://www.python.org/dev/peps/pep-3107/>, Python 3 only),
> docstrings, or source independent stub files
> (https://www.python.org/dev/peps/pep-0484/#stub-files
> <https://www.python.org/dev/peps/pep-0484/#stub-files>). Typing is
> optional, gradual and has no runtime impact.
>
> At this moment I've annotated majority of the API, including
> majority of pyspark.sql and pyspark.ml <http://pyspark.ml>. At
> this moment project is still rough around the edges, and may
> result in both false positive and false negatives, but I think it
> become mature enough to be useful in practice.
>
> The current version is compatible only with Python 3, but it is
> possible, with some limitations, to backport it to Python 2
> (though it is not on my todo list).
>
> There is a number of possible benefits for PySpark users and
> developers:
>
>   * Static analysis can detect a number of common mistakes to
> prevent runtime failures. Generic self is still fairly
> limited, so it is more useful with DataFrames, SS and ML than
> RDD, DStreams or RDD.
>   * Annotations can be used for documenting complex signatures
> (https://git.io/v95JN) including dependencies on arguments and
> value (https://git.io/v95JA).
>   * Detecting possible bugs in Spark (SPARK-20631) .
>   * Showing API inconsistencies.
>
> Roadmap
>
>   * Update the project to reflect Spark 2.2.
>   * Refine existing annotations.
>
> If there will be enough interest I am happy to contribute this
> back to Spark or submit to Typeshed
> (https://github.com/python/typeshed
> <https://github.com/python/typeshed> -  this would require a
> formal ASF approval, and since Typeshed doesn't provide
> versioning, is probably not the best option in our case).
>
> Further information:
>
>   * https://github.com/zero323/pyspark-stubs
> <https://github.com/zero323/pyspark-stubs> - GitHub repository
>
>   * 
> https://speakerdeck.com/marcobonzanini/static-type-analysis-for-robust-data-products-at-pydata-london-2017
> 
> <https://speakerdeck.com/marcobonzanini/static-type-analysis-for-robust-data-products-at-pydata-london-2017>
> - interesting presentation by Marco Bonzanini
>
> -- 
> Best,
> Maciej
>
>

-- 
Maciej Szymkiewicz



signature.asc
Description: OpenPGP digital signature


[PYTHON] PySpark typing hints

2017-05-14 Thread Maciej Szymkiewicz
Hi everyone,

For the last few months I've been working on static type annotations for
PySpark. For those of you who are not familiar with the idea, typing
hints were introduced by PEP 484
(https://www.python.org/dev/peps/pep-0484/) and further extended with
PEP 526 (https://www.python.org/dev/peps/pep-0526/) with the main goal
of providing information required for static analysis. Right now there are
a few tools which support typing hints, including Mypy
(https://github.com/python/mypy) and PyCharm
(https://www.jetbrains.com/help/pycharm/2017.1/type-hinting-in-pycharm.html).
Type hints can be added using function annotations
(https://www.python.org/dev/peps/pep-3107/, Python 3 only), docstrings,
or source-independent stub files
(https://www.python.org/dev/peps/pep-0484/#stub-files). Typing is
optional, gradual and has no runtime impact.
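
For those who haven't seen stub files before, a rough, simplified sketch of
what such entries look like (illustrative only, not the exact signatures used
in pyspark-stubs):

# functions.pyi (sketch)
from typing import Any, Callable, Union, overload
from pyspark.sql.column import Column
from pyspark.sql.types import DataType

def col(name: str) -> Column: ...
def upper(col: Union[Column, str]) -> Column: ...

# overload lets a stub tie the return type to how the function is called,
# e.g. udf applied directly to a function vs. used as a decorator factory:
@overload
def udf(f: Callable[..., Any]) -> Callable[..., Column]: ...
@overload
def udf(returnType: DataType = ...) -> Callable[[Callable[..., Any]], Callable[..., Column]]: ...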

At this moment I've annotated the majority of the API, including most of
pyspark.sql and pyspark.ml. The project is still rough around
the edges, and may produce both false positives and false negatives,
but I think it has become mature enough to be useful in practice.

The current version is compatible only with Python 3, but it is
possible, with some limitations, to backport it to Python 2 (though it
is not on my todo list).

There are a number of possible benefits for PySpark users and developers:

  * Static analysis can detect a number of common mistakes to prevent
runtime failures. Generic self is still fairly limited, so it is
more useful with DataFrames, SS and ML than with RDDs or DStreams.
  * Annotations can be used for documenting complex signatures
(https://git.io/v95JN), including dependencies on arguments and value
(https://git.io/v95JA).
  * Detecting possible bugs in Spark (SPARK-20631).
  * Showing API inconsistencies.

Roadmap

  * Update the project to reflect Spark 2.2.
  * Refine existing annotations.

If there will be enough interest I am happy to contribute this back to
Spark or submit to Typeshed (https://github.com/python/typeshed -  this
would require a formal ASF approval, and since Typeshed doesn't provide
versioning, is probably not the best option in our case).

Further information:

  * https://github.com/zero323/pyspark-stubs - GitHub repository

  * 
https://speakerdeck.com/marcobonzanini/static-type-analysis-for-robust-data-products-at-pydata-london-2017
- interesting presentation by Marco Bonzanini

-- 
Best,
Maciej



signature.asc
Description: OpenPGP digital signature


Re: [VOTE] Apache Spark 2.2.0 (RC1)

2017-04-29 Thread Maciej Szymkiewicz
I am not sure if it is relevant but explode_outer and posexplode_outer
seem to be broken: SPARK-20534



On 04/28/2017 12:49 AM, Sean Owen wrote:
> By the way the RC looks good. Sigs and license are OK, tests pass with
> -Phive -Pyarn -Phadoop-2.7. +1 from me.
>
> On Thu, Apr 27, 2017 at 7:31 PM Michael Armbrust
> > wrote:
>
> Please vote on releasing the following candidate as Apache Spark
> version 2.2.0. The vote is open until Tues, May 2nd, 2017 at 12:00
> PST and passes if a majority of at least 3 +1 PMC votes are cast.
>
> [ ] +1 Release this package as Apache Spark 2.2.0
> [ ] -1 Do not release this package because ...
>
>
> To learn more about Apache Spark, please see http://spark.apache.org/
>
> The tag to be voted on is v2.2.0-rc1
>  
> (8ccb4a57c82146c1a8f8966c7e64010cf5632cb6)
>
> List of JIRA tickets resolved can be found with this filter
> 
> .
>
> The release files, including signatures, digests, etc. can be
> found at:
> http://home.apache.org/~pwendell/spark-releases/spark-2.2.0-rc1-bin/
> 
>
> Release artifacts are signed with the following key:
> https://people.apache.org/keys/committer/pwendell.asc
>
> The staging repository for this release can be found at:
> https://repository.apache.org/content/repositories/orgapachespark-1235/
>
> The documentation corresponding to this release can be found at:
> http://people.apache.org/~pwendell/spark-releases/spark-2.2.0-rc1-docs/
> 
> 
>
>
> *FAQ*
>
> *How can I help test this release?*
> *
> *
> If you are a Spark user, you can help us test this release by
> taking an existing Spark workload and running on this release
> candidate, then reporting any regressions.
> *
> *
> *What should happen to JIRA tickets still targeting 2.2.0?*
> *
> *
> Committers should look at those and triage. Extremely important
> bug fixes, documentation, and API tweaks that impact compatibility
> should be worked on immediately. Everything else please retarget
> to 2.3.0 or 2.2.1.
>
> *But my bug isn't fixed!??!*
>
> In order to make timely releases, we will typically not hold the
> release unless the bug in question is a regression from 2.1.1.
>



Re: [VOTE] Apache Spark 2.1.1 (RC2)

2017-04-14 Thread Maciej Bryński
https://issues.apache.org/jira/browse/SPARK-12717

This bug has been in Spark since 1.6.0.
Any chance of getting this fixed?

M.

2017-04-14 6:39 GMT+02:00 Holden Karau :
> If it would help I'd be more than happy to look at kicking off the packaging
> for RC3 since I'v been poking around in Jenkins a bit (for SPARK-20216 &
> friends) (I'd still probably need some guidance from a previous release
> coordinator so I understand if that's not actually faster).
>
> On Mon, Apr 10, 2017 at 6:39 PM, DB Tsai  wrote:
>>
>> I backported the fix into both branch-2.1 and branch-2.0. Thanks.
>>
>> Sincerely,
>>
>> DB Tsai
>> --
>> Web: https://www.dbtsai.com
>> PGP Key ID: 0x5CED8B896A6BDFA0
>>
>>
>> On Mon, Apr 10, 2017 at 4:20 PM, Ryan Blue  wrote:
>> > DB,
>> >
>> > This vote already failed and there isn't a RC3 vote yet. If you backport
>> > the
>> > changes to branch-2.1 they will make it into the next RC.
>> >
>> > rb
>> >
>> > On Mon, Apr 10, 2017 at 3:55 PM, DB Tsai  wrote:
>> >>
>> >> -1
>> >>
>> >> I think that back-porting SPARK-20270 and SPARK-18555 are very
>> >> important
>> >> since it's a critical bug that na.fill will mess up the data in Long
>> >> even
>> >> the data isn't null.
>> >>
>> >> Thanks.
>> >>
>> >>
>> >> Sincerely,
>> >>
>> >> DB Tsai
>> >> --
>> >> Web: https://www.dbtsai.com
>> >> PGP Key ID: 0x5CED8B896A6BDFA0
>> >>
>> >> On Wed, Apr 5, 2017 at 11:12 AM, Holden Karau 
>> >> wrote:
>> >>>
>> >>> Following up, the issues with missing pypandoc/pandoc on the packaging
>> >>> machine has been resolved.
>> >>>
>> >>> On Tue, Apr 4, 2017 at 3:54 PM, Holden Karau 
>> >>> wrote:
>> 
>>  See SPARK-20216, if Michael can let me know which machine is being
>>  used
>>  for packaging I can see if I can install pandoc on it (should be
>>  simple but
>>  I know the Jenkins cluster is a bit on the older side).
>> 
>>  On Tue, Apr 4, 2017 at 3:06 PM, Holden Karau 
>>  wrote:
>> >
>> > So the fix is installing pandoc on whichever machine is used for
>> > packaging. I thought that was generally done on the machine of the
>> > person
>> > rolling the release so I wasn't sure it made sense as a JIRA, but
>> > from
>> > chatting with Josh it sounds like that part might be on of the
>> > Jenkins
>> > workers - is there a fixed one that is used?
>> >
>> > Regardless I'll file a JIRA for this when I get back in front of my
>> > desktop (~1 hour or so).
>> >
>> > On Tue, Apr 4, 2017 at 2:35 PM Michael Armbrust
>> >  wrote:
>> >>
>> >> Thanks for the comments everyone.  This vote fails.  Here's how I
>> >> think we should proceed:
>> >>  - [SPARK-20197] - SparkR CRAN - appears to be resolved
>> >>  - [SPARK-] - Python packaging - Holden, please file a JIRA and
>> >> report if this is a regression and if there is an easy fix that we
>> >> should
>> >> wait for.
>> >>
>> >> For all the other test failures, please take the time to look
>> >> through
>> >> JIRA and open an issue if one does not already exist so that we can
>> >> triage
>> >> if these are just environmental issues.  If I don't hear any
>> >> objections I'm
>> >> going to go ahead with RC3 tomorrow.
>> >>
>> >> On Sun, Apr 2, 2017 at 1:16 PM, Felix Cheung
>> >>  wrote:
>> >>>
>> >>> -1
>> >>> sorry, found an issue with SparkR CRAN check.
>> >>> Opened SPARK-20197 and working on fix.
>> >>>
>> >>> 
>> >>> From: holden.ka...@gmail.com  on behalf of
>> >>> Holden Karau 
>> >>> Sent: Friday, March 31, 2017 6:25:20 PM
>> >>> To: Xiao Li
>> >>> Cc: Michael Armbrust; dev@spark.apache.org
>> >>> Subject: Re: [VOTE] Apache Spark 2.1.1 (RC2)
>> >>>
>> >>> -1 (non-binding)
>> >>>
>> >>> Python packaging doesn't seem to have quite worked out (looking at
>> >>> PKG-INFO the description is "Description: ! missing pandoc do
>> >>> not upload
>> >>> to PyPI "), ideally it would be nice to have this as a version
>> >>> we
>> >>> upgrade to PyPi.
>> >>> Building this on my own machine results in a longer description.
>> >>>
>> >>> My guess is that whichever machine was used to package this is
>> >>> missing the pandoc executable (or possibly pypandoc library).
>> >>>
>> >>> On Fri, Mar 31, 2017 at 3:40 PM, Xiao Li 
>> >>> wrote:
>> 
>>  +1
>> 
>>  Xiao
>> 
>>  2017-03-30 16:09 GMT-07:00 Michael Armbrust
>>  :

Re: [Pyspark, SQL] Very slow IN operator

2017-04-06 Thread Maciej Bryński
2017-04-06 4:00 GMT+02:00 Michael Segel :
> Just out of curiosity, what would happen if you put your 10K values in to a 
> temp table and then did a join against it?

The answer is predicate pushdown.
In my case I'm using this kind of query on a JDBC table and the IN predicate
is executed on the DB in less than 1s.
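
In other words, something along these lines (connection details are
hypothetical), where explain() should list the IN filter under PushedFilters:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = (spark.read.format("jdbc")
      .option("url", "jdbc:postgresql://dbhost:5432/example")  # hypothetical
      .option("dbtable", "events")
      .load())

# The filter is compiled into the WHERE clause sent to the database,
# e.g. PushedFilters: [In(primary_key, [1,2,3])] in the physical plan.
df.where("primary_key in (1, 2, 3)").explain()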


Regards,
-- 
Maciek Bryński

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



[Pyspark, SQL] Very slow IN operator

2017-04-05 Thread Maciej Bryński
Hi,
I'm trying to run queries with many values in the IN operator.

The result is that for more than 10K values the IN operator gets slower and
slower.

For example, this code runs for about 20 seconds:

df = spark.range(0,10,1,1)
df.where('id in ({})'.format(','.join(map(str,range(10))))).count()

Any ideas how to improve this?
Is it a bug?
-- 
Maciek Bryński

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



[SQL] Unresolved reference with chained window functions.

2017-03-24 Thread Maciej Szymkiewicz
Forwarded from SO (http://stackoverflow.com/q/43007433). Looks like a
regression compared to 2.0.2.

scala> import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.expressions.Window

scala> val win_spec_max =
Window.partitionBy("x").orderBy("AmtPaid").rowsBetween(Window.unboundedPreceding,
0)
win_spec_max: org.apache.spark.sql.expressions.WindowSpec =
org.apache.spark.sql.expressions.WindowSpec@3433e418

scala> val df = Seq((1, 2.0), (1, 3.0), (1, 1.0), (1, -2.0), (1,
-1.0)).toDF("x", "AmtPaid")
df: org.apache.spark.sql.DataFrame = [x: int, AmtPaid: double]

scala> val df_with_sum = df.withColumn("AmtPaidCumSum",
sum(col("AmtPaid")).over(win_spec_max))
df_with_sum: org.apache.spark.sql.DataFrame = [x: int, AmtPaid: double
... 1 more field]

scala> val df_with_max = df_with_sum.withColumn("AmtPaidCumSumMax",
max(col("AmtPaidCumSum")).over(win_spec_max))
df_with_max: org.apache.spark.sql.DataFrame = [x: int, AmtPaid: double
... 2 more fields]

scala> df_with_max.explain
== Physical Plan ==
!Window [sum(AmtPaid#361) windowspecdefinition(x#360, AmtPaid#361 ASC
NULLS FIRST, ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS
AmtPaidCumSum#366, max(AmtPaidCumSum#366) windowspecdefinition(x#360,
AmtPaid#361 ASC NULLS FIRST, ROWS BETWEEN UNBOUNDED PRECEDING AND
CURRENT ROW) AS AmtPaidCumSumMax#372], [x#360], [AmtPaid#361 ASC NULLS
FIRST]
+- *Sort [x#360 ASC NULLS FIRST, AmtPaid#361 ASC NULLS FIRST], false, 0
   +- Exchange hashpartitioning(x#360, 200)
  +- LocalTableScan [x#360, AmtPaid#361]

scala> df_with_max.printSchema
root
 |-- x: integer (nullable = false)
 |-- AmtPaid: double (nullable = false)
 |-- AmtPaidCumSum: double (nullable = true)
 |-- AmtPaidCumSumMax: double (nullable = true)

scala> df_with_max.show
17/03/24 21:22:32 ERROR Executor: Exception in task 0.0 in stage 19.0
(TID 234)
org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Binding
attribute, tree: AmtPaidCumSum#366
at
org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:56)
   ...
Caused by: java.lang.RuntimeException: Couldn't find AmtPaidCumSum#366
in [sum#385,max#386,x#360,AmtPaid#361]
   ...

Is it a known issue or do we need a JIRA?

-- 
Best,
Maciej Szymkiewicz


-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



[ML][PYTHON] Collecting data in a class extending SparkSessionTestCase causes AttributeError:

2017-03-06 Thread Maciej Szymkiewicz
Hi everyone,

It is either too late or too early for me to think straight, so please
forgive me if it is something trivial. I am trying to add a test case
extending SparkSessionTestCase to pyspark.ml.tests (example patch
attached). If the test collects data, and there is another TestCase
extending SparkSessionTestCase executed before it, I get an
AttributeError due to _jsc being None:

==

ERROR: test_foo (pyspark.ml.tests.FooTest)

--

Traceback (most recent call last):

  File "/home/spark/python/pyspark/ml/tests.py", line 1258, in test_foo

  File "/home/spark/python/pyspark/sql/dataframe.py", line 389, in collect

with SCCallSiteSync(self._sc) as css:

  File "/home/spark/python/pyspark/traceback_utils.py", line 72, in __enter__

self._context._jsc.setCallSite(self._call_site)

AttributeError: 'NoneType' object has no attribute 'setCallSite'

--

If the TestCase is executed alone, it seems to work just fine.


Can anyone reproduce this? Is there something obvious I'm missing here?

-- 
Best,
Maciej

diff --git a/python/pyspark/ml/tests.py b/python/pyspark/ml/tests.py
index 3524160557..cc6e49d6cf 100755
--- a/python/pyspark/ml/tests.py
+++ b/python/pyspark/ml/tests.py
@@ -1245,6 +1245,17 @@ class ALSTest(SparkSessionTestCase):
 self.assertEqual(als.getFinalStorageLevel(), "DISK_ONLY")
 self.assertEqual(als._java_obj.getFinalStorageLevel(), "DISK_ONLY")
 
+als.fit(df).userFactors.collect()
+
+
+class FooTest(SparkSessionTestCase):
+def test_foo(self):
+df = self.spark.createDataFrame(
+[(0, 0, 4.0), (0, 1, 2.0), (1, 1, 3.0), (1, 2, 4.0), (2, 1, 1.0), 
(2, 2, 5.0)],
+["user", "item", "rating"])
+als = ALS().setMaxIter(1).setRank(1)
+als.fit(df).userFactors.collect()
+
 
 class DefaultValuesTests(PySparkTestCase):
 """


signature.asc
Description: OpenPGP digital signature


Re: [PYTHON][DISCUSS] Moving to cloudpickle and or Py4J as a dependencies?

2017-02-14 Thread Maciej Szymkiewicz
I don't have any strong views, so just to highlight possible issues:

  * Based on different issues I've seen there is a substantial number of
    users who depend on system-wide Python installations. As far as I
    am aware, neither Py4j nor cloudpickle is present in the standard
    system repositories in Debian or Red Hat derivatives.
  * Assuming that Spark is committed to supporting Python 2 beyond its
    end of life, we have to be sure that any external dependency has the
    same policy.
  * Py4j is missing from the default Anaconda channel. Not a big issue, just
    a small annoyance.
  * External dependencies with pinned versions add some overhead to
    development across versions (effectively we may need a separate env
    for each major Spark release). I've seen small inconsistencies in
    PySpark behavior with different Py4j versions, so it is not
    completely hypothetical.
  * Adding possible version conflicts. It is probably not a big risk, but
    something to consider (for example in the combination Blaze + Dask +
    PySpark).
  * Adding another party the user has to trust.

On 02/14/2017 12:22 AM, Holden Karau wrote:
> It's a good question. Py4J seems to have been updated 5 times in 2016
> and is a bit involved (from a review point of view verifying the zip
> file contents is somewhat tedious).
>
> cloudpickle is a bit difficult to tell since we can have changes to
> cloudpickle which aren't correctly tagged as backporting changes from
> the fork (and this can take awhile to review since we don't always
> catch them right away as being backports).
>
> Another difficulty with looking at backports is that since our review
> process for PySpark has historically been on the slow side, changes
> benefiting systems like dask or IPython parallel were not backported
> to Spark unless they caused serious errors.
>
> I think the key benefits are better test coverage of the forked
> version of cloudpickle, using a more standardized packaging of
> dependencies, simpler updates of dependencies reduces friction to
> gaining benefits from other related projects work - Python
> serialization really isn't our secret sauce.
>
> If I'm missing any substantial benefits or costs I'd love to know :)
>
> On Mon, Feb 13, 2017 at 3:03 PM, Reynold Xin <r...@databricks.com
> <mailto:r...@databricks.com>> wrote:
>
> With any dependency update (or refactoring of existing code), I
> always ask this question: what's the benefit? In this case it
> looks like the benefit is to reduce efforts in backports. Do you
> know how often we needed to do those?
>
>
> On Tue, Feb 14, 2017 at 12:01 AM, Holden Karau
> <hol...@pigscanfly.ca <mailto:hol...@pigscanfly.ca>> wrote:
>
> Hi PySpark Developers,
>
> Cloudpickle is a core part of PySpark, and is originally
> copied from (and improved from) picloud. Since then other
> projects have found cloudpickle useful and a fork of
> cloudpickle <https://github.com/cloudpipe/cloudpickle> was
> created and is now maintained as its own library
> <https://pypi.python.org/pypi/cloudpickle> (with better test
> coverage and resulting bug fixes I understand). We've had a
> few PRs backporting fixes from the cloudpickle project into
> Spark's local copy of cloudpickle - how would people feel
> about moving to taking an explicit (pinned) dependency on
> cloudpickle?
>
> We could add cloudpickle to the setup.py and a
> requirements.txt file for users who prefer not to do a system
> installation of PySpark.
>
> Py4J is maybe even a simpler case, we currently have a zip of
> py4j in our repo but could instead have a pinned version
> required. While we do depend on a lot of py4j internal APIs,
> version pinning should be sufficient to ensure functionality
> (and simplify the update process).
>
> Cheers,
>
>     Holden :)
>
> -- 
> Twitter: https://twitter.com/holdenkarau
> <https://twitter.com/holdenkarau>
>
>
>
>
>
> -- 
> Cell : 425-233-8271
> Twitter: https://twitter.com/holdenkarau

-- 
Maciej Szymkiewicz



Re: welcoming Takuya Ueshin as a new Apache Spark committer

2017-02-13 Thread Maciej Szymkiewicz
Congratulations!


On 02/13/2017 08:16 PM, Reynold Xin wrote:
> Hi all,
>
> Takuya-san has recently been elected an Apache Spark committer. He's
> been active in the SQL area and writes very small, surgical patches
> that are high quality. Please join me in congratulating Takuya-san!
>


-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: [SQL][ML] Pipeline performance regression between 1.6 and 2.x

2017-02-03 Thread Maciej Szymkiewicz
Hi Liang-Chi,

Thank you for the updates. This looks promising.


On 02/03/2017 08:34 AM, Liang-Chi Hsieh wrote:
> Hi Maciej,
>
> FYI, this fix is submitted at https://github.com/apache/spark/pull/16785.
>
>
> Liang-Chi Hsieh wrote
>> Hi Maciej,
>>
>> After looking into the details of the time spent on preparing the executed
>> plan, the cause of the significant difference between 1.6 and current
>> codebase when running the example, is the optimization process to generate
>> constraints.
>>
>> There seems few operations in generating constraints which are not
>> optimized. Plus the fact the query plan grows continuously, the time spent
>> on generating constraints increases more and more.
>>
>> I am trying to reduce the time cost. Although not as low as 1.6 because we
>> can't remove the process of generating constraints, it is significantly
>> lower than current codebase (74294 ms -> 2573 ms).
>>
>> 385 ms
>> 107 ms
>> 46 ms
>> 58 ms
>> 64 ms
>> 105 ms
>> 86 ms
>> 122 ms
>> 115 ms
>> 114 ms
>> 100 ms
>> 109 ms
>> 169 ms
>> 196 ms
>> 174 ms
>> 212 ms
>> 290 ms
>> 254 ms
>> 318 ms
>> 405 ms
>> 347 ms
>> 443 ms
>> 432 ms
>> 500 ms
>> 544 ms
>> 619 ms
>> 697 ms
>> 683 ms
>> 807 ms
>> 802 ms
>> 960 ms
>> 1010 ms
>> 1155 ms
>> 1251 ms
>> 1298 ms
>> 1388 ms
>> 1503 ms
>> 1613 ms
>> 2279 ms
>> 2349 ms
>> 2573 ms
>>
>> Liang-Chi Hsieh wrote
>>> Hi Maciej,
>>>
>>> Thanks for the info you provided.
>>>
>>> I tried to run the same example on 1.6 and on the current branch and
>>> recorded the time spent on preparing the executed plan in each case.
>>>
>>> Current branch:
>>>
>>> 292 ms  
>>>
>>> 95 ms 
>>> 57 ms
>>> 34 ms
>>> 128 ms
>>> 120 ms
>>> 63 ms
>>> 106 ms
>>> 179 ms
>>> 159 ms
>>> 235 ms
>>> 260 ms
>>> 334 ms
>>> 464 ms
>>> 547 ms 
>>> 719 ms
>>> 942 ms
>>> 1130 ms
>>> 1928 ms
>>> 1751 ms
>>> 2159 ms
>>> 2767 ms
>>>  ms
>>> 4175 ms
>>> 5106 ms
>>> 6269 ms
>>> 7683 ms
>>> 9210 ms
>>> 10931 ms
>>> 13237 ms
>>> 15651 ms
>>> 19222 ms
>>> 23841 ms
>>> 26135 ms
>>> 31299 ms
>>> 38437 ms
>>> 47392 ms
>>> 51420 ms
>>> 60285 ms
>>> 69840 ms
>>> 74294 ms
>>>
>>> 1.6:
>>>
>>> 3 ms
>>> 4 ms
>>> 10 ms
>>> 4 ms
>>> 17 ms
>>> 8 ms
>>> 12 ms
>>> 21 ms
>>> 15 ms
>>> 15 ms
>>> 19 ms
>>> 23 ms
>>> 28 ms
>>> 28 ms
>>> 58 ms
>>> 39 ms
>>> 43 ms
>>> 61 ms
>>> 56 ms
>>> 60 ms
>>> 81 ms
>>> 73 ms
>>> 100 ms
>>> 91 ms
>>> 96 ms
>>> 116 ms
>>> 111 ms
>>> 140 ms
>>> 127 ms
>>> 142 ms
>>> 148 ms
>>> 165 ms
>>> 171 ms
>>> 198 ms
>>> 200 ms
>>> 233 ms
>>> 237 ms
>>> 253 ms
>>> 256 ms
>>> 271 ms
>>> 292 ms
>>> 452 ms
>>>
>>> Although both take more time after each iteration due to the growing
>>> query plan, it is obvious that the current branch takes much more time
>>> than the 1.6 branch. The optimizer and query planning in the current
>>> branch are much more complicated than in 1.6.
>>> zero323 wrote
>>>> Hi Liang-Chi,
>>>>
>>>> Thank you for your answer and PR, but I think I wasn't specific
>>>> enough. In hindsight I should have illustrated this better. What really
>>>> troubles me here is a pattern of growing delays. Difference between
>>>> 1.6.3 (roughly 20s runtime since the first job):
>>>>
>>>> [image: 1.6 timeline]
>>>>
>>>> vs 2.1.0 (45 minutes or so in a bad case):
>>>>
>>>> [image: 2.1.0 timeline]
>>>>
>>>> The code is just an example and it is intentionally dumb. You can easily
>>>> mask this with caching, or by using significantly larger data sets.

Re: [SQL][ML] Pipeline performance regression between 1.6 and 2.x

2017-02-02 Thread Maciej Szymkiewicz
Hi Liang-Chi,

Thank you for your answer and PR, but I think I wasn't specific
enough. In hindsight I should have illustrated this better. What really
troubles me here is a pattern of growing delays. Difference between
1.6.3 (roughly 20s runtime since the first job):

[image: 1.6 timeline]

vs 2.1.0 (45 minutes or so in a bad case):

[image: 2.1.0 timeline]

The code is just an example and it is intentionally dumb. You can easily
mask this with caching, or by using significantly larger data sets. So I
guess the question I am really interested in is: what changed between
1.6.3 and 2.x (this is more or less consistent across 2.0, 2.1 and
current master) to cause this and, more importantly, is it a feature or is
it a bug? I admit I chose a lazy path here and didn't spend much time
(yet) trying to dig deeper.

I can see a bit higher memory usage and a bit more intensive GC activity,
but nothing I would really blame for this behavior, and the duration of
individual jobs is comparable, if anything slightly in favor of 2.x. Neither
StringIndexer nor OneHotEncoder changed much in 2.x. They used RDDs for
fitting in 1.6 and, as far as I can tell, they still do that in 2.x. And
the problem doesn't look all that related to the data processing part in the
first place.
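
For completeness, here is the kind of masking I have in mind: fitting the
indexers by hand and cutting the query plan between the steps, so the
optimizer always sees a small plan. This is only a minimal sketch -
truncatePlan is ad-hoc glue code, not a Spark API - and it reuses the
columns from the example in my original post.

import org.apache.spark.ml.Transformer
import org.apache.spark.ml.feature.StringIndexer
import org.apache.spark.sql.DataFrame

// Rebuilding the DataFrame from its RDD and schema cuts the lineage, so the
// analyzed/optimized plan does not keep growing across stages.
def truncatePlan(df: DataFrame): DataFrame =
  df.sparkSession.createDataFrame(df.rdd, df.schema)

def fitIndexers(df: DataFrame): (DataFrame, Seq[Transformer]) =
  df.columns.tail.foldLeft((df, Seq.empty[Transformer])) {
    case ((current, models), c) =>
      val model = new StringIndexer()
        .setInputCol(c)
        .setOutputCol(s"${c}_indexed")
        .setHandleInvalid("skip")
        .fit(current)
      // Apply the stage, then cut the plan so the next fit starts fresh.
      (truncatePlan(model.transform(current)), models :+ model)
  }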


On 02/02/2017 07:22 AM, Liang-Chi Hsieh wrote:
> Hi Maciej,
>
> FYI, the PR is at https://github.com/apache/spark/pull/16775.
>
>
> Liang-Chi Hsieh wrote
>> Hi Maciej,
>>
>> Basically the fitting algorithm in Pipeline is an iterative operation.
>> Running an iterative algorithm on a Dataset produces RDD lineages and query
>> plans that grow fast. Without cache and checkpoint, it gets slower as
>> the iteration number increases.
>>
>> I think that is why a Pipeline with a long chain of stages takes much
>> longer to finish. As it is not uncommon to have long chains of stages
>> in a Pipeline, we should improve this. I will submit a PR for this.
>> zero323 wrote
>>> Hi everyone,
>>>
>>> While experimenting with ML pipelines I experience a significant
>>> performance regression when switching from 1.6.x to 2.x.
>>>
>>> import org.apache.spark.ml.{Pipeline, PipelineStage}
>>> import org.apache.spark.ml.feature.{OneHotEncoder, StringIndexer,
>>> VectorAssembler}
>>>
>>> val df = (1 to 40).foldLeft(Seq((1, "foo"), (2, "bar"), (3,
>>> "baz")).toDF("id", "x0"))((df, i) => df.withColumn(s"x$i", $"x0"))
>>> val indexers = df.columns.tail.map(c => new StringIndexer()
>>>   .setInputCol(c)
>>>   .setOutputCol(s"${c}_indexed")
>>>   .setHandleInvalid("skip"))
>>>
>>> val encoders = indexers.map(indexer => new OneHotEncoder()
>>>   .setInputCol(indexer.getOutputCol)
>>>   .setOutputCol(s"${indexer.getOutputCol}_encoded")
>>>   .setDropLast(true))
>>>
>>> val assembler = new
>>> VectorAssembler().setInputCols(encoders.map(_.getOutputCol))
>>> val stages: Array[PipelineStage] = indexers ++ encoders :+ assembler
>>>
>>> new Pipeline().setStages(stages).fit(df).transform(df).show
>>>
>>> Task execution time is comparable and executors are idle most of the
>>> time, so it looks like it is a problem with the optimizer. Is this a
>>> known issue? Are there any changes I've missed that could lead to this
>>> behavior?
>>>
>>> -- 
>>> Best,
>>> Maciej
>>>
>>>
>>> -
>>> To unsubscribe e-mail: 
>>> dev-unsubscribe@.apache
>
>
>
>
> -
> Liang-Chi Hsieh | @viirya 
> Spark Technology Center 
> http://www.spark.tc/ 
> --
> View this message in context: 
> http://apache-spark-developers-list.1001551.n3.nabble.com/SQL-ML-Pipeline-performance-regression-between-1-6-and-2-x-tp20803p20822.html
> Sent from the Apache Spark Developers List mailing list archive at Nabble.com.
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>

-- 
Maciej Szymkiewicz



[SQL][ML] Pipeline performance regression between 1.6 and 2.x

2017-01-31 Thread Maciej Szymkiewicz
Hi everyone,

While experimenting with ML pipelines I experience a significant
performance regression when switching from 1.6.x to 2.x.

import org.apache.spark.ml.{Pipeline, PipelineStage}
import org.apache.spark.ml.feature.{OneHotEncoder, StringIndexer,
VectorAssembler}

val df = (1 to 40).foldLeft(Seq((1, "foo"), (2, "bar"), (3,
"baz")).toDF("id", "x0"))((df, i) => df.withColumn(s"x$i", $"x0"))
val indexers = df.columns.tail.map(c => new StringIndexer()
  .setInputCol(c)
  .setOutputCol(s"${c}_indexed")
  .setHandleInvalid("skip"))

val encoders = indexers.map(indexer => new OneHotEncoder()
  .setInputCol(indexer.getOutputCol)
  .setOutputCol(s"${indexer.getOutputCol}_encoded")
  .setDropLast(true))

val assembler = new
VectorAssembler().setInputCols(encoders.map(_.getOutputCol))
val stages: Array[PipelineStage] = indexers ++ encoders :+ assembler

new Pipeline().setStages(stages).fit(df).transform(df).show

Task execution time is comparable and executors are idle most of the time,
so it looks like it is a problem with the optimizer. Is this a known
issue? Are there any changes I've missed that could lead to this behavior?
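
For what it's worth, here is a quick and admittedly crude way to see that it
is the plan, rather than the tasks, that keeps growing. This is only a
sketch: it reuses df and indexers from the snippet above and uses the length
of the optimized plan's string representation as a stand-in for plan size.

indexers.zipWithIndex.foldLeft(df) { case (current, (indexer, i)) =>
  // Fit and apply one stage at a time; the optimized plan (and the time
  // needed to produce it) gets noticeably larger with every stage.
  val next = indexer.fit(current).transform(current)
  println(s"stage $i optimized plan size: " +
    s"${next.queryExecution.optimizedPlan.treeString.length} characters")
  next
}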

-- 
Best,
Maciej


-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: [SQL][SPARK-14160] Maximum interval for o.a.s.sql.functions.window

2017-01-18 Thread Maciej Szymkiewicz
Thanks for the response Burak,

Like any sane person I try to steer away from objects which have both
"calendar" and "unsafe" in their fully qualified names, but if there is no
bigger picture I missed here I would go with 1 as well, and of course
fix the error message. I understand this has been introduced with
structured streaming in mind, but it is a useful feature in general,
not only at high-precision time scales. To be honest I would love to see
a generalized version which could be used (I mean without hacking) with
an arbitrary numeric sequence; it could address at least some scenarios
in which people try to use window functions without a PARTITION BY clause
and fail miserably (a rough sketch of what I mean is below).

Regarding ambiguity... Sticking with days doesn't really resolve the
problem, does it? If one were to nitpick, it doesn't look like this
implementation even touches all the subtleties of DST or leap seconds.
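
To make this concrete, here is the kind of generalized bucketing I mean,
done with plain column arithmetic rather than o.a.s.sql.functions.window.
The data, column names and bucket size are made up purely for illustration,
and the snippet assumes the spark-shell.

import org.apache.spark.sql.functions.{avg, col, floor}

// Bucket an arbitrary numeric column into fixed-width "windows" and
// aggregate per bucket - no timestamps and no PARTITION BY required.
val events = Seq((1L, 0.5), (7L, 1.5), (12L, 2.5), (19L, 3.5)).toDF("seq", "value")
val bucketSize = 10L

events
  .withColumn("bucket_start", floor(col("seq") / bucketSize) * bucketSize)
  .groupBy("bucket_start")
  .agg(avg("value"))
  .show()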



On 01/18/2017 05:52 PM, Burak Yavuz wrote:
> Hi Maciej,
>
> I believe it would be useful to either fix the documentation or fix
> the implementation. I'll leave it to the community to comment on. The
> code right now disallows intervals provided in months and years,
> because they are not a "consistently" fixed amount of time. A month
> can be 28, 29, 30, or 31 days. A year is 12 months for sure, but is it
> 360 days (sometimes used in finance), 365 days or 366 days? 
>
> Therefore we could either:
>   1) Allow windowing when intervals are given in days or smaller units,
> even though the total could be as long as 365 days, and fix the
> documentation.
>   2) Explicitly disallow it, as there may be a lot of data for a given
> window, but partial aggregations should help with that.
>
> My thoughts are to go with 1. What do you think?
>
> Best,
> Burak
>
> On Wed, Jan 18, 2017 at 10:18 AM, Maciej Szymkiewicz
> <mszymkiew...@gmail.com <mailto:mszymkiew...@gmail.com>> wrote:
>
> Hi,
>
> Can I ask for some clarifications regarding intended behavior of
> window / TimeWindow?
>
> PySpark documentation states that "Windows in the order of months
> are not supported". This is further confirmed by the checks in
> TimeWindow.getIntervalInMicroseconds (https://git.io/vMP5l).
>
> Surprisingly enough, we can pass an interval much larger than a month
> by expressing the interval in days or another unit of higher
> precision. So this fails:
>
> Seq("2017-01-01").toDF("date").groupBy(window($"date", "1 month"))
>
> while following is accepted:
>
> Seq("2017-01-01").toDF("date").groupBy(window($"date", "999 days"))
>
> with results which look sensible at first glance.
>
> Is this a matter of faulty validation logic (months will be
> assigned only if there is a match against years or months,
> https://git.io/vMPdi), or is it expected behavior and I simply
> misunderstood the intentions?
>
> -- 
> Best,
> Maciej
>
>



[SQL][SPARK-14160] Maximum interval for o.a.s.sql.functions.window

2017-01-18 Thread Maciej Szymkiewicz
Hi,

Can I ask for some clarification regarding the intended behavior of window
/ TimeWindow?

PySpark documentation states that "Windows in the order of months are
not supported". This is further confirmed by the checks in
TimeWindow.getIntervalInMicroseconds (https://git.io/vMP5l).

Surprisingly enough, we can pass an interval much larger than a month by
expressing the interval in days or another unit of higher precision. So
this fails:

Seq("2017-01-01").toDF("date").groupBy(window($"date", "1 month"))

while following is accepted:

Seq("2017-01-01").toDF("date").groupBy(window($"date", "999 days"))

with results which look sensible at first glance.

Is this a matter of faulty validation logic (months will be assigned
only if there is a match against years or months, https://git.io/vMPdi),
or is it expected behavior and I simply misunderstood the intentions?

-- 
Best,
Maciej


