Re: EXT: Dual Write to HDFS and MinIO in faster way

2024-05-21 Thread Nicholas Chammas
[dev list to bcc]

This is a question for the user list or for Stack Overflow. The dev list is for 
discussions related to the development of Spark itself.

Nick


> On May 21, 2024, at 6:58 AM, Prem Sahoo  wrote:
> 
> Hello Vibhor,
> Thanks for the suggestion.
> I am looking for other alternatives where the same dataframe can be written to 
> two destinations without re-execution and without cache or persist.
> 
> Can someone help me with scenario 2?
> How can we make Spark write to MinIO faster?
> Sent from my iPhone
> 
>> On May 21, 2024, at 1:18 AM, Vibhor Gupta  wrote:
>> 
>> 
>> Hi Prem,
>>  
>> You can try to write to HDFS then read from HDFS and write to MinIO.
>>  
>> This will prevent duplicate transformation.
>>  
>> You can also try persisting the dataframe using the DISK_ONLY level.
>>  
>> Regards,
>> Vibhor
>> From: Prem Sahoo 
>> Date: Tuesday, 21 May 2024 at 8:16 AM
>> To: Spark dev list 
>> Subject: EXT: Dual Write to HDFS and MinIO in faster way
>> 
>> EXTERNAL: Report suspicious emails to Email Abuse.
>> 
>> Hello Team,
>> I am planning to write to two datasources at the same time. 
>>  
>> Scenario:-
>>  
>> Writing the same dataframe to HDFS and MinIO without re-executing the 
>> transformations and without cache(). Then how can we make it faster?
>>  
>> Read the parquet file and do a few transformations and write to HDFS and 
>> MinIO.
>>  
>> Here, for both writes Spark needs to execute the transformations again. Do we 
>> know how we can avoid re-executing the transformations without cache()/persist()?
>>  
>> Scenario2 :-
>> I am writing 3.2G data to HDFS and MinIO which takes ~6mins.
>> Do we have any way to make writing this faster ?
>>  
>> I don't want to repartition before writing, as repartitioning has the overhead 
>> of shuffling.
>>  
>> Please provide some inputs. 
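
For reference, a minimal sketch of the persist-based approach suggested above. The 
paths, column, and transformation below are placeholders; the point is that 
DISK_ONLY persistence lets both writes reuse one materialized result without 
keeping it in executor memory.

from pyspark import StorageLevel
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical input path and transformation.
df = spark.read.parquet("hdfs:///data/input")
df = df.withColumn("amount_usd", F.col("amount") * 1.1)

# Materialize the transformed result once, spilling only to local disk.
df.persist(StorageLevel.DISK_ONLY)

df.write.mode("overwrite").parquet("hdfs:///data/output")        # triggers the job and fills the cache
df.write.mode("overwrite").parquet("s3a://minio-bucket/output")  # reuses the persisted data

df.unpersist()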



Re: [DISCUSS] Spark - How to improve our release processes

2024-05-12 Thread Nicholas Chammas
Re: unification

We also have a long-standing problem with how we manage Python dependencies, 
something I’ve tried (unsuccessfully) to fix in the past.

Consider, for example, how many separate places this numpy dependency is 
installed:

1. 
https://github.com/apache/spark/blob/9a2818820f11f9bdcc042f4ab80850918911c68c/.github/workflows/build_and_test.yml#L277
2. 
https://github.com/apache/spark/blob/9a2818820f11f9bdcc042f4ab80850918911c68c/.github/workflows/build_and_test.yml#L733
3. 
https://github.com/apache/spark/blob/9a2818820f11f9bdcc042f4ab80850918911c68c/.github/workflows/build_and_test.yml#L853
4. 
https://github.com/apache/spark/blob/9a2818820f11f9bdcc042f4ab80850918911c68c/.github/workflows/build_and_test.yml#L871
5. 
https://github.com/apache/spark/blob/8094535973f19e9f0543535a97254e8ebffc1b23/.github/workflows/build_python_connect35.yml#L70
6. 
https://github.com/apache/spark/blob/553e1b85c42a60c082d33f7b9df53b0495893286/.github/workflows/maven_test.yml#L181
7. 
https://github.com/apache/spark/blob/6e5d1db9058de62a45f35d3f41e028a72f688b70/dev/requirements.txt#L5
8. 
https://github.com/apache/spark/blob/678aeb7ef7086bd962df7ac6d1c5f39151a0515b/dev/run-pip-tests#L90
9. 
https://github.com/apache/spark/blob/678aeb7ef7086bd962df7ac6d1c5f39151a0515b/dev/run-pip-tests#L99
10. 
https://github.com/apache/spark/blob/9a2818820f11f9bdcc042f4ab80850918911c68c/dev/create-release/spark-rm/Dockerfile#L40
11. 
https://github.com/apache/spark/blob/9a42610d5ad8ae0ded92fb68c7617861cfe975e1/dev/infra/Dockerfile#L89
12. 
https://github.com/apache/spark/blob/9a42610d5ad8ae0ded92fb68c7617861cfe975e1/dev/infra/Dockerfile#L92

None of those installations reference a unified version requirement, so 
naturally they are inconsistent across all these different lines. Some say 
`>=1.21`, others say `>=1.20.0`, and still others say `==1.20.3`. In several 
cases there is no version requirement specified at all.

I’m interested in trying again to fix this problem, but it needs to be in 
collaboration with a committer since I cannot fully test the release scripts. 
(This testing gap is what doomed my last attempt at fixing this problem.)

Nick


> On May 13, 2024, at 12:18 AM, Wenchen Fan  wrote:
> 
> After finishing the 4.0.0-preview1 RC1, I have more experience with this 
> topic now.
> 
> In fact, the main job of the release process (building packages and documents) 
> is tested in Github Action jobs. However, the way we test them there is 
> different from what we do in the release scripts.
> 
> 1. the execution environment is different:
> The release scripts define the execution environment with this Dockerfile: 
> https://github.com/apache/spark/blob/master/dev/create-release/spark-rm/Dockerfile
> However, Github Action jobs use a different Dockerfile: 
> https://github.com/apache/spark/blob/master/dev/infra/Dockerfile
> We should figure out a way to unify it. The docker image for the release 
> process needs to set up more things so it may not be viable to use a single 
> Dockerfile for both.
> 
> 2. the execution code is different. Use building documents as an example:
> The release scripts: 
> https://github.com/apache/spark/blob/master/dev/create-release/release-build.sh#L404-L411
> The Github Action job: 
> https://github.com/apache/spark/blob/master/.github/workflows/build_and_test.yml#L883-L895
> I don't know which one is more correct, but we should definitely unify them.
> 
> It's better if we can run the release scripts as Github Action jobs, but I 
> think it's more important to do the unification now.
> 
> Thanks,
> Wenchen
> 
> 
> On Fri, May 10, 2024 at 12:34 AM Hussein Awala wrote:
>> Hello,
>> 
>> I can answer some of your common questions with other Apache projects.
>> 
>> > Who currently has permissions for Github actions? Is there a specific 
>> > owner for that today or a different volunteer each time?
>> 
>> The Apache organization owns Github Actions, and committers (contributors 
>> with write permissions) can retrigger/cancel a Github Actions workflow, but 
>> Github Actions runners are managed by the Apache infra team.
>> 
>> > What are the current limits of GitHub Actions, who set them - and what is 
>> > the process to change those (if possible at all, but I presume not all 
>> > Apache projects have the same limits)?
>> 
>> For limits, I don't think there is any significant limit, especially since 
>> the Apache organization has 900 donated runners used by its projects, and 
>> there is an initiative from the Infra team to add self-hosted runners 
>> running on Kubernetes (document).
>> 
>> > Where should the artifacts be stored?
>> 
>> Usually, we use Maven for jars, DockerHub for Docker images, and Github 
>> cache for workflow cache. But we can use Github artifacts to store any kind 
>> of package (even Docker images in the 

Re: [DISCUSS] SPARK-44444: Use ANSI SQL mode by default

2024-04-12 Thread Nicholas Chammas
This is a side issue, but I’d like to bring people’s attention to SPARK-28024. 

Cases 2, 3, and 4 described in that ticket are still problems today on master 
(I just rechecked) even with ANSI mode enabled.

Well, maybe not problems, but I’m flagging this since Spark’s behavior differs 
in these cases from Postgres, as described in the ticket.


> On Apr 12, 2024, at 12:09 AM, Gengliang Wang  wrote:
> 
> 
> +1, enabling Spark's ANSI SQL mode in version 4.0 will significantly enhance 
> data quality and integrity. I fully support this initiative.
> 
> > In other words, the current Spark ANSI SQL implementation becomes the first 
> > implementation for Spark SQL users to face at first while providing
> > `spark.sql.ansi.enabled=false` in the same way without losing any capability.
> 
> BTW, the try_* functions and the SQL Error Attribution Framework will also be 
> beneficial in migrating to ANSI SQL mode.
> 
> 
> Gengliang
> 
> 
> On Thu, Apr 11, 2024 at 7:56 PM Dongjoon Hyun wrote:
>> Hi, All.
>> 
>> Thanks to you, we've been achieving many things and have on-going SPIPs.
>> I believe it's time to scope Apache Spark 4.0.0 (SPARK-44111) more narrowly
>> by asking your opinions about Apache Spark's ANSI SQL mode.
>> 
>> https://issues.apache.org/jira/browse/SPARK-44111
>> Prepare Apache Spark 4.0.0
>> 
>> SPARK-44444 was proposed last year (on 15/Jul/23) as one of the desirable
>> items for 4.0.0 because it's a big behavior change.
>> 
>> https://issues.apache.org/jira/browse/SPARK-44444
>> Use ANSI SQL mode by default
>> 
>> Historically, spark.sql.ansi.enabled was added at Apache Spark 3.0.0 and has
>> been aiming to provide a better Spark SQL compatibility in a standard way.
>> We also have a daily CI to protect the behavior too.
>> 
>> https://github.com/apache/spark/actions/workflows/build_ansi.yml
>> 
>> However, it's still behind the configuration with several known issues, e.g.,
>> 
>> SPARK-41794 Reenable ANSI mode in test_connect_column
>> SPARK-41547 Reenable ANSI mode in test_connect_functions
>> SPARK-46374 Array Indexing is 1-based via ANSI SQL Standard
>> 
>> To be clear, we know that many DBMSes have their own implementations of
>> SQL standard and not the same. Like them, SPARK-44444 aims to enable
>> only the existing Spark's configuration, `spark.sql.ansi.enabled=true`.
>> There is nothing more than that.
>> 
>> In other words, the current Spark ANSI SQL implementation becomes the first
>> implementation for Spark SQL users to face at first while providing
>> `spark.sql.ansi.enabled=false` in the same way without losing any capability.
>> 
>> If we don't want this change for some reasons, we can simply exclude
>> SPARK-44444 from SPARK-44111 as a part of Apache Spark 4.0.0 preparation.
>> It's time just to make a go/no-go decision for this item for the global 
>> optimization
>> for Apache Spark 4.0.0 release. After 4.0.0, it's unlikely for us to aim
>> for this again for the next four years until 2028.
>> 
>> WDYT?
>> 
>> Bests,
>> Dongjoon
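
For readers unfamiliar with the flag under discussion, here is a minimal sketch of 
the behavior difference that flipping the default would surface, assuming a 
running `spark` session as in a PySpark shell (the exact error class varies by 
version):

spark.conf.set("spark.sql.ansi.enabled", "false")
spark.sql("SELECT CAST('abc' AS INT)").show()       # legacy behavior: returns NULL

spark.conf.set("spark.sql.ansi.enabled", "true")
spark.sql("SELECT CAST('abc' AS INT)").show()       # ANSI mode: fails with a cast error
spark.sql("SELECT TRY_CAST('abc' AS INT)").show()   # try_* keeps the permissive behavior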



Re: Generating config docs automatically

2024-02-22 Thread Nicholas Chammas
Thank you, Holden!

Yes, having everything live in the ConfigEntry is attractive.

The main reason I proposed an alternative where the groups are defined in YAML 
is that if the config groups are defined in ConfigEntry, then altering the 
groupings – which is relevant only to the display of config documentation – 
requires rebuilding Spark. This feels a bit off to me in terms of design.

For example, on the SQL performance tuning page there is some narrative 
documentation about caching 
<https://spark.apache.org/docs/3.5.0/sql-performance-tuning.html#caching-data-in-memory>,
 plus a table of relevant configs. If I want an additional config to show up in 
this table, I need to add it to the config group that backs the table.

With the ConfigEntry approach in #44755 
<https://github.com/apache/spark/pull/44755>, that means editing the 
appropriate ConfigEntry and rebuilding Spark before I can regenerate the config 
table.

val SOME_CONFIG = buildConf("spark.sql.someCachingRelatedConfig")
  .doc("some documentation")
  .version("2.1.0")
  .withDocumentationGroup("sql-tuning-caching-data")  // assign group to the config
With the YAML approach in #44756 <https://github.com/apache/spark/pull/44756>, 
that means editing the config group defined in the YAML file and regenerating 
the config table. No Spark rebuild required.

sql-tuning-caching-data:
- spark.sql.inMemoryColumnarStorage.compressed
- spark.sql.inMemoryColumnarStorage.batchSize
- spark.sql.someCachingRelatedConfig  # add config to the group
In both cases the config names, descriptions, defaults, etc. will be pulled 
from the ConfigEntry when building the HTML tables.

I prefer the latter approach but I’m open to whatever committers are more 
comfortable with. If you prefer the former, then I’ll focus on that and ping 
you for reviews accordingly!


> On Feb 21, 2024, at 11:43 AM, Holden Karau  wrote:
> 
> I think this is a good idea. I like having everything in one source of truth 
> rather than two (so option 1 sounds like a good idea); but that’s just my 
> opinion. I'd be happy to help with reviews though.
> 
> On Wed, Feb 21, 2024 at 6:37 AM Nicholas Chammas <nicholas.cham...@gmail.com> wrote:
>> I know config documentation is not the most exciting thing. If there is 
>> anything I can do to make this as easy as possible for a committer to 
>> shepherd, I’m all ears!
>> 
>> 
>>> On Feb 14, 2024, at 8:53 PM, Nicholas Chammas <nicholas.cham...@gmail.com> wrote:
>>> 
>>> I’m interested in automating our config documentation and need input from a 
>>> committer who is interested in shepherding this work.
>>> 
>>> We have around 60 tables of configs across our documentation. Here’s a 
>>> typical example. 
>>> <https://github.com/apache/spark/blob/736d8ab3f00e7c5ba1b01c22f6398b636b8492ea/docs/sql-performance-tuning.md?plain=1#L65-L159>
>>> 
>>> These tables span several thousand lines of manually maintained HTML, which 
>>> poses a few problems:
>>> The documentation for a given config is sometimes out of sync across the 
>>> HTML table and its source `ConfigEntry`.
>>> Internal configs that are not supposed to be documented publicly sometimes 
>>> are.
>>> Many config names and defaults are extremely long, posing formatting 
>>> problems.
>>> 
>>> Contributors waste time dealing with these issues in a losing battle to 
>>> keep everything up-to-date and consistent.
>>> 
>>> I’d like to solve all these problems by generating HTML tables 
>>> automatically from the `ConfigEntry` instances where the configs are 
>>> defined.
>>> 
>>> I’ve proposed two alternative solutions:
>>> #44755 <https://github.com/apache/spark/pull/44755>: Enhance `ConfigEntry` 
>>> so a config can be associated with one or more groups, and use that new 
>>> metadata to generate the tables we need.
>>> #44756 <https://github.com/apache/spark/pull/44756>: Add a standalone YAML 
>>> file where we define config groups, and use that to generate the tables we 
>>> need.
>>> 
>>> If you’re a committer and are interested in this problem, please chime in 
>>> on whatever approach appeals to you. If you think this is a bad idea, I’m 
>>> also eager to hear your feedback.
>>> 
>>> Nick
>>> 
> 
> 



Re: Generating config docs automatically

2024-02-21 Thread Nicholas Chammas
I know config documentation is not the most exciting thing. If there is 
anything I can do to make this as easy as possible for a committer to shepherd, 
I’m all ears!


> On Feb 14, 2024, at 8:53 PM, Nicholas Chammas  
> wrote:
> 
> I’m interested in automating our config documentation and need input from a 
> committer who is interested in shepherding this work.
> 
> We have around 60 tables of configs across our documentation. Here’s a 
> typical example. 
> <https://github.com/apache/spark/blob/736d8ab3f00e7c5ba1b01c22f6398b636b8492ea/docs/sql-performance-tuning.md?plain=1#L65-L159>
> 
> These tables span several thousand lines of manually maintained HTML, which 
> poses a few problems:
> The documentation for a given config is sometimes out of sync across the HTML 
> table and its source `ConfigEntry`.
> Internal configs that are not supposed to be documented publicly sometimes 
> are.
> Many config names and defaults are extremely long, posing formatting problems.
> 
> Contributors waste time dealing with these issues in a losing battle to keep 
> everything up-to-date and consistent.
> 
> I’d like to solve all these problems by generating HTML tables automatically 
> from the `ConfigEntry` instances where the configs are defined.
> 
> I’ve proposed two alternative solutions:
> #44755 <https://github.com/apache/spark/pull/44755>: Enhance `ConfigEntry` so 
> a config can be associated with one or more groups, and use that new metadata 
> to generate the tables we need.
> #44756 <https://github.com/apache/spark/pull/44756>: Add a standalone YAML 
> file where we define config groups, and use that to generate the tables we 
> need.
> 
> If you’re a committer and are interested in this problem, please chime in on 
> whatever approach appeals to you. If you think this is a bad idea, I’m also 
> eager to hear your feedback.
> 
> Nick
> 



Generating config docs automatically

2024-02-14 Thread Nicholas Chammas
I’m interested in automating our config documentation and need input from a 
committer who is interested in shepherding this work.

We have around 60 tables of configs across our documentation. Here’s a typical 
example. 


These tables span several thousand lines of manually maintained HTML, which 
poses a few problems:
- The documentation for a given config is sometimes out of sync across the HTML 
  table and its source `ConfigEntry`.
- Internal configs that are not supposed to be documented publicly sometimes are.
- Many config names and defaults are extremely long, posing formatting problems.

Contributors waste time dealing with these issues in a losing battle to keep 
everything up-to-date and consistent.

I’d like to solve all these problems by generating HTML tables automatically 
from the `ConfigEntry` instances where the configs are defined.

I’ve proposed two alternative solutions:
#44755: Enhance `ConfigEntry` so a 
config can be associated with one or more groups, and use that new metadata to 
generate the tables we need.
#44756: Add a standalone YAML file 
where we define config groups, and use that to generate the tables we need.

If you’re a committer and are interested in this problem, please chime in on 
whatever approach appeals to you. If you think this is a bad idea, I’m also 
eager to hear your feedback.

Nick



Re: How do you debug a code-generated aggregate?

2024-02-12 Thread Nicholas Chammas
OK, I figured it out. The details are in SPARK-47024 
<https://issues.apache.org/jira/browse/SPARK-47024> for anyone who’s interested.

It turned out to be a floating point arithmetic “bug”. The main reason I was 
able to figure it out was because I’ve been investigating another, unrelated 
bug (a real bug) related to floats, so these weird float corner cases have been 
top of mind.

If it weren't for that, I wonder how much progress I would have made. Though I 
could inspect the generated code, I couldn’t figure out how to get logging 
statements placed in the generated code to print somewhere I could see them.

Depending on how often we find ourselves debugging aggregates like this, it 
would be really helpful if we added some way to trace the aggregation buffer.

In any case, mystery solved. Thank you for the pointer!
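
For anyone landing on this thread later, a minimal sketch of the pointer in 
question, using the example from the original message:

from pyspark.sql import SparkSession
from pyspark.sql.functions import sum

spark = SparkSession.builder.getOrCreate()

df = spark.range(4).repartition(2).select(sum("id"))
df.explain(mode="codegen")   # dumps the whole-stage generated Java code for inspection
df.show()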


> On Feb 12, 2024, at 8:39 AM, Herman van Hovell  wrote:
> 
> There is no really easy way of getting the state of the aggregation buffer, 
> unless you are willing to modify the code generation and sprinkle in some 
> logging.
> 
> What I would start with is dumping the generated code by calling 
> explain('codegen') on the DataFrame. That helped me to find similar issues in 
> most cases.
> 
> HTH
> 
> On Sun, Feb 11, 2024 at 11:26 PM Nicholas Chammas <nicholas.cham...@gmail.com> wrote:
>> Consider this example:
>> >>> from pyspark.sql.functions import sum
>> >>> spark.range(4).repartition(2).select(sum("id")).show()
>> +-------+
>> |sum(id)|
>> +-------+
>> |      6|
>> +-------+
>> 
>> I’m trying to understand how this works because I’m investigating a bug in 
>> this kind of aggregate.
>> 
>> I see that doProduceWithoutKeys 
>> <https://github.com/apache/spark/blob/d02fbba6491fd17dc6bfc1a416971af7544952f3/sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/AggregateCodegenSupport.scala#L98>
>>  and doConsumeWithoutKeys 
>> <https://github.com/apache/spark/blob/d02fbba6491fd17dc6bfc1a416971af7544952f3/sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/AggregateCodegenSupport.scala#L193>
>>  are called, and I believe they are responsible for computing a declarative 
>> aggregate like `sum`. But I’m not sure how I would debug the generated code, 
>> or the inputs that drive what code gets generated.
>> 
>> Say you were running the above example and it was producing an incorrect 
>> result, and you knew the problem was somehow related to the sum. How would 
>> you troubleshoot it to identify the root cause?
>> 
>> Ideally, I would like some way to track how the aggregation buffer mutates 
>> as the computation is executed, so I can see something roughly like:
>> [0, 1, 2, 3]
>> [1, 5]
>> [6]
>> 
>> Is there some way to trace a declarative aggregate like this?
>> 
>> Nick
>> 



How do you debug a code-generated aggregate?

2024-02-11 Thread Nicholas Chammas
Consider this example:
>>> from pyspark.sql.functions import sum
>>> spark.range(4).repartition(2).select(sum("id")).show()
+-------+
|sum(id)|
+-------+
|      6|
+-------+
I’m trying to understand how this works because I’m investigating a bug in this 
kind of aggregate.

I see that doProduceWithoutKeys and doConsumeWithoutKeys are called, and I believe 
they are responsible for computing a declarative aggregate like `sum`. But I’m not 
sure how I would debug the generated code, or the inputs that drive what code gets 
generated.

Say you were running the above example and it was producing an incorrect 
result, and you knew the problem was somehow related to the sum. How would you 
troubleshoot it to identify the root cause?

Ideally, I would like some way to track how the aggregation buffer mutates as 
the computation is executed, so I can see something roughly like:
[0, 1, 2, 3]
[1, 5]
[6]
Is there some way to trace a declarative aggregate like this?

Nick



Re: Removing Kinesis in Spark 4

2024-01-20 Thread Nicholas Chammas
Oh, that’s a very interesting dashboard. I was familiar with the Matomo snippet 
but never looked up where exactly those metrics were going.

I see that the Kinesis docs do indeed have around 650 views in the past month, 
but for Kafka I see 11K and 1.3K views for the Structured Streaming and DStream 
docs, respectively. Big difference there, though maybe that's because Kinesis 
doesn’t have docs for structured streaming. Hard to tell.





These statistics also raise questions about the future of the R API, though 
that’s a topic for another thread.



Nick


> On Jan 20, 2024, at 1:05 PM, Sean Owen  wrote:
> 
> I'm not aware of much usage. but that doesn't mean a lot.
> 
> FWIW, in the past month or so, the Kinesis docs page got about 700 views, 
> compared to about 1400 for Kafka
> https://analytics.apache.org/index.php?module=CoreHome=index=yesterday=day=40#?idSite=40=range=2023-12-15,2024-01-20=General_Actions=Actions_SubmenuPageTitles
> 
> Those are "low" in general, compared to the views for streaming pages, which 
> got tens of thousands of views.
> 
> I do feel like it's unmaintained, and do feel like it might be a stretch to 
> leave it lying around until Spark 5.
> It's not exactly unused though.
> 
> I would not object to removing it unless there is some voice of support here.
> 
> On Sat, Jan 20, 2024 at 10:38 AM Nicholas Chammas <nicholas.cham...@gmail.com> wrote:
>> From the dev thread: What else could be removed in Spark 4? 
>> <https://lists.apache.org/thread/shxj7qmrtqbxqf85lrlsv6510892ktnz>
>>> On Aug 17, 2023, at 1:44 AM, Yang Jie <yangji...@apache.org> wrote:
>>> 
>>> I would like to know how we should handle the two Kinesis-related modules 
>>> in Spark 4.0. They have a very low frequency of code updates, and because 
>>> the corresponding tests are not continuously executed in any GitHub Actions 
>>> pipeline, I think they significantly lack quality assurance. On top of 
>>> that, I am not certain whether the test cases, which require AWS credentials 
>>> in these modules, get verified during each Spark version release.
>> 
>> Did we ever reach a decision about removing Kinesis in Spark 4?
>> 
>> I was cleaning up some docs related to Kinesis and came across a reference 
>> to some Java API docs that I could not find 
>> <https://github.com/apache/spark/pull/44802#discussion_r1459337001>. And 
>> looking around I came across both this email thread and this thread on JIRA 
>> <https://issues.apache.org/jira/browse/SPARK-45720?focusedCommentId=17787227=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17787227>
>>  about potentially removing Kinesis.
>> 
>> But as far as I can tell we haven’t made a clear decision one way or the 
>> other.
>> 
>> Nick
>> 



Removing Kinesis in Spark 4

2024-01-20 Thread Nicholas Chammas
From the dev thread: What else could be removed in Spark 4? 

> On Aug 17, 2023, at 1:44 AM, Yang Jie  wrote:
> 
> I would like to know how we should handle the two Kinesis-related modules in 
> Spark 4.0. They have a very low frequency of code updates, and because the 
> corresponding tests are not continuously executed in any GitHub Actions 
> pipeline, I think they significantly lack quality assurance. On top of that, 
> I am not certain whether the test cases, which require AWS credentials in 
> these modules, get verified during each Spark version release.

Did we ever reach a decision about removing Kinesis in Spark 4?

I was cleaning up some docs related to Kinesis and came across a reference to 
some Java API docs that I could not find 
. And 
looking around I came across both this email thread and this thread on JIRA 

 about potentially removing Kinesis.

But as far as I can tell we haven’t made a clear decision one way or the other.

Nick



Install Ruby 3 to build the docs

2024-01-10 Thread Nicholas Chammas
Just a quick heads up that, while Ruby 2.7 will continue to work, you should 
plan to install Ruby 3 in the near future in order to build the docs. (I 
recommend using rbenv  to manage multiple Ruby 
versions.)

Ruby 2 reached EOL in March 2023. We will be unable to upgrade some of our Ruby 
dependencies to their latest versions until we are using Ruby 3.

This is not a problem today but will likely become a problem in the near future.

For more details, please refer to this pull request.

Best,
Nick



Re: Validate spark sql

2023-12-24 Thread Nicholas Chammas
This is a user-list question, not a dev-list question. Moving this conversation 
to the user list and BCC-ing the dev list.

Also, this statement

> We are not validating against table or column existence.

is not correct. When you call spark.sql(…), Spark will look up the table 
references and fail with TABLE_OR_VIEW_NOT_FOUND if it cannot find them.

Also, when you run DDL via spark.sql(…), Spark will actually run it. So 
spark.sql(“drop table my_table”) will actually drop my_table. It’s not a 
validation-only operation.
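
A minimal sketch of the distinction, assuming a running `spark` session (the table 
names below are made up):

from pyspark.sql.utils import AnalysisException, ParseException

try:
    spark.sql("SELECT * FROM")   # incomplete SQL: fails at parse time
except ParseException as e:
    print("syntax error:", e)

try:
    spark.sql("SELECT * FROM table_that_does_not_exist")   # parses fine, fails analysis
except AnalysisException as e:
    print("analysis error:", e)   # TABLE_OR_VIEW_NOT_FOUND

# DDL, by contrast, is executed immediately rather than merely validated:
spark.sql("DROP TABLE IF EXISTS my_table")   # really drops my_table if it exists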

This question of validating SQL is already discussed on Stack Overflow. You may 
find some useful tips there.

Nick


> On Dec 24, 2023, at 4:52 AM, Mich Talebzadeh  
> wrote:
> 
>   
> Yes, you can validate the syntax of your PySpark SQL queries without 
> connecting to an actual dataset or running the queries on a cluster.
> PySpark provides a method for syntax validation without executing the query. 
> Something like below
>       ____              __
>      / __/__  ___ _____/ /__
>     _\ \/ _ \/ _ `/ __/  '_/
>    /__ / .__/\_,_/_/ /_/\_\   version 3.4.0
>       /_/
> 
> Using Python version 3.9.16 (main, Apr 24 2023 10:36:11)
> Spark context Web UI available at http://rhes75:4040 
> Spark context available as 'sc' (master = local[*], app id = 
> local-1703410019374).
> SparkSession available as 'spark'.
> >>> from pyspark.sql import SparkSession
> >>> spark = SparkSession.builder.appName("validate").getOrCreate()
> 23/12/24 09:28:02 WARN SparkSession: Using an existing Spark session; only 
> runtime SQL configurations will take effect.
> >>> sql = "SELECT * FROM  WHERE  = some value"
> >>> try:
> ...   spark.sql(sql)
> ...   print("is working")
> ... except Exception as e:
> ...   print(f"Syntax error: {e}")
> ...
> Syntax error:
> [PARSE_SYNTAX_ERROR] Syntax error at or near '<'.(line 1, pos 14)
> 
> == SQL ==
> SELECT * FROM  WHERE  = some value
> --^^^
> 
> Here we only check for syntax errors, not query semantics. We are not 
> validating against table or column existence.
> 
> This method is useful when you want to catch obvious syntax errors before 
> submitting your PySpark job to a cluster, especially when you don't have 
> access to the actual data.
> In summary
> This method validates syntax but will not catch semantic errors
> If you need more comprehensive validation, consider using a testing framework 
> and a small dataset.
> For complex queries, using a linter or code analysis tool can help identify 
> potential issues.
> HTH
> 
> Mich Talebzadeh,
> Dad | Technologist | Solutions Architect | Engineer
> London
> United Kingdom
> 
>view my Linkedin profile 
> 
> 
>  https://en.everybodywiki.com/Mich_Talebzadeh
> 
>  
> Disclaimer: Use it at your own risk. Any and all responsibility for any loss, 
> damage or destruction of data or any other property which may arise from 
> relying on this email's technical content is explicitly disclaimed. The 
> author will in no case be liable for any monetary damages arising from such 
> loss, damage or destruction.
>  
> 
> 
> On Sun, 24 Dec 2023 at 07:57, ram manickam wrote:
>> Hello,
>> Is there a way to validate PySpark SQL for syntax errors only? I cannot 
>> connect to the actual data set to perform this validation. Any help 
>> would be appreciated.
>> 
>> 
>> Thanks
>> Ram



Guidance for filling out "Affects Version" on Jira

2023-12-17 Thread Nicholas Chammas
The Contributing guide only mentions what to fill in for “Affects Version” for 
bugs. How about for improvements?

This question once caused some problems when I set “Affects Version” to the 
last released version, and that was interpreted as a request to backport an 
improvement, which was not my intention and caused a minor kerfuffle.

Could we provide some clarity on the Contributing guide on how to fill in this 
field for improvements, especially since “Affects Version” is required on all 
Jira tickets?

Nick



Re: When and how does Spark use metastore statistics?

2023-12-11 Thread Nicholas Chammas
Where exactly are you getting this information from?

As far as I can tell, spark.sql.cbo.enabled has defaulted to false since it was 
introduced 7 years ago 
<https://github.com/apache/spark/commit/ae83c211257c508989c703d54f2aeec8b2b5f14d#diff-9ed2b0b7829b91eafb43e040a15247c90384e42fea1046864199fbad77527bb5R649>.
 It has never been enabled by default.

And I cannot see mention of spark.sql.cbo.strategy anywhere at all in the code 
base.

So again, where is this information coming from? Please link directly to your 
source.



> On Dec 11, 2023, at 5:45 PM, Mich Talebzadeh  
> wrote:
> 
> You are right. By default CBO is not enabled. Whilst the CBO was the default 
> optimizer in earlier versions of Spark, it has been replaced by the AQE in 
> recent releases.
> 
> spark.sql.cbo.strategy
> 
> As I understand, The spark.sql.cbo.strategy configuration property specifies 
> the optimizer strategy used by Spark SQL to generate query execution plans. 
> There are two main optimizer strategies available:
> CBO (Cost-Based Optimization): The default optimizer strategy, which analyzes 
> the query plan and estimates the execution costs associated with each 
> operation. It uses statistics to guide its decisions, selecting the plan with 
> the lowest estimated cost.
> 
> CBO-Like (Cost-Based Optimization-Like): A simplified optimizer strategy that 
> mimics some of the CBO's logic, but without the ability to estimate costs. 
> This strategy is faster than CBO for simple queries, but may not produce the 
> most efficient plan for complex queries.
> 
> The spark.sql.cbo.strategy property can be set to either CBO or CBO-Like. The 
> default value is AUTO, which means that Spark will automatically choose the 
> most appropriate strategy based on the complexity of the query and 
> availability of statistic
> 
> 
> 
> Mich Talebzadeh,
> Distinguished Technologist, Solutions Architect & Engineer
> London
> United Kingdom
> 
>view my Linkedin profile 
> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
> 
>  https://en.everybodywiki.com/Mich_Talebzadeh
> 
>  
> Disclaimer: Use it at your own risk. Any and all responsibility for any loss, 
> damage or destruction of data or any other property which may arise from 
> relying on this email's technical content is explicitly disclaimed. The 
> author will in no case be liable for any monetary damages arising from such 
> loss, damage or destruction.
>  
> 
> 
> On Mon, 11 Dec 2023 at 17:11, Nicholas Chammas <nicholas.cham...@gmail.com> wrote:
>> 
>>> On Dec 11, 2023, at 6:40 AM, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>>> 
>>> By default, the CBO is enabled in Spark.
>> 
>> Note that this is not correct. AQE is enabled 
>> <https://github.com/apache/spark/blob/8235f1d56bf232bb713fe24ff6f2ffdaf49d2fcc/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala#L664-L669>
>>  by default, but CBO isn’t 
>> <https://github.com/apache/spark/blob/8235f1d56bf232bb713fe24ff6f2ffdaf49d2fcc/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala#L2694-L2699>.



Re: When and how does Spark use metastore statistics?

2023-12-11 Thread Nicholas Chammas

> On Dec 11, 2023, at 6:40 AM, Mich Talebzadeh  
> wrote:
> spark.sql.cbo.strategy: Set to AUTO to use the CBO as the default optimizer, 
> or NONE to disable it completely.
> 
Hmm, I’ve also never heard of this setting before and can’t seem to find it in 
the Spark docs or source code.

Re: When and how does Spark use metastore statistics?

2023-12-11 Thread Nicholas Chammas

> On Dec 11, 2023, at 6:40 AM, Mich Talebzadeh  
> wrote:
> 
> By default, the CBO is enabled in Spark.

Note that this is not correct. AQE is enabled by default, but CBO isn’t.

Re: When and how does Spark use metastore statistics?

2023-12-10 Thread Nicholas Chammas
I’ve done some reading and have a slightly better understanding of statistics 
now.

Every implementation of LeafNode.computeStats 
<https://github.com/apache/spark/blob/7cea52c96f5be1bc565a033bfd77370ab5527a35/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/LogicalPlan.scala#L210>
 offers its own way to get statistics:

LocalRelation 
<https://github.com/apache/spark/blob/8ff6b7a04cbaef9c552789ad5550ceab760cb078/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/LocalRelation.scala#L97>
 estimates the size of the relation directly from the row count.
HiveTableRelation 
<https://github.com/apache/spark/blob/8e95929ac4238d02dca379837ccf2fbc1cd1926d/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/interface.scala#L923-L929>
 pulls those statistics from the catalog or metastore.
DataSourceV2Relation 
<https://github.com/apache/spark/blob/5fec76dc8db2499b0a9d76231f9a250871d59658/sql/catalyst/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DataSourceV2Relation.scala#L100>
 delegates the job of computing statistics to the underlying data source.
There are a lot of details I’m still fuzzy on, but I think that’s the gist of 
things.

Would it make sense to add a paragraph or two to the SQL performance tuning 
page <https://spark.apache.org/docs/latest/sql-performance-tuning.html> 
covering statistics at a high level? Something that briefly explains:

- what statistics are and how Spark uses them to optimize plans
- the various ways Spark computes or loads statistics (catalog, data source, 
  runtime, etc.)
- how to gather catalog statistics (i.e. a pointer to ANALYZE TABLE)
- how to check statistics on an object (i.e. DESCRIBE EXTENDED) and as part of an 
  optimized plan (i.e. .explain(mode="cost"))
- what the cost-based optimizer does and how to enable it
Would this be a welcome addition to the project’s documentation? I’m happy to 
work on this.
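
For concreteness, a rough sketch of the kind of example such a section could walk 
through (the table `t` and its schema are hypothetical):

spark.sql("CREATE TABLE t (id INT, name STRING) USING parquet")

# Gather catalog statistics (stored in the metastore):
spark.sql("ANALYZE TABLE t COMPUTE STATISTICS FOR ALL COLUMNS")

# Check them: table-level stats appear under 'Statistics';
# column-level stats come from DESCRIBE EXTENDED with a column name.
spark.sql("DESCRIBE EXTENDED t").show(truncate=False)
spark.sql("DESCRIBE EXTENDED t id").show(truncate=False)

# See how they feed into the optimized plan, and opt in to the CBO:
spark.conf.set("spark.sql.cbo.enabled", "true")
spark.table("t").join(spark.table("t"), "id").explain(mode="cost")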



> On Dec 5, 2023, at 12:12 PM, Nicholas Chammas  
> wrote:
> 
> I’m interested in improving some of the documentation relating to the table 
> and column statistics that get stored in the metastore, and how Spark uses 
> them.
> 
> But I’m not clear on a few things, so I’m writing to you with some questions.
> 
> 1. The documentation for spark.sql.autoBroadcastJoinThreshold 
> <https://spark.apache.org/docs/latest/sql-performance-tuning.html> implies 
> that it depends on table statistics to work, but it’s not clear. Is it 
> accurate to say that unless you have run ANALYZE on the tables participating 
> in a join, spark.sql.autoBroadcastJoinThreshold cannot impact the execution 
> plan?
> 
> 2. As a follow-on to the above question, the adaptive version of 
> autoBroadcastJoinThreshold, namely 
> spark.sql.adaptive.autoBroadcastJoinThreshold, may still kick in, because it 
> depends only on runtime statistics and not statistics in the metastore. Is 
> that correct? I am assuming that “runtime statistics” are gathered on the fly 
> by Spark, but I would like to mention this in the docs briefly somewhere.
> 
> 3. The documentation for spark.sql.inMemoryColumnarStorage.compressed 
> <https://spark.apache.org/docs/latest/sql-performance-tuning.html> mentions 
> “statistics”, but it’s not clear what kind of statistics we’re talking about. 
> Are those runtime statistics, metastore statistics (that depend on you 
> running ANALYZE), or something else?
> 
> 4. The documentation for ANALYZE TABLE 
> <https://spark.apache.org/docs/latest/sql-ref-syntax-aux-analyze-table.html> 
> states that the collected statistics help the optimizer "find a better query 
> execution plan”. I wish we could link to something from here with more 
> explanation. Currently, spark.sql.autoBroadcastJoinThreshold is the only 
> place where metastore statistics are explicitly referenced as impacting the 
> execution plan. Surely there must be other places, no? Would it be 
> appropriate to mention the cost-based optimizer framework 
> <https://issues.apache.org/jira/browse/SPARK-16026> somehow? It doesn’t 
> appear to have any public documentation outside of Jira.
> 
> Any pointers or information you can provide would be very helpful. Again, I 
> am interested in contributing some documentation improvements relating to 
> statistics, but there is a lot I’m not sure about.
> 
> Nick
> 



Re: Algolia search on website is broken

2023-12-10 Thread Nicholas Chammas
Pinging Gengliang and Xiao about this, per these docs 
<https://github.com/apache/spark-website/blob/0ceaaaf528ec1d0201e1eab1288f37cce607268b/release-process.md#update-the-configuration-of-algolia-crawler>.

It looks like to fix this problem you need access to the Algolia Crawler Admin 
Console.


> On Dec 5, 2023, at 11:28 AM, Nicholas Chammas  
> wrote:
> 
> Should I report this instead on Jira? Apologies if the dev list is not the 
> right place.
> 
> Search on the website appears to be broken. For example, here is a search for 
> “analyze”:
> 

> 
> And here is the same search using DDG 
> <https://duckduckgo.com/?q=site:https://spark.apache.org/docs/latest/+analyze=osx=web>.
> 
> Nick
> 



Re: SSH Tunneling issue with Apache Spark

2023-12-06 Thread Nicholas Chammas
This is not a question for the dev list. Moving dev to bcc.

One thing I would try is to connect to this database using JDBC + SSH tunnel, 
but without Spark. That way you can focus on getting the JDBC connection to 
work without Spark complicating the picture for you.


> On Dec 5, 2023, at 8:12 PM, Venkatesan Muniappan 
>  wrote:
> 
> Hi Team,
> 
> I am facing an issue with SSH tunneling in Apache Spark. The behavior is the 
> same as the one in this Stack Overflow question, but there are no answers there.
> 
> This is what I am trying:
> 
> 
> with SSHTunnelForwarder(
>         (ssh_host, ssh_port),
>         ssh_username=ssh_user,
>         ssh_pkey=ssh_key_file,
>         remote_bind_address=(sql_hostname, sql_port),
>         local_bind_address=(local_host_ip_address, sql_port)) as tunnel:
>     tunnel.local_bind_port
>     b1_semester_df = spark.read \
>         .format("jdbc") \
>         .option("url", b2b_mysql_url.replace("<>", str(tunnel.local_bind_port))) \
>         .option("query", b1_semester_sql) \
>         .option("database", 'b2b') \
>         .option("password", b2b_mysql_password) \
>         .option("driver", "com.mysql.cj.jdbc.Driver") \
>         .load()
>     b1_semester_df.count()
> 
> Here, the b1_semester_df is loaded but when I try count on the same Df it 
> fails saying this
> 
> 23/12/05 11:49:17 ERROR TaskSetManager: Task 0 in stage 2.0 failed 4 times; 
> aborting job
> Traceback (most recent call last):
>   File "", line 1, in 
>   File "/usr/lib/spark/python/pyspark/sql/dataframe.py", line 382, in show
> print(self._jdf.showString(n, 20, vertical))
>   File "/usr/lib/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", 
> line 1257, in __call__
>   File "/usr/lib/spark/python/pyspark/sql/utils.py", line 63, in deco
> return f(*a, **kw)
>   File "/usr/lib/spark/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", line 
> 328, in get_return_value
> py4j.protocol.Py4JJavaError: An error occurred while calling o284.showString.
> : org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 
> in stage 2.0 failed 4 times, most recent failure: Lost task 0.3 in stage 2.0 
> (TID 11, ip-172-32-108-1.eu-central-1.compute.internal, executor 3): 
> com.mysql.cj.jdbc.exceptions.CommunicationsException: Communications link 
> failure
> 
> However, the same is working fine with pandas df. I have tried this below and 
> it worked.
> 
> 
> with SSHTunnelForwarder(
>         (ssh_host, ssh_port),
>         ssh_username=ssh_user,
>         ssh_pkey=ssh_key_file,
>         remote_bind_address=(sql_hostname, sql_port)) as tunnel:
>     conn = pymysql.connect(host=local_host_ip_address, user=sql_username,
>                            passwd=sql_password, db=sql_main_database,
>                            port=tunnel.local_bind_port)
>     df = pd.read_sql_query(b1_semester_sql, conn)
>     spark.createDataFrame(df).createOrReplaceTempView("b1_semester")
> 
> So wanted to check what I am missing with my Spark usage. Please help.
> 
> Thanks,
> Venkat
> 



When and how does Spark use metastore statistics?

2023-12-05 Thread Nicholas Chammas
I’m interested in improving some of the documentation relating to the table and 
column statistics that get stored in the metastore, and how Spark uses them.

But I’m not clear on a few things, so I’m writing to you with some questions.

1. The documentation for spark.sql.autoBroadcastJoinThreshold 
 implies that 
it depends on table statistics to work, but it’s not clear. Is it accurate to 
say that unless you have run ANALYZE on the tables participating in a join, 
spark.sql.autoBroadcastJoinThreshold cannot impact the execution plan?

2. As a follow-on to the above question, the adaptive version of 
autoBroadcastJoinThreshold, namely 
spark.sql.adaptive.autoBroadcastJoinThreshold, may still kick in, because it 
depends only on runtime statistics and not statistics in the metastore. Is that 
correct? I am assuming that “runtime statistics” are gathered on the fly by 
Spark, but I would like to mention this in the docs briefly somewhere.

3. The documentation for spark.sql.inMemoryColumnarStorage.compressed 
 mentions 
“statistics”, but it’s not clear what kind of statistics we’re talking about. 
Are those runtime statistics, metastore statistics (that depend on you running 
ANALYZE), or something else?

4. The documentation for ANALYZE TABLE 
 
states that the collected statistics help the optimizer "find a better query 
execution plan”. I wish we could link to something from here with more 
explanation. Currently, spark.sql.autoBroadcastJoinThreshold is the only place 
where metastore statistics are explicitly referenced as impacting the execution 
plan. Surely there must be other places, no? Would it be appropriate to mention 
the cost-based optimizer framework 
 somehow? It doesn’t appear 
to have any public documentation outside of Jira.

Any pointers or information you can provide would be very helpful. Again, I am 
interested in contributing some documentation improvements relating to 
statistics, but there is a lot I’m not sure about.

Nick



Algolia search on website is broken

2023-12-05 Thread Nicholas Chammas
Should I report this instead on Jira? Apologies if the dev list is not the 
right place.

Search on the website appears to be broken. For example, here is a search for 
“analyze”:



And here is the same search using DDG 
.

Nick



Are DataFrame rows ordered without an explicit ordering clause?

2023-09-18 Thread Nicholas Chammas
I’ve always considered DataFrames to be logically equivalent to SQL tables or 
queries.

In SQL, the result order of any query is implementation-dependent without an 
explicit ORDER BY clause. Technically, you could run `SELECT * FROM table;` 10 
times in a row and get 10 different orderings.

I thought the same applied to DataFrames, but the docstring for the recently 
added method DataFrame.offset implies otherwise.

This example will work fine in practice, of course. But if DataFrames are 
technically unordered without an explicit ordering clause, then in theory a 
future implementation change may result in “Bob" being the “first” row in the 
DataFrame, rather than “Tom”. That would make the example incorrect.

Is that not the case?

Nick
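
A minimal sketch of the concern, with data loosely modeled on that docstring 
example (which row comes "first" without a sort is up to the implementation):

df = spark.createDataFrame(
    [(14, "Tom"), (23, "Alice"), (16, "Bob")], ["age", "name"])

df.offset(1).show()                  # skips whichever row happens to come first
df.orderBy("age").offset(1).show()   # well-defined: always skips Tom (age 14)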



Allowing all Reader or Writer settings to be provided as options

2022-08-09 Thread Nicholas Chammas
Hello people,

I want to bring some attention to SPARK-39630 and ask if there are any design 
objections to the idea proposed there.

The gist of the proposal is that there are some reader or writer directives 
that cannot be supplied as options, like the format, write mode, or 
partitioning settings. Allowing those directives to be specified as options too 
means that it will become possible to fully represent a reader or writer as a 
map of options and reconstruct it from that.

This makes certain workflows more natural, especially when you are trying to 
manage reader or writer configurations declaratively.
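
To make the gist concrete, a sketch of what the ticket is asking for. The first 
block works today; the second is hypothetical, since "format", "mode", and 
"partitionBy" are not currently recognized as options:

# Today: these writer settings have dedicated methods and cannot be options.
(df.write
    .format("parquet")
    .mode("overwrite")
    .partitionBy("date")
    .option("compression", "zstd")
    .save("/tmp/out"))

# Proposed (hypothetical): the whole writer expressed as a single map of options,
# so it can be stored, diffed, and reconstructed declaratively.
writer_conf = {"format": "parquet", "mode": "overwrite",
               "partitionBy": "date", "compression": "zstd"}
df.write.options(**writer_conf).save("/tmp/out")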

Is there some design reason not to enable this, or is it just a matter of doing 
the work?

Feel free to comment either here or on the ticket.

Nick



Re: Deluge of GitBox emails

2022-04-04 Thread Nicholas Chammas
I’m not familiar with GitBox, but it must be an independent thing. When you 
participate in a PR, GitHub emails you notifications directly.

The GitBox emails, on the other hand, are going to the dev list. They seem like 
something setup as a repo-wide setting, or perhaps as an Apache bot that 
monitors repo activity and converts it into emails. (I’ve seen other projects 
-- I think Hadoop -- where GitHub activity is converted into comments on Jira.

Turning off these GitBox emails should not have an impact on the usual GitHub 
emails we are all already familiar with.


> On Apr 4, 2022, at 9:47 AM, Sean Owen  wrote:
> 
> I think this must be related to the Gitbox migration that just happened. It 
> does seem like I'm getting more emails - some are on PRs I'm attached to, but 
> some I don't recognize. The thing is, I'm not yet clear if they duplicate the 
> normal Github emails - that is if we turn them off do we have anything?
> 
> On Mon, Apr 4, 2022 at 8:44 AM Nicholas Chammas <nicholas.cham...@gmail.com> wrote:
> I assume I’m not the only one getting these new emails from GitBox. Is there 
> a story behind that that I missed?
> 
> I’d rather not get these emails on the dev list. I assume most of the list 
> would agree with me.
> 
> GitHub has a good set of options for following activity on the repo. People 
> who want to follow conversations can easily do that without involving the 
> whole dev list.
> 
> Do we know who is responsible for these GitBox emails? Perhaps we need to 
> file an Apache INFRA ticket?
> 
> Nick
> 
> 
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
> 



Deluge of GitBox emails

2022-04-04 Thread Nicholas Chammas
I assume I’m not the only one getting these new emails from GitBox. Is there a 
story behind that that I missed?

I’d rather not get these emails on the dev list. I assume most of the list 
would agree with me.

GitHub has a good set of options for following activity on the repo. People who 
want to follow conversations can easily do that without involving the whole dev 
list.

Do we know who is responsible for these GitBox emails? Perhaps we need to file 
an Apache INFRA ticket?

Nick


-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: [DISCUSS] Rename 'SQL' to 'SQL / DataFrame', and 'Query' to 'Execution' in SQL UI page

2022-03-28 Thread Nicholas Chammas
+1

Understanding the close relationship between SQL and DataFrames in Spark was a 
key learning moment for me, but I agree that using the terms interchangeably 
can be confusing.


> On Mar 27, 2022, at 9:27 PM, Hyukjin Kwon  wrote:
> 
> *for some reason, the image looks broken (to me). I am attaching again to 
> make sure.
> 
> 
> 
> On Mon, 28 Mar 2022 at 10:22, Hyukjin Kwon wrote:
> Hi all,
> 
> I have been investigating the improvements for Pandas API on Spark 
> specifically in UI.
> I chatted with a couple of people, and decided to send an email here to 
> discuss more.
> 
> Currently, both SQL and DataFrame API are shown in “SQL” tab as below:
> 
> 
> 
> which makes sense to developers because DataFrame API shares the same SQL 
> core but
> I do believe this makes less sense to end users. Please consider two more 
> points:
> 
> Spark ML users will run DataFrame-based MLlib API, but they will have to 
> check the "SQL" tab.
> Pandas API on Spark arguably has no link to SQL itself conceptually. It makes 
> less sense to users of pandas API.
> 
> So I would like to propose to rename:
> "SQL" to "SQL/DataFrame"
> "Query" to "Execution"
> 
> There's a PR open at https://github.com/apache/spark/pull/35973 
> . Please let me know your 
> thoughts on this. 
> 
> Thanks.



Re: Creating a memory-efficient AggregateFunction to calculate Median

2021-12-15 Thread Nicholas Chammas
Thanks for the suggestions. I suppose I should share a bit more about what
I tried/learned, so others who come later can understand why a
memory-efficient, exact median is not in Spark.

Spark's own ApproximatePercentile also uses QuantileSummaries internally
<https://github.com/apache/spark/blob/3f3201a7882b817a8a3ecbfeb369dde01e7689d8/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/ApproximatePercentile.scala#L225-L237>.
QuantileSummaries is a helper class for computing approximate quantiles
with a single pass over the data. I don't think I can use it to compute an
exact median.

Spark does already have code to compute an exact median: Percentile
<https://github.com/apache/spark/blob/08123a3795683238352e5bf55452de381349fdd9/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/Percentile.scala#L74-L80>.
Since it works like other Catalyst expressions, it computes the median with
a single pass over the data. It does that by loading all the data into a
buffer and sorting it in memory
<https://github.com/apache/spark/blob/08123a3795683238352e5bf55452de381349fdd9/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/Percentile.scala#L209>.
This is why the leading comment on Percentile warns that too much data will
cause GC pauses and OOMs
<https://github.com/apache/spark/blob/08123a3795683238352e5bf55452de381349fdd9/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/Percentile.scala#L37-L39>
.

So I think this is what Reynold was getting at: With the design of Catalyst
expressions as they are today, there is no way to save memory by making
multiple passes over the data. So an approximate median is your best bet if
you want to avoid high memory usage.

You can build an exact median by doing other things, like multiple passes
over the data, or by using window functions, but that can't be captured in
a Catalyst Expression.

On Wed, Dec 15, 2021 at 11:00 AM Fitch, Simeon  wrote:

> Nicholas,
>
> This may or may not be much help, but in RasterFrames we have an
> approximate quantiles Expression computed against Tiles (2d geospatial
> arrays) which makes use of
> `org.apache.spark.sql.catalyst.util.QuantileSummaries` to do the hard work.
> So perhaps a directionally correct example of doing what you look to do?
>
>
> https://github.com/locationtech/rasterframes/blob/develop/core/src/main/scala/org/locationtech/rasterframes/expressions/aggregates/ApproxCellQuantilesAggregate.scala
>
> In that same package are a number of other Aggregates, including
> declarative ones, which are another way of computing aggregations through
> composition of other Expressions.
>
> Simeon
>
>
>
>
>
> On Thu, Dec 9, 2021 at 9:26 PM Nicholas Chammas <
> nicholas.cham...@gmail.com> wrote:
>
>> I'm trying to create a new aggregate function. It's my first time working
>> with Catalyst, so it's exciting---but I'm also in a bit over my head.
>>
>> My goal is to create a function to calculate the median
>> <https://issues.apache.org/jira/browse/SPARK-26589>.
>>
>> As a very simple solution, I could just define median to be an alias of 
>> `Percentile(col,
>> 0.5)`. However, the leading comment on the Percentile expression
>> <https://github.com/apache/spark/blob/08123a3795683238352e5bf55452de381349fdd9/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/Percentile.scala#L37-L39>
>> highlights that it's very memory-intensive and can easily lead to
>> OutOfMemory errors.
>>
>> So instead of using Percentile, I'm trying to create an Expression that
>> calculates the median without needing to hold everything in memory at once.
>> I'm considering two different approaches:
>>
>> 1. Define Median as a combination of existing expressions: The median
>> can perhaps be built out of the existing expressions for Count
>> <https://github.com/apache/spark/blob/9af338cd685bce26abbc2dd4d077bde5068157b1/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/Count.scala#L48>
>> and NthValue
>> <https://github.com/apache/spark/blob/568ad6aa4435ce76ca3b5d9966e64259ea1f9b38/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/windowExpressions.scala#L675>
>> .
>>
>> I don't see a template I can follow for building a new expression out of
>> existing expressions (i.e. without having to implement a bunch of methods
>> for DeclarativeAggregate or ImperativeAggregate). I also don't know how I
>> would wrap NthValue to make it usable as a regular aggregate function. The
>> wrapped NthValue would need an implicit window that provides the necessary
>> ordering.
>>
>>

Re: Creating a memory-efficient AggregateFunction to calculate Median

2021-12-13 Thread Nicholas Chammas
Yeah, I think approximate percentile is good enough most of the time.

I don't have a specific need for a precise median. I was interested in
implementing it more as a Catalyst learning exercise, but it turns out I
picked a bad learning exercise to solve. :)

On Mon, Dec 13, 2021 at 9:46 PM Reynold Xin  wrote:

> tl;dr: there's no easy way to implement aggregate expressions that'd
> require multiple pass over data. It is simply not something that's
> supported and doing so would be very high cost.
>
> Would you be OK using approximate percentile? That's relatively cheap.
>
>
>
> On Mon, Dec 13, 2021 at 6:43 PM, Nicholas Chammas <
> nicholas.cham...@gmail.com> wrote:
>
>> No takers here? :)
>>
>> I can see now why a median function is not available in most data
>> processing systems. It's pretty annoying to implement!
>>
>> On Thu, Dec 9, 2021 at 9:25 PM Nicholas Chammas <
>> nicholas.cham...@gmail.com> wrote:
>>
>>> I'm trying to create a new aggregate function. It's my first time
>>> working with Catalyst, so it's exciting---but I'm also in a bit over my
>>> head.
>>>
>>> My goal is to create a function to calculate the median
>>> <https://issues.apache.org/jira/browse/SPARK-26589>.
>>>
>>> As a very simple solution, I could just define median to be an alias of 
>>> `Percentile(col,
>>> 0.5)`. However, the leading comment on the Percentile expression
>>> <https://github.com/apache/spark/blob/08123a3795683238352e5bf55452de381349fdd9/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/Percentile.scala#L37-L39>
>>> highlights that it's very memory-intensive and can easily lead to
>>> OutOfMemory errors.
>>>
>>> So instead of using Percentile, I'm trying to create an Expression that
>>> calculates the median without needing to hold everything in memory at once.
>>> I'm considering two different approaches:
>>>
>>> 1. Define Median as a combination of existing expressions: The median
>>> can perhaps be built out of the existing expressions for Count
>>> <https://github.com/apache/spark/blob/9af338cd685bce26abbc2dd4d077bde5068157b1/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/Count.scala#L48>
>>> and NthValue
>>> <https://github.com/apache/spark/blob/568ad6aa4435ce76ca3b5d9966e64259ea1f9b38/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/windowExpressions.scala#L675>
>>> .
>>>
>>> I don't see a template I can follow for building a new expression out of
>>> existing expressions (i.e. without having to implement a bunch of methods
>>> for DeclarativeAggregate or ImperativeAggregate). I also don't know how I
>>> would wrap NthValue to make it usable as a regular aggregate function. The
>>> wrapped NthValue would need an implicit window that provides the necessary
>>> ordering.
>>>
>>>
>>> Is there any potential to this idea? Any pointers on how to implement it?
>>>
>>>
>>> 2. Another memory-light approach to calculating the median requires
>>> multiple passes over the data to converge on the answer. The approach is 
>>> described
>>> here
>>> <https://www.quora.com/Distributed-Algorithms/What-is-the-distributed-algorithm-to-determine-the-median-of-arrays-of-integers-located-on-different-computers>.
>>> (I posted a sketch implementation of this approach using Spark's user-level
>>> API here
>>> <https://issues.apache.org/jira/browse/SPARK-26589?focusedCommentId=17452081=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17452081>
>>> .)
>>>
>>> I am also struggling to understand how I would build an aggregate
>>> function like this, since it requires multiple passes over the data. From
>>> what I can see, Catalyst's aggregate functions are designed to work with a
>>> single pass over the data.
>>>
>>> We don't seem to have an interface for AggregateFunction that supports
>>> multiple passes over the data. Is there some way to do this?
>>>
>>>
>>> Again, this is my first serious foray into Catalyst. Any specific
>>> implementation guidance is appreciated!
>>>
>>> Nick
>>>
>>
>


Re: Creating a memory-efficient AggregateFunction to calculate Median

2021-12-13 Thread Nicholas Chammas
No takers here? :)

I can see now why a median function is not available in most data
processing systems. It's pretty annoying to implement!

On Thu, Dec 9, 2021 at 9:25 PM Nicholas Chammas 
wrote:

> I'm trying to create a new aggregate function. It's my first time working
> with Catalyst, so it's exciting---but I'm also in a bit over my head.
>
> My goal is to create a function to calculate the median
> <https://issues.apache.org/jira/browse/SPARK-26589>.
>
> As a very simple solution, I could just define median to be an alias of 
> `Percentile(col,
> 0.5)`. However, the leading comment on the Percentile expression
> <https://github.com/apache/spark/blob/08123a3795683238352e5bf55452de381349fdd9/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/Percentile.scala#L37-L39>
> highlights that it's very memory-intensive and can easily lead to
> OutOfMemory errors.
>
> So instead of using Percentile, I'm trying to create an Expression that
> calculates the median without needing to hold everything in memory at once.
> I'm considering two different approaches:
>
> 1. Define Median as a combination of existing expressions: The median can
> perhaps be built out of the existing expressions for Count
> <https://github.com/apache/spark/blob/9af338cd685bce26abbc2dd4d077bde5068157b1/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/Count.scala#L48>
> and NthValue
> <https://github.com/apache/spark/blob/568ad6aa4435ce76ca3b5d9966e64259ea1f9b38/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/windowExpressions.scala#L675>
> .
>
> I don't see a template I can follow for building a new expression out of
> existing expressions (i.e. without having to implement a bunch of methods
> for DeclarativeAggregate or ImperativeAggregate). I also don't know how I
> would wrap NthValue to make it usable as a regular aggregate function. The
> wrapped NthValue would need an implicit window that provides the necessary
> ordering.
>
>
> Is there any potential to this idea? Any pointers on how to implement it?
>
>
> 2. Another memory-light approach to calculating the median requires
> multiple passes over the data to converge on the answer. The approach is 
> described
> here
> <https://www.quora.com/Distributed-Algorithms/What-is-the-distributed-algorithm-to-determine-the-median-of-arrays-of-integers-located-on-different-computers>.
> (I posted a sketch implementation of this approach using Spark's user-level
> API here
> <https://issues.apache.org/jira/browse/SPARK-26589?focusedCommentId=17452081=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17452081>
> .)
>
> I am also struggling to understand how I would build an aggregate function
> like this, since it requires multiple passes over the data. From what I can
> see, Catalyst's aggregate functions are designed to work with a single pass
> over the data.
>
> We don't seem to have an interface for AggregateFunction that supports
> multiple passes over the data. Is there some way to do this?
>
>
> Again, this is my first serious foray into Catalyst. Any specific
> implementation guidance is appreciated!
>
> Nick
>
>


Creating a memory-efficient AggregateFunction to calculate Median

2021-12-09 Thread Nicholas Chammas
I'm trying to create a new aggregate function. It's my first time working
with Catalyst, so it's exciting---but I'm also in a bit over my head.

My goal is to create a function to calculate the median
<https://issues.apache.org/jira/browse/SPARK-26589>.

As a very simple solution, I could just define median to be an alias
of `Percentile(col,
0.5)`. However, the leading comment on the Percentile expression
<https://github.com/apache/spark/blob/08123a3795683238352e5bf55452de381349fdd9/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/Percentile.scala#L37-L39>
highlights that it's very memory-intensive and can easily lead to
OutOfMemory errors.

So instead of using Percentile, I'm trying to create an Expression that
calculates the median without needing to hold everything in memory at once.
I'm considering two different approaches:

1. Define Median as a combination of existing expressions: The median can
perhaps be built out of the existing expressions for Count
<https://github.com/apache/spark/blob/9af338cd685bce26abbc2dd4d077bde5068157b1/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/Count.scala#L48>
and NthValue
<https://github.com/apache/spark/blob/568ad6aa4435ce76ca3b5d9966e64259ea1f9b38/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/windowExpressions.scala#L675>.

I don't see a template I can follow for building a new expression out of
existing expressions (i.e. without having to implement a bunch of methods
for DeclarativeAggregate or ImperativeAggregate). I also don't know how I
would wrap NthValue to make it usable as a regular aggregate function. The
wrapped NthValue would need an implicit window that provides the necessary
ordering.


Is there any potential to this idea? Any pointers on how to implement it?
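(A rough DataFrame-level sketch of this idea, using row_number in place of NthValue; column names and data are made up, and the single global window below is exactly the memory/scalability concern in question:)

```
# Number the rows in sorted order, then keep the middle one(s).
from pyspark.sql import SparkSession, functions as F, Window

spark = SparkSession.builder.getOrCreate()
df = spark.range(1, 12).withColumnRenamed("id", "x")  # 11 rows, median is 6

n = df.count()
ranked = df.withColumn("rn", F.row_number().over(Window.orderBy("x")))

# Odd n keeps one row; even n keeps the two middle rows and averages them.
middle = ranked.filter(F.col("rn").isin(n // 2 + (n % 2), n // 2 + 1))
middle.agg(F.avg("x").alias("median")).show()
```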


2. Another memory-light approach to calculating the median requires
multiple passes over the data to converge on the answer. The approach
is described here
<https://www.quora.com/Distributed-Algorithms/What-is-the-distributed-algorithm-to-determine-the-median-of-arrays-of-integers-located-on-different-computers>.
(I posted a sketch implementation of this approach using Spark's user-level
API here
<https://issues.apache.org/jira/browse/SPARK-26589?focusedCommentId=17452081=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17452081>.)

I am also struggling to understand how I would build an aggregate function
like this, since it requires multiple passes over the data. From what I can
see, Catalyst's aggregate functions are designed to work with a single pass
over the data.

We don't seem to have an interface for AggregateFunction that supports
multiple passes over the data. Is there some way to do this?
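(For reference, a rough sketch of that multi-pass idea at the user level, assuming integer-valued data; each loop iteration is one full pass over the data:)

```
# Binary-search the value domain: count how many values fall at or below a
# pivot and narrow the range until it converges on the median.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.range(1, 1002).withColumnRenamed("id", "x")  # 1..1001, median 501

n = df.count()
target = (n + 1) // 2                       # rank of the median (odd n)
lo, hi = df.agg(F.min("x"), F.max("x")).first()

while lo < hi:
    mid = (lo + hi) // 2
    if df.filter(F.col("x") <= mid).count() >= target:
        hi = mid
    else:
        lo = mid + 1

print("median =", lo)  # 501
```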


Again, this is my first serious foray into Catalyst. Any specific
implementation guidance is appreciated!

Nick


Re: [Apache Spark Jenkins] build system shutting down Dec 23th, 2021

2021-12-06 Thread Nicholas Chammas
Farewell to Jenkins and its classic weather forecast build status icons:

[image: health-80plus.png][image: health-60to79.png][image:
health-40to59.png][image: health-20to39.png][image: health-00to19.png]

And thank you Shane for all the help over these years.

Will you be nuking all the Jenkins-related code in the repo after the 23rd?

On Mon, Dec 6, 2021 at 3:02 PM shane knapp ☠  wrote:

> hey everyone!
>
> after a marathon run of nearly a decade, we're finally going to be
> shutting down {amp|rise}lab jenkins at the end of this month...
>
> the earliest snapshot i could find is from 2013 with builds for spark 0.7:
>
> https://web.archive.org/web/20130426155726/https://amplab.cs.berkeley.edu/jenkins/
>
> it's been a hell of a run, and i'm gonna miss randomly tweaking the build
> system, but technology has moved on and running a dedicated set of servers
> for just one open source project is just too expensive for us here at uc
> berkeley.
>
> if there's interest, i'll fire up a zoom session and all y'alls can watch
> me type the final command:
>
> systemctl stop jenkins
>
> feeling bittersweet,
>
> shane
> --
> Shane Knapp
> Computer Guy / Voice of Reason
> UC Berkeley EECS Research / RISELab Staff Technical Lead
> https://rise.cs.berkeley.edu
>


Re: Supports Dynamic Table Options for Spark SQL

2021-11-15 Thread Nicholas Chammas
Side note about time travel: There is a PR
 to add VERSION/TIMESTAMP AS OF
syntax to Spark SQL.
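For context on the thread quoted below: the per-scan options being discussed already exist in the DataFrameReader API, and the SQL hint would expose the same thing in SQL. A rough sketch of that existing path (the JDBC URL and table name are placeholders, and a reachable database plus driver are assumed):

```
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Per-read options, e.g. tuning the JDBC fetchsize for one query instead of
# baking it into the table definition.
df = (spark.read
      .format("jdbc")
      .option("url", "jdbc:postgresql://db-host:5432/mydb")  # placeholder
      .option("dbtable", "t")                                # placeholder
      .option("fetchsize", "10000")
      .load())
```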

On Mon, Nov 15, 2021 at 2:23 PM Ryan Blue  wrote:

> I want to note that I wouldn't recommend time traveling this way by using
> the hint for `snapshot-id`. Instead, we want to add the standard SQL syntax
> for that in a separate PR. This is useful for other options that help a
> table scan perform better, like specifying the target split size.
>
> You're right that this isn't a typical optimizer hint, but I'm not sure
> what other syntax is possible for this use case. How else would we send
> custom properties through to the scan?
>
> On Mon, Nov 15, 2021 at 9:25 AM Mich Talebzadeh 
> wrote:
>
>> I am looking at the hint and it appears to me (I stand corrected), it is
>> a single table hint as below:
>>
>> -- time travel
>> SELECT * FROM t /*+ OPTIONS('snapshot-id'='10963874102873L') */
>>
>> My assumption is that any view on this table will also benefit from this
>> hint. This is not a hint to optimizer in a classical sense. Only a snapshot
>> hint. Normally, a hint is an instruction to the optimizer. When writing
>> SQL, one may know information about the data unknown to the optimizer.
>> Hints enable one to make decisions normally made by the optimizer,
>> sometimes causing the optimizer to select a plan that it sees as higher
>> cost.
>>
>>
>> So far as this case is concerned, it looks OK and I concur it should be
>> extended to write as well.
>>
>>
>> HTH
>>
>>
>>view my Linkedin profile
>> 
>>
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>>
>> On Mon, 15 Nov 2021 at 17:02, Russell Spitzer 
>> wrote:
>>
>>> I think since we probably will end up using this same syntax on write,
>>> this makes a lot of sense. Unless there is another good way to express a
>>> similar concept during a write operation I think going forward with this
>>> would be ok.
>>>
>>> On Mon, Nov 15, 2021 at 10:44 AM Ryan Blue  wrote:
>>>
 The proposed feature is to be able to pass options through SQL like you
 would when using the DataFrameReader API, so it would work for all
 sources that support read options. Read options are part of the DSv2 API,
 there just isn’t a way to pass options when using SQL. The PR also has a
 non-Iceberg example, which is being able to customize some JDBC source
 behaviors per query (e.g., fetchSize), rather than globally in the table’s
 options.

 The proposed syntax is odd, but I think that's an artifact of Spark
 introducing read options that aren't a normal part of SQL. Seems reasonable
 to me to pass them through a hint.

 On Mon, Nov 15, 2021 at 2:18 AM Mich Talebzadeh <
 mich.talebza...@gmail.com> wrote:

> Interesting.
>
> What is this going to add on top of support for Apache Iceberg
> . Will it be in
> line with support for Hive ACID tables or Delta Lake?
>
> HTH
>
>
>
>view my Linkedin profile
> 
>
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for
> any loss, damage or destruction of data or any other property which may
> arise from relying on this email's technical content is explicitly
> disclaimed. The author will in no case be liable for any monetary damages
> arising from such loss, damage or destruction.
>
>
>
>
> On Mon, 15 Nov 2021 at 01:56, Zhun Wang 
> wrote:
>
>> Hi dev,
>>
>> We are discussing Support Dynamic Table Options for Spark SQL (
>> https://github.com/apache/spark/pull/34072). It is currently not
>> sure if the syntax makes sense, and would like to know if there is other
>> feedback or opinion on this.
>>
>> I would appreciate any feedback on this.
>>
>> Thanks.
>>
>

 --
 Ryan Blue
 Tabular

>>>
>
> --
> Ryan Blue
> Tabular
>


Jira components cleanup

2021-11-15 Thread Nicholas Chammas
https://issues.apache.org/jira/projects/SPARK?selectedItem=com.atlassian.jira.jira-projects-plugin:components-page

I think the "docs" component should be merged into "Documentation".

Likewise, the "k8" component should be merged into "Kubernetes".

I think anyone can technically update tags, but I think mass retagging
should be limited to admins (or at least, to someone who got prior approval
from an admin).

Nick


Re: [DISCUSS] Support pandas API layer on PySpark

2021-03-17 Thread Nicholas Chammas
On Tue, Mar 16, 2021 at 9:15 PM Hyukjin Kwon  wrote:

>   I am currently thinking we will have to convert the Koalas tests to use
> unittests to match with PySpark for now.
>
Keep in mind that pytest supports unittest-based tests out of the box, so you should be able to
run pytest against the PySpark codebase without changing much about the
tests.
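A tiny sketch of what that looks like in practice (the file name is arbitrary):

```
# test_example.py -- pytest collects and runs plain unittest.TestCase classes
# as-is, so unittest-style PySpark tests don't need rewriting.
import unittest


class SimpleTest(unittest.TestCase):
    def test_addition(self):
        self.assertEqual(1 + 1, 2)


# Both commands run the same test:
#   python -m unittest test_example
#   pytest test_example.py
```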


Re: [DISCUSS] Support pandas API layer on PySpark

2021-03-15 Thread Nicholas Chammas
On Mon, Mar 15, 2021 at 2:12 AM Reynold Xin  wrote:

> I don't think we should deprecate existing APIs.
>

+1

I strongly prefer Spark's immutable DataFrame API to the Pandas API. I
could be wrong, but I wager most people who have worked with both Spark and
Pandas feel the same way.

For the large community of current PySpark users, or users switching to
PySpark from another Spark language API, it doesn't make sense to deprecate
the current API, even by convention.


Re: Shutdown cleanup of disk-based resources that Spark creates

2021-03-11 Thread Nicholas Chammas
OK, perhaps the best course of action is to leave the current behavior
as-is but clarify the documentation for `.checkpoint()` and/or
`cleanCheckpoints`.

I personally find it confusing that `cleanCheckpoints` doesn't address
shutdown behavior, and the Stack Overflow links I shared
<https://issues.apache.org/jira/browse/SPARK-33000> show that many people
are in the same situation. There is clearly some demand for Spark to
automatically clean up checkpoints on shutdown. But perhaps that should
be... a new config? a rejected feature? something else? I dunno.

Does anyone else have thoughts on how to approach this?

On Wed, Mar 10, 2021 at 4:39 PM Attila Zsolt Piros <
piros.attila.zs...@gmail.com> wrote:

> > Checkpoint data is left behind after a normal shutdown, not just an
> unexpected shutdown. The PR description includes a simple demonstration of
> this.
>
> I think I might have overemphasized the "unexpected" adjective a bit to show
> you the value in the current behavior.
>
> The feature configured with
> "spark.cleaner.referenceTracking.cleanCheckpoints" is about out-of-scope
> references, without ANY shutdown.
>
> It would be hard to distinguish, at that level (ShutdownHookManager), the
> unexpected from the intentional exits.
> The user code (run by the driver) could contain a System.exit() which was
> added by the developer for numerous reasons, so distinguishing
> unexpected from intentional exits is not really an option.
> Even a third-party library can contain a System.exit(). Would that be an
> unexpected exit or an intentional one? You can see it is hard to tell.
>
> To test the real feature
> behind "spark.cleaner.referenceTracking.cleanCheckpoints", you can create a
> reference within a scope which is then closed. For example, create it within
> the body of a function (with no return value) and store it only in a local
> variable. After the scope is closed, i.e. when the caller of our function
> gets control back, you have a chance to see the context cleaner working
> (you might even need to trigger a GC too).
>
> On Wed, Mar 10, 2021 at 10:09 PM Nicholas Chammas <
> nicholas.cham...@gmail.com> wrote:
>
>> Checkpoint data is left behind after a normal shutdown, not just an
>> unexpected shutdown. The PR description includes a simple demonstration of
>> this.
>>
>> If the current behavior is truly intended -- which I find difficult to
>> believe given how confusing <https://stackoverflow.com/q/52630858/877069>
>> it <https://stackoverflow.com/q/60009856/877069> is
>> <https://stackoverflow.com/q/61454740/877069> -- then at the very least
>> we need to update the documentation for both `.checkpoint()` and
>> `cleanCheckpoints` to make that clear.
>>
>> > This way even after an unexpected exit the next run of the same app
>> should be able to pick up the checkpointed data.
>>
>> The use case you are describing potentially makes sense. But preserving
>> checkpoint data after an unexpected shutdown -- even when
>> `cleanCheckpoints` is set to true -- is a new guarantee that is not
>> currently expressed in the API or documentation. At least as far as I can
>> tell.
>>
>> On Wed, Mar 10, 2021 at 3:10 PM Attila Zsolt Piros <
>> piros.attila.zs...@gmail.com> wrote:
>>
>>> Hi Nick!
>>>
>>> I am not sure you are fixing a problem here. I think what you see as a
>>> problem is actually intended behaviour.
>>>
>>> Checkpoint data should outlive unexpected shutdowns. So there is a
>>> very important difference between when a reference goes out of scope during
>>> a normal execution (in this case cleanup is expected, depending on the
>>> config you mentioned) and when a reference goes out of scope because of an
>>> unexpected error (in this case you should keep the checkpoint data).
>>>
>>> This way even after an unexpected exit the next run of the same app
>>> should be able to pick up the checkpointed data.
>>>
>>> Best Regards,
>>> Attila
>>>
>>>
>>>
>>>
>>> On Wed, Mar 10, 2021 at 8:10 PM Nicholas Chammas <
>>> nicholas.cham...@gmail.com> wrote:
>>>
>>>> Hello people,
>>>>
>>>> I'm working on a fix for SPARK-33000
>>>> <https://issues.apache.org/jira/browse/SPARK-33000>. Spark does not
>>>> cleanup checkpointed RDDs/DataFrames on shutdown, even if the appropriate
>>>> configs are set.
>>>>
>>>> In the course of developing a fix, another contributor pointed out
>>>> <https://github.com/apache/spa

Re: Shutdown cleanup of disk-based resources that Spark creates

2021-03-10 Thread Nicholas Chammas
Checkpoint data is left behind after a normal shutdown, not just an
unexpected shutdown. The PR description includes a simple demonstration of
this.

If the current behavior is truly intended -- which I find difficult to
believe given how confusing <https://stackoverflow.com/q/52630858/877069> it
<https://stackoverflow.com/q/60009856/877069> is
<https://stackoverflow.com/q/61454740/877069> -- then at the very least we
need to update the documentation for both `.checkpoint()` and
`cleanCheckpoints` to make that clear.

> This way even after an unexpected exit the next run of the same app
should be able to pick up the checkpointed data.

The use case you are describing potentially makes sense. But preserving
checkpoint data after an unexpected shutdown -- even when
`cleanCheckpoints` is set to true -- is a new guarantee that is not
currently expressed in the API or documentation. At least as far as I can
tell.

On Wed, Mar 10, 2021 at 3:10 PM Attila Zsolt Piros <
piros.attila.zs...@gmail.com> wrote:

> Hi Nick!
>
> I am not sure you are fixing a problem here. I think what you see as a
> problem is actually intended behaviour.
>
> Checkpoint data should outlive unexpected shutdowns. So there is a
> very important difference between when a reference goes out of scope during a
> normal execution (in this case cleanup is expected, depending on the config
> you mentioned) and when a reference goes out of scope because of an
> unexpected error (in this case you should keep the checkpoint data).
>
> This way even after an unexpected exit the next run of the same app should
> be able to pick up the checkpointed data.
>
> Best Regards,
> Attila
>
>
>
>
> On Wed, Mar 10, 2021 at 8:10 PM Nicholas Chammas <
> nicholas.cham...@gmail.com> wrote:
>
>> Hello people,
>>
>> I'm working on a fix for SPARK-33000
>> <https://issues.apache.org/jira/browse/SPARK-33000>. Spark does not
>> cleanup checkpointed RDDs/DataFrames on shutdown, even if the appropriate
>> configs are set.
>>
>> In the course of developing a fix, another contributor pointed out
>> <https://github.com/apache/spark/pull/31742#issuecomment-790987483> that
>> checkpointed data may not be the only type of resource that needs a fix for
>> shutdown cleanup.
>>
>> I'm looking for a committer who might have an opinion on how Spark should
>> clean up disk-based resources on shutdown. The last people who contributed
>> significantly to the ContextCleaner, where this cleanup happens, were
>> @witgo <https://github.com/witgo> and @andrewor14
>> <https://github.com/andrewor14>. But that was ~6 years ago, and I don't
>> think they are active on the project anymore.
>>
>> Any takers to take a look and give their thoughts? The PR is small
>> <https://github.com/apache/spark/pull/31742>. +39 / -2.
>>
>> Nick
>>
>>


Shutdown cleanup of disk-based resources that Spark creates

2021-03-10 Thread Nicholas Chammas
Hello people,

I'm working on a fix for SPARK-33000
<https://issues.apache.org/jira/browse/SPARK-33000>. Spark does not cleanup
checkpointed RDDs/DataFrames on shutdown, even if the appropriate configs
are set.
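A minimal sketch of the behavior in question, along the lines of the demonstration in the PR description (paths here are arbitrary):

```
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .config("spark.cleaner.referenceTracking.cleanCheckpoints", "true")
         .getOrCreate())
spark.sparkContext.setCheckpointDir("/tmp/spark-checkpoints")  # arbitrary path

df = spark.range(10).checkpoint()  # eagerly writes checkpoint files to disk
df.count()
spark.stop()
# After this clean shutdown, /tmp/spark-checkpoints still contains the
# checkpointed data, which is the behavior SPARK-33000 describes.
```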

In the course of developing a fix, another contributor pointed out
 that
checkpointed data may not be the only type of resource that needs a fix for
shutdown cleanup.

I'm looking for a committer who might have an opinion on how Spark should
clean up disk-based resources on shutdown. The last people who contributed
significantly to the ContextCleaner, where this cleanup happens, were @witgo
 and @andrewor14 .
But that was ~6 years ago, and I don't think they are active on the project
anymore.

Any takers to take a look and give their thoughts? The PR is small
<https://github.com/apache/spark/pull/31742>. +39 / -2.

Nick


Re: Auto-closing PRs or How to get reviewers' attention

2021-02-18 Thread Nicholas Chammas
On Thu, Feb 18, 2021 at 10:34 AM Sean Owen  wrote:

> There is no way to force people to review or commit something of course.
> And keep in mind we get a lot of, shall we say, unuseful pull requests.
> There is occasionally some blowback to closing someone's PR, so the path of
> least resistance is often the timeout / 'soft close'. That is, it takes a
> lot more time to satisfactorily debate down the majority of PRs that
> probably shouldn't get merged, and there just isn't that much bandwidth.
> That said of course it's bad if lots of good PRs are getting lost in the
> shuffle and I am sure there are some.
>
> One other aspect is that a committer is taking some degree of
> responsibility for merging a change, so the ask is more than just a few
> minutes of eyeballing. If it breaks something the merger pretty much owns
> resolving it, and, the whole project owns any consequence of the change for
> the future.
>

+1


Re: Auto-closing PRs or How to get reviewers' attention

2021-02-18 Thread Nicholas Chammas
On Thu, Feb 18, 2021 at 9:58 AM Enrico Minack 
wrote:

> *What is the approved way to ...*
>
> *... prevent it from being auto-closed?* Committing and commenting to the
> PR does not prevent it from being closed the next day.
>
Committing and commenting should prevent the PR from being closed. It may
be that commenting after the stale message has been posted does not work
(which would likely be a bug in the action
 or in our config
),
but there are PRs that have been open for months with consistent activity
that do not get closed.

So at the very least, proactively committing or commenting every month will
keep the PR open. However, that's not the real problem, right? The real
problem is getting committer attention.

> *... re-open it?* The comment says "If you'd like to revive this PR,
> please reopen it ...", but there is no re-open button anywhere on the PR!
>
I don't know if there is a repo setting here that allows non-committers to
reopen their own closed PRs. At the very worst, you can always open a new
PR from the same branch, though we should update the stale message text if
contributors cannot in fact reopen their own PRs.

> What is the expected contributor's response to a PR that does not get
> feedback? Giving up?
>
I've baby-sat several PRs that took months to get in. Here's an example
 off the top of my head (5-6
months to be merged in). I'm sure everyone on here, including most
committers themselves, have had this experience. It's common. The expected
response is to be persistent, to try to find a committer or shepherd for
your PR, and to proactively make your PR easier to review.

> Are there processes in place to increase the probability PRs do not get
> forgotten, auto-closed and lost?
>
There are things you can do as a contributor to increase the likelihood
your PR will get reviewed, but I wouldn't call them "processes". This is an
open source project built on corporate sponsorship for some stuff and
volunteer energy for everything else. There is no guarantee or formal
obligation for anyone to do the work of reviewing PRs. That's just the
nature of open source work.

The things that you can do are:

   - Make your PR as small and focused as possible.
   - Make sure the build is passing and that you've followed the
   contributing guidelines.
   - Find the people who most recently worked on the area you're touching
   and ask them for help.
   - Address reviewers' requests and concerns.
   - Try to get some committer buy-in for your idea before spending time
   developing it.
   - Ask for input on the dev list for your PR.

Basically, most of the advice boils down to "make it easy for reviewers".
Even then, though, sometimes things won't work out
 (5-6 months and closed without
merging). It's just the nature of contributing to a large project like
Spark where there is a lot going on.


Re: [Spark SQL]: SQL, Python, Scala and R API Consistency

2021-01-28 Thread Nicholas Chammas
On Thu, Jan 28, 2021 at 3:40 PM Sean Owen  wrote:

> It isn't that regexp_extract_all (for example) is useless outside SQL,
> just, where do you draw the line? Supporting 10s of random SQL functions
> across 3 other languages has a cost, which has to be weighed against
> benefit, which we can never measure well except anecdotally: one or two
> people say "I want this" in a sea of hundreds of thousands of users.
>

+1 to this, but I will add that Jira and Stack Overflow activity can
sometimes give good signals about API gaps that are frustrating users. If
there is an SO question with 30K views about how to do something that
should have been easier, then that's an important signal about the API.

For this specific case, I think there is a fine argument
> that regexp_extract_all should be added simply for consistency
> with regexp_extract. I can also see the argument that regexp_extract was a
> step too far, but, what's public is now a public API.
>

I think in this case a few references to where/how people are having to
work around missing a direct function for regexp_extract_all could help
guide the decision. But that itself means we are making these decisions on
a case-by-case basis.
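For concreteness, the workaround people typically reach for is routing the SQL-only function through expr(); a sketch, assuming a Spark version where regexp_extract_all exists in SQL:

```
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("100-200, 300-400",)], ["s"])

# No F.regexp_extract_all wrapper, so call the SQL function via expr().
df.select(
    F.expr(r"regexp_extract_all(s, '(\\d+)-(\\d+)', 1)").alias("firsts")
).show(truncate=False)  # firsts: [100, 300]
```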

From a user perspective, it's definitely conceptually simpler to have SQL
functions be consistent and available across all APIs.

Perhaps if we had a way to lower the maintenance burden of keeping
functions in sync across SQL/Scala/Python/R, it would be easier for
everyone to agree to just have all the functions be included across the
board all the time.

Would, for example, some sort of automatic testing mechanism for SQL
functions help here? Something that uses a common function testing
specification to automatically test SQL, Scala, Python, and R functions,
without requiring maintainers to write tests for each language's version of
the functions. Would that address the maintenance burden?


Re: [DISCUSS][SPIP] Standardize Spark Exception Messages

2020-10-25 Thread Nicholas Chammas
Just want to call out that this SPIP should probably account somehow for
PySpark and the work being done in SPARK-32082
 to improve PySpark
exceptions.

On Sun, Oct 25, 2020 at 8:05 PM Xinyi Yu  wrote:

> Hi all,
>
> We like to post a SPIP of Standardize Exception Messages in Spark. Here is
> the document link:
>
> https://docs.google.com/document/d/1XGj1o3xAFh8BA7RCn3DtwIPC6--hIFOaNUNSlpaOIZs/edit?usp=sharing
>
>
> This SPIP aims to standardize the exception messages in Spark. It has three
> major focuses:
> 1. Group exception messages in dedicated files for easy maintenance and
> auditing.
> 2. Establish an error message guideline for developers.
> 3. Improve error message quality.
>
> Thanks for your time and patience. Looking forward to your feedback!
>
>
>
> --
> Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


Re: get method guid prefix for file parts for write

2020-09-25 Thread Nicholas Chammas
I think what George is looking for is a way to determine ahead of time the
partition IDs that Spark will use when writing output.

George,

I believe this is an example of what you're looking for:
https://github.com/databricks/spark-redshift/blob/184b4428c1505dff7b4365963dc344197a92baa9/src/main/scala/com/databricks/spark/redshift/RedshiftWriter.scala#L240-L257

Specifically, the part that says "TaskContext.get.partitionId()".

I don't know how much of that is part of Spark's public API, but there it
is.

It would be useful if Spark offered a way to get a manifest of output files
for any given write operation, similar to Redshift's MANIFEST option. This would
help when, for example, you need to pass a list of files output by Spark to
some other system (like Redshift) and don't want to have to worry about the
consistency guarantees of your object store's list operations.

Nick
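(For readers skimming the archive, a rough PySpark sketch of the partition-ID piece, the user-level analogue of the TaskContext.get.partitionId() call referenced above; the data is made up:)

```
from pyspark import TaskContext
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.range(100).repartition(4)

def tag_with_partition_id(rows):
    # Runs on the executors; each task sees its own partition ID.
    pid = TaskContext.get().partitionId()
    for row in rows:
        yield (pid, row.id)

tagged = df.rdd.mapPartitions(tag_with_partition_id).toDF(["partition_id", "id"])
tagged.groupBy("partition_id").count().show()
```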

On Fri, Sep 25, 2020 at 2:00 PM EveLiao  wrote:

> If I understand your problem correctly, the prefix you provided is actually
> "-" + UUID. You can get it by uuid generator like
> https://docs.python.org/3/library/uuid.html#uuid.uuid4.
>
>
>
> --
> Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


PySpark: Un-deprecating inferring DataFrame schema from list of dictionaries

2020-08-24 Thread Nicholas Chammas
https://github.com/apache/spark/pull/29510

I don't think this is a big deal, but since we're removing a deprecation
that has been around for ~6 years, I figured it would be good to bring
everyone's attention to this change.

Hopefully, we are not breaking any hidden assumptions about the direction
of the PySpark API.

Nick


Re: Use Hadoop-3.2 as a default Hadoop profile in 3.0.0?

2020-06-24 Thread Nicholas Chammas
To rephrase my earlier email, PyPI users would care about the bundled
Hadoop version if they have a workflow that, in effect, looks something
like this:

```
pip install pyspark
pyspark --packages org.apache.hadoop:hadoop-aws:2.7.7
spark.read.parquet('s3a://...')
```

I agree that Hadoop 3 would be a better default (again, the s3a support is
just much better). But to Xiao's point, if you are expecting Spark to work
with some package like hadoop-aws that assumes an older version of Hadoop
bundled with Spark, then changing the default may break your workflow.

In the case of hadoop-aws the fix is simple--just bump hadoop-aws:2.7.7 to
hadoop-aws:3.2.1. But perhaps there are other PyPI-based workflows that
would be more difficult to repair. 🤷‍♂️
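In other words, the repaired workflow would look something like this (assuming a PySpark build that bundles Hadoop 3.2):

```
pip install pyspark
pyspark --packages org.apache.hadoop:hadoop-aws:3.2.1
spark.read.parquet('s3a://...')
```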

On Wed, Jun 24, 2020 at 1:44 PM Sean Owen  wrote:

> I'm also genuinely curious when PyPI users would care about the
> bundled Hadoop jars - do we even need two versions? that itself is
> extra complexity for end users.
> I do think Hadoop 3 is the better choice for the user who doesn't
> care, and better long term.
> OK but let's at least move ahead with changing defaults.
>
> On Wed, Jun 24, 2020 at 12:38 PM Xiao Li  wrote:
> >
> > Hi, Dongjoon,
> >
> > Please do not misinterpret my point. I already clearly said "I do not
> know how to track the popularity of Hadoop 2 vs Hadoop 3."
> >
> > Also, let me repeat my opinion:  the top priority is to provide two
> options for PyPi distribution and let the end users choose the ones they
> need. Hadoop 3.2 or Hadoop 2.7. In general, when we want to make any
> breaking change, let us follow our protocol documented in
> https://spark.apache.org/versioning-policy.html.
> >
> > If you just want to change the Jenkins setup, I am OK about it. If you
> want to change the default distribution, we need more discussions in the
> community for getting an agreement.
> >
> >  Thanks,
> >
> > Xiao
> >
>


Re: Use Hadoop-3.2 as a default Hadoop profile in 3.0.0?

2020-06-24 Thread Nicholas Chammas
The team I'm on currently uses pip-installed PySpark for local development,
and we regularly access S3 directly from our laptops/workstations.

One of the benefits of having Spark built against Hadoop 3.2 vs. 2.7 is
being able to use a recent version of hadoop-aws that has mature support
for s3a. With Hadoop 2.7 the support for s3a is buggy and incomplete, and
there are incompatibilities that prevent you from using Spark built against
Hadoop 2.7 with hadoop-aws version 2.8 or newer.

On Wed, Jun 24, 2020 at 10:15 AM Sean Owen  wrote:

> Will pyspark users care much about Hadoop version? they won't if running
> locally. They will if connecting to a Hadoop cluster. Then again in that
> context, they're probably using a distro anyway that harmonizes it.
> Hadoop 3's installed base can't be that large yet; it's been around far
> less time.
>
> The bigger question indeed is dropping Hadoop 2.x / Hive 1.x etc
> eventually, not now.
> But if the question now is build defaults, is it a big deal either way?
>
> On Tue, Jun 23, 2020 at 11:03 PM Xiao Li  wrote:
>
>> I think we just need to provide two options and let end users choose the
>> ones they need. Hadoop 3.2 or Hadoop 2.7. Thus, SPARK-32017 (Make Pyspark
>> Hadoop 3.2+ Variant available in PyPI) is a high priority task for Spark
>> 3.1 release to me.
>>
>> I do not know how to track the popularity of Hadoop 2 vs Hadoop 3. Based
>> on this link
>> https://mvnrepository.com/artifact/org.apache.hadoop/hadoop-hdfs , it
>> sounds like Hadoop 3.x is not as popular as Hadoop 2.7.
>>
>>
>>


Re: [VOTE] Release Spark 2.4.6 (RC8)

2020-06-03 Thread Nicholas Chammas
I believe that was fixed in 3.0 and there was a decision not to backport
the fix: SPARK-31170 

On Wed, Jun 3, 2020 at 1:04 PM Xiao Li  wrote:

> Just downloaded it in my local macbook. Trying to create a table using the
> pre-built PySpark. It sounds like the conf "spark.sql.warehouse.dir"
> does not take an effect. It is trying to create a directory in
> "file:/user/hive/warehouse/t1". I have not done any investigation yet. Have
> any of you hit the same issue?
>
> C02XT0U7JGH5:bin lixiao$ ./pyspark --conf
> spark.sql.warehouse.dir="/Users/lixiao/Downloads/spark-2.4.6-bin-hadoop2.6"
>
> Python 2.7.16 (default, Jan 27 2020, 04:46:15)
>
> [GCC 4.2.1 Compatible Apple LLVM 10.0.1 (clang-1001.0.37.14)] on darwin
>
> Type "help", "copyright", "credits" or "license" for more information.
>
> 20/06/03 09:56:11 WARN NativeCodeLoader: Unable to load native-hadoop
> library for your platform... using builtin-java classes where applicable
>
> Using Spark's default log4j profile:
> org/apache/spark/log4j-defaults.properties
>
> Setting default log level to "WARN".
>
> To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use
> setLogLevel(newLevel).
>
> Welcome to
>
>     __
>       ____              __
>      / __/__  ___ _____/ /__
>     _\ \/ _ \/ _ `/ __/  '_/
>    /__ / .__/\_,_/_/ /_/\_\   version 2.4.6
>       /_/
>
> Using Python version 2.7.16 (default, Jan 27 2020 04:46:15)
>
> SparkSession available as 'spark'.
>
> >>> spark.sql("set spark.sql.warehouse.dir").show(truncate=False)
>
> +-----------------------+-------------------------------------------------+
> |key                    |value                                            |
> +-----------------------+-------------------------------------------------+
> |spark.sql.warehouse.dir|/Users/lixiao/Downloads/spark-2.4.6-bin-hadoop2.6|
> +-----------------------+-------------------------------------------------+
>
>
> >>> spark.sql("create table t1 (col1 int)")
>
> 20/06/03 09:56:29 WARN HiveMetaStore: Location:
> file:/user/hive/warehouse/t1 specified for non-external table:t1
>
> Traceback (most recent call last):
>
>   File "<stdin>", line 1, in <module>
>
>   File
> "/Users/lixiao/Downloads/spark-2.4.6-bin-hadoop2.6/python/pyspark/sql/session.py",
> line 767, in sql
>
> return DataFrame(self._jsparkSession.sql(sqlQuery), self._wrapped)
>
>   File
> "/Users/lixiao/Downloads/spark-2.4.6-bin-hadoop2.6/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py",
> line 1257, in __call__
>
>   File
> "/Users/lixiao/Downloads/spark-2.4.6-bin-hadoop2.6/python/pyspark/sql/utils.py",
> line 69, in deco
>
> raise AnalysisException(s.split(': ', 1)[1], stackTrace)
>
> pyspark.sql.utils.AnalysisException:
> u'org.apache.hadoop.hive.ql.metadata.HiveException:
> MetaException(message:file:/user/hive/warehouse/t1 is not a directory or
> unable to create one);'
>
> On Wed, Jun 3, 2020 at 9:18 AM, Dongjoon Hyun wrote:
>
>> +1
>>
>> Bests,
>> Dongjoon
>>
>> On Wed, Jun 3, 2020 at 5:59 AM Tom Graves 
>> wrote:
>>
>>>  +1
>>>
>>> Tom
>>>
>>> On Sunday, May 31, 2020, 06:47:09 PM CDT, Holden Karau <
>>> hol...@pigscanfly.ca> wrote:
>>>
>>>
>>> Please vote on releasing the following candidate as Apache Spark
>>> version 2.4.6.
>>>
>>> The vote is open until June 5th at 9AM PST and passes if a majority +1
>>> PMC votes are cast, with a minimum of 3 +1 votes.
>>>
>>> [ ] +1 Release this package as Apache Spark 2.4.6
>>> [ ] -1 Do not release this package because ...
>>>
>>> To learn more about Apache Spark, please see http://spark.apache.org/
>>>
>>> There are currently no issues targeting 2.4.6 (try project = SPARK AND
>>> "Target Version/s" = "2.4.6" AND status in (Open, Reopened, "In Progress"))
>>>
>>> The tag to be voted on is v2.4.6-rc8 (commit
>>> 807e0a484d1de767d1f02bd8a622da6450bdf940):
>>> https://github.com/apache/spark/tree/v2.4.6-rc8
>>>
>>> The release files, including signatures, digests, etc. can be found at:
>>> https://dist.apache.org/repos/dist/dev/spark/v2.4.6-rc8-bin/
>>>
>>> Signatures used for Spark RCs can be found in this file:
>>> https://dist.apache.org/repos/dist/dev/spark/KEYS
>>>
>>> The staging repository for this release can be found at:
>>> https://repository.apache.org/content/repositories/orgapachespark-1349/
>>>
>>> The documentation corresponding to this release can be found at:
>>> https://dist.apache.org/repos/dist/dev/spark/v2.4.6-rc8-docs/
>>>
>>> The list of bug fixes going into 2.4.6 can be found at the following URL:
>>> https://issues.apache.org/jira/projects/SPARK/versions/12346781
>>>
>>> This release is using the release script of the tag v2.4.6-rc8.
>>>
>>> FAQ
>>>
>>> =
>>> What happened to the other RCs?
>>> =
>>>
>>> The parallel maven build caused some flakiness so I wasn't comfortable
>>> releasing them. I backported the fix from the 3.0 branch for this release.
>>> I've got a proposed change to the build script so that we only push tags
>>> when once the build is a 

Re: In Apache Spark JIRA, spark/dev/github_jira_sync.py not running properly

2020-04-29 Thread Nicholas Chammas
Not sure what you mean. The native integration will auto-link from a Jira
ticket to the PRs that mention that ticket. I don't think it will update
the ticket's status, though.

Would you like me to file a ticket with Infra and see what they say?

On Tue, Apr 28, 2020 at 12:21 AM Hyukjin Kwon  wrote:

> Maybe it's time to switch. Do you know if we can still link the JIRA
> against Github?
> The script used to change the status of JIRA too but it stopped working
> for a long time - I suspect this isn't a big deal.
>
> On Sat, Apr 25, 2020 at 10:31 AM, Nicholas Chammas wrote:
>
>> Have we asked Infra recently about enabling the native Jira-GitHub
>> integration
>> <https://confluence.atlassian.com/adminjiracloud/connect-jira-cloud-to-github-814188429.html>?
>> Maybe we can deprecate the part of this script that updates Jira tickets
>> with links to the PR and rely on the native integration instead. We use it
>> at my day job, for example.
>>
>> On Fri, Apr 24, 2020 at 12:39 AM Hyukjin Kwon 
>> wrote:
>>
>>> Hi all,
>>>
>>> Seems like this github_jira_sync.py
>>> <https://github.com/apache/spark/blob/master/dev/github_jira_sync.py> script
>>> seems stopped working completely now.
>>>
>>> https://issues.apache.org/jira/browse/SPARK-31532 <>
>>> https://github.com/apache/spark/pull/28316
>>> https://issues.apache.org/jira/browse/SPARK-31529 <>
>>> https://github.com/apache/spark/pull/28315
>>> https://issues.apache.org/jira/browse/SPARK-31528 <>
>>> https://github.com/apache/spark/pull/28313
>>>
>>> Josh, would you mind taking a look please when you find some time?
>>> There is a bunch of JIRAs now, and it is very confusing which JIRA is in
>>> progress with a PR or not.
>>>
>>>
>>>> On Fri, Jul 26, 2019 at 1:20 PM, Hyukjin Kwon wrote:
>>>
>>>> Just FYI, I had to come up with a better JQL to filter out the JIRAs
>>>> that already have linked PRs.
>>>> In case it helps someone, I use this JQL now to look through the open
>>>> JIRAs:
>>>>
>>>> project = SPARK AND
>>>> status = Open AND
>>>> NOT issueFunction in linkedIssuesOfRemote("Github Pull Request *")
>>>> ORDER BY created DESC, priority DESC, updated DESC
>>>>
>>>>
>>>>
>>>>
>>>> On Fri, Jul 19, 2019 at 4:54 PM, Hyukjin Kwon wrote:
>>>>
>>>>> That's a great explanation. Thanks I didn't know that.
>>>>>
>>>>> Josh, do you know who I should ping on this?
>>>>>
>>>>> On Fri, 19 Jul 2019, 16:52 Dongjoon Hyun, 
>>>>> wrote:
>>>>>
>>>>>> Hi, Hyukjin.
>>>>>>
>>>>>> In short, there are two bots. And, the current situation happens when
>>>>>> only one bot with `dev/github_jira_sync.py` works.
>>>>>>
>>>>>> And, `dev/github_jira_sync.py` is irrelevant to the JIRA status
>>>>>> change because it only use `add_remote_link` and `add_comment` API.
>>>>>> I know only this bot (in Apache Spark repository repo)
>>>>>>
>>>>>> AFAIK, `deb/github_jira_sync.py`'s activity is done under JIRA ID
>>>>>> `githubbot` (Name: `ASF GitHub Bot`).
>>>>>> And, the other bot's activity is done under JIRA ID `apachespark`
>>>>>> (Name: `Apache Spark`).
>>>>>> The other bot is the one which Josh mentioned before. (in
>>>>>> `databricks/spark-pr-dashboard` repo).
>>>>>>
>>>>>> The root cause will be the same. The API key used by the bot is
>>>>>> rejected by Apache JIRA and forwarded to CAPCHAR.
>>>>>>
>>>>>> Bests,
>>>>>> Dongjoon.
>>>>>>
>>>>>> On Thu, Jul 18, 2019 at 8:24 PM Hyukjin Kwon 
>>>>>> wrote:
>>>>>>
>>>>>>> Hi all,
>>>>>>>
>>>>>>> Seems this issue is re-happening again. Seems the PR link is
>>>>>>> properly created in the corresponding JIRA but it doesn't change the 
>>>>>>> JIRA's
>>>>>>> status from OPEN to IN-PROGRESS.
>>>>>>>
>>>>>>> See, for instance,
>>>>>>>
>>>>>>> https://issues.apache.org/jira/browse/SPARK-28443
>>>>>>> https://issues.apache.or

Re: In Apache Spark JIRA, spark/dev/github_jira_sync.py not running properly

2020-04-24 Thread Nicholas Chammas
Have we asked Infra recently about enabling the native Jira-GitHub
integration
<https://confluence.atlassian.com/adminjiracloud/connect-jira-cloud-to-github-814188429.html>?
Maybe we can deprecate the part of this script that updates Jira tickets
with links to the PR and rely on the native integration instead. We use it
at my day job, for example.

On Fri, Apr 24, 2020 at 12:39 AM Hyukjin Kwon  wrote:

> Hi all,
>
> Seems like this github_jira_sync.py
>  script
> seems stopped working completely now.
>
> https://issues.apache.org/jira/browse/SPARK-31532 <>
> https://github.com/apache/spark/pull/28316
> https://issues.apache.org/jira/browse/SPARK-31529 <>
> https://github.com/apache/spark/pull/28315
> https://issues.apache.org/jira/browse/SPARK-31528 <>
> https://github.com/apache/spark/pull/28313
>
> Josh, would you mind taking a look please when you find some time?
> There is a bunch of JIRAs now, and it is very confusing which JIRA is in
> progress with a PR or not.
>
>
> On Fri, Jul 26, 2019 at 1:20 PM, Hyukjin Kwon wrote:
>
>> Just FYI, I had to come up with a better JQL to filter out the JIRAs that
>> already have linked PRs.
>> In case it helps someone, I use this JQL now to look through the open
>> JIRAs:
>>
>> project = SPARK AND
>> status = Open AND
>> NOT issueFunction in linkedIssuesOfRemote("Github Pull Request *")
>> ORDER BY created DESC, priority DESC, updated DESC
>>
>>
>>
>>
>> On Fri, Jul 19, 2019 at 4:54 PM, Hyukjin Kwon wrote:
>>
>>> That's a great explanation. Thanks I didn't know that.
>>>
>>> Josh, do you know who I should ping on this?
>>>
>>> On Fri, 19 Jul 2019, 16:52 Dongjoon Hyun, 
>>> wrote:
>>>
 Hi, Hyukjin.

 In short, there are two bots. And, the current situation happens when
 only one bot with `dev/github_jira_sync.py` works.

 And, `dev/github_jira_sync.py` is irrelevant to the JIRA status change
 because it only use `add_remote_link` and `add_comment` API.
 I know only this bot (in Apache Spark repository repo)

 AFAIK, `deb/github_jira_sync.py`'s activity is done under JIRA ID
 `githubbot` (Name: `ASF GitHub Bot`).
 And, the other bot's activity is done under JIRA ID `apachespark`
 (Name: `Apache Spark`).
 The other bot is the one which Josh mentioned before. (in
 `databricks/spark-pr-dashboard` repo).

 The root cause will be the same. The API key used by the bot is
 rejected by Apache JIRA and forwarded to CAPCHAR.

 Bests,
 Dongjoon.

 On Thu, Jul 18, 2019 at 8:24 PM Hyukjin Kwon 
 wrote:

> Hi all,
>
> Seems this issue is re-happening again. Seems the PR link is properly
> created in the corresponding JIRA but it doesn't change the JIRA's status
> from OPEN to IN-PROGRESS.
>
> See, for instance,
>
> https://issues.apache.org/jira/browse/SPARK-28443
> https://issues.apache.org/jira/browse/SPARK-28440
> https://issues.apache.org/jira/browse/SPARK-28436
> https://issues.apache.org/jira/browse/SPARK-28434
> https://issues.apache.org/jira/browse/SPARK-28433
> https://issues.apache.org/jira/browse/SPARK-28431
>
> Josh and Dongjoon, do you guys maybe have any idea?
>
> On Thu, Apr 25, 2019 at 3:09 PM, Hyukjin Kwon wrote:
>
>> Thank you so much Josh .. !!
>>
>> On Thu, Apr 25, 2019 at 3:04 PM, Josh Rosen wrote:
>>
>>> The code for this runs in http://spark-prs.appspot.com (see
>>> https://github.com/databricks/spark-pr-dashboard/blob/1e799c9e510fa8cdc9a6c084a777436bebeabe10/sparkprs/controllers/tasks.py#L137
>>> )
>>>
>>> I checked the AppEngine logs and it looks like we're getting error
>>> responses, possibly due to a credentials issue:
>>>
>>> Exception when starting progress on JIRA issue SPARK-27355 (
 /base/data/home/apps/s~spark-prs/live.412416057856832734/sparkprs/controllers/tasks.py:142
 )
 Traceback (most recent call last):
   File "/base/data/home/apps/s~spark-prs/live.412416057856832734/sparkprs/controllers/tasks.py", line 138, in update_pr
     start_issue_progress("%s-%s" % (app.config['JIRA_PROJECT'], issue_number))
   File 
 

Beginner PR against the Catalog API

2020-04-02 Thread Nicholas Chammas
I recently submitted my first Scala PR. It's very simple, though I don't
know if I've done things correctly since I'm not a regular Scala user.

SPARK-31000 : Add
ability to set table description in the catalog

https://github.com/apache/spark/pull/27908

Would someone be able to take a look at it and give me some feedback?

Nick


Re: Automatic PR labeling

2020-04-02 Thread Nicholas Chammas
SPARK-31330 <https://issues.apache.org/jira/browse/SPARK-31330>:
Automatically label PRs based on the paths they touch

On Wed, Apr 1, 2020 at 11:34 PM Hyukjin Kwon  wrote:

> @Nicholas Chammas  Would you be interested in
> tacking a look? I would love this to be done.
>
> On Wed, Mar 25, 2020 at 10:30 AM, Hyukjin Kwon wrote:
>
>> That should be cool. There were a bit of discussions about which account
>> should label. If we can replace it, I think it sounds great!
>>
>> On Wed, Mar 25, 2020 at 5:08 AM, Nicholas Chammas wrote:
>>
>>> Public Service Announcement: There is a GitHub action that lets you
>>> automatically label PRs based on what paths they modify.
>>>
>>> https://github.com/actions/labeler
>>>
>>> If we set this up, perhaps down the line we can update the PR dashboard
>>> and PR merge script to use the tags.
>>>
>>> cc @Dongjoon Hyun , who may be interested in
>>> this.
>>>
>>> Nick
>>>
>>


Re: [DISCUSS] filling affected versions on JIRA issue

2020-04-01 Thread Nicholas Chammas
Probably the discussion here about Improvement Jira tickets and the
"Affects Version" field:
https://github.com/apache/spark/pull/27534#issuecomment-588416416

On Wed, Apr 1, 2020 at 9:59 PM Hyukjin Kwon  wrote:

> > 2) check with older versions to fill up affects version for bug
> I don't agree with this in general. To me usually it's "For the type of
> bug, assign one valid version" instead.
>
> > The only place where I can see some amount of investigation being
> required would be for security issues or correctness issues.
> Yes, I agree.
>
> Yes, was there a particular case or context that motivated this thread?
>
> On Thu, Apr 2, 2020 at 10:24 AM, Mridul Muralidharan wrote:
>
>>
>> I agree with what Sean detailed.
>> The only place where I can see some amount of investigation being
>> required would be for security issues or correctness issues.
>> Knowing the affected versions, particularly if an earlier supported
>> version does not have the bug, will help users understand the
>> broken/insecure versions.
>>
>> Regards,
>> Mridul
>>
>>
>> On Wed, Apr 1, 2020 at 6:12 PM Sean Owen  wrote:
>>
>>> I think we discussed this briefly on a PR.
>>>
>>> It's not as clear what it means for an Improvement to 'affect a
>>> version'. Certainly, an improvement to a feature introduced in 1.2.3
>>> can't affect anything earlier, and implicitly affects everything
>>> after. It's not wrong to say it affects the latest version, at least.
>>> And I believe we require it in JIRA because we can't require an
>>> Affects Version for one type of issue but not another. So, just asking
>>> people to default to 'latest version' there is no burden.
>>>
>>> I would not ask someone to figure out all and earliest versions that
>>> an Improvement applies to; it just isn't that useful. We aren't
>>> generally going to back-port improvements anyway.
>>>
>>> Even for bugs, we don't really need to know that a bug in master
>>> affects 2.4.5, 2.4.4, 2.4.3, ... 2.3.6, 2.3.5, etc. It doesn't hurt to
>>> at least say it affects the latest 2.4.x, 2.3.x releases, if known,
>>> because it's possible it should be back-ported. Again even where this
>>> is significantly more useful, I'm not in favor of telling people they
>>> must test the bug report vs previous releases.
>>>
>>> So, if you're asserting that the current guidance is OK, I generally
>>> agree.
>>> Is there a particular context where this was questioned? maybe we
>>> should examine the particulars of that situation. As in all things,
>>> context matters.
>>>
>>> Sean
>>>
>>> On Wed, Apr 1, 2020 at 7:34 PM Jungtaek Lim
>>>  wrote:
>>> >
>>> > Hi devs,
>>> >
>>> > I know we're busy with making Spark 3.0 be out, but I think the topic
>>> is good to discuss at any time and actually be better to be resolved sooner
>>> than later.
>>> >
>>> > In the page "Contributing to Spark", we describe the guide of "affects
>>> version" as "For Bugs, assign at least one version that is known to exhibit
>>> the problem or need the change".
>>> >
>>> > For me, that sentence clearly describes minimal requirement of affects
>>> version via:
>>> >
>>> > * For the type of bug, assign one valid version
>>> > * For other types, there's no requirement
>>> >
>>> > but I'm seeing the requests more than the requirement which makes me
>>> think there might be different understanding of the sentence. Maybe there's
>>> more, but to summarize on such requests:
>>> >
>>> > 1) add affects version as same as master branch for improvement/new
>>> feature
>>> > 2) check with older versions to fill up affects version for bug
>>> >
>>> > I don't see any point on doing 1). It might give some context if we
>>> don't update the affect version (so that it can say which version was
>>> considered when filing JIRA issue) but we also update the affect version
>>> when we bump the master branch, which is no longer informational as the
>>> version should have been always the same as master branch.
>>> >
>>> > I agree it's ideal to do 2) but I think the reason the guide doesn't
>>> enforce is that it requires pretty much efforts to check with old versions
>>> (sometimes even more than origin work).
>>> >
>>> > Suppose the happy case we have UT to verify the bugfix which fails
>>> without the patch and passes with the patch. To check with older versions
>>> we have to checkout the tag, and apply the UT, and "rebuild", and run UT to
>>> verify which is pretty much time-consuming. What if there's a conflict
>>> indeed? That's still a happy case, and in worse case (there's no such UT)
>>> we should do E2E manual verification which I would give up.
>>> >
>>> > There should have some balance/threshold, and the balance should be
>>> the thing the community has a consensus.
>>> >
>>> > Would like to hear everyone's voice on this.
>>> >
>>> > Thanks,
>>> > Jungtaek Lim (HeartSaVioR)
>>>
>>> -
>>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>>
>>>


Re: Release Manager's official `branch-3.0` Assessment?

2020-03-28 Thread Nicholas Chammas
I don't have a dog in this race, but: Would it be OK to ship 3.0 with some
release notes and/or prominent documentation calling out this issue, and
then fixing it in 3.0.1?

On Sat, Mar 28, 2020 at 8:45 PM Jungtaek Lim 
wrote:

> I'd say SPARK-31257 as open blocker, because the change in upcoming Spark
> 3.0 made the create table be ambiguous, and once it's shipped it will be
> harder to correct again.
>
> On Sun, Mar 29, 2020 at 4:53 AM Reynold Xin  wrote:
>
>> Let's start cutting RC next week.
>>
>>
>> On Sat, Mar 28, 2020 at 11:51 AM, Sean Owen  wrote:
>>
>>> I'm also curious - there no open blockers for 3.0 but I know a few are
>>> still floating around open to revert changes. What is the status there?
>>> From my field of view I'm not aware of other blocking issues.
>>>
>>> On Fri, Mar 27, 2020 at 10:56 PM Jungtaek Lim <
>>> kabhwan.opensou...@gmail.com> wrote:
>>>
 Now the end of March is just around the corner. I'm not qualified to
 say (and honestly don't know) where we are, but if we were intended to be
 in blocker mode it doesn't seem to work; lots of developments still happen,
 and priority/urgency doesn't seem to be applied to the sequence of
 reviewing.

 How about listing (or linking to epic, or labelling) JIRA issues/PRs
 which are blockers (either from priority or technically) for Spark 3.0
 release, and make clear we should try to review these blockers
 first? Github PR label may help here to filter out other PRs and
 concentrate these things.

 Thanks,
 Jungtaek Lim (HeartSaVioR)


 On Wed, Mar 25, 2020 at 1:52 PM Xiao Li  wrote:

> Let us try to finish the remaining major blockers in the next few
> days. For example, https://issues.apache.org/jira/browse/SPARK-31085
>
> +1 to cut the RC even if we still have the blockers that will fail the
> RCs.
>
> Cheers,
>
> Xiao
>
>
> On Tue, Mar 24, 2020 at 6:56 PM Dongjoon Hyun 
> wrote:
>
>> +1
>>
>> Thanks,
>> Dongjoon.
>>
>> On Tue, Mar 24, 2020 at 14:49 Reynold Xin 
>> wrote:
>>
>>> I actually think we should start cutting RCs. We can cut RCs even
>>> with blockers.
>>>
>>>
>>> On Tue, Mar 24, 2020 at 12:51 PM, Dongjoon Hyun <
>>> dongjoon.h...@gmail.com> wrote:
>>>
 Hi, All.

 First of all, always "Community Over Code"!
 I wish you the best health and happiness.

 As we know, we are still working on QA period, we didn't reach RC
 stage. It seems that we need to make website up-to-date once more.

 https://spark.apache.org/versioning-policy.html

 If possible, it would be really great if we can get `3.0.0` release
 manager's official `branch-3.0` assessment because we have only 1 week
 before the end of March.

 Could you, the 3.0.0 release manager, share your thoughts and update
 the website, please?

 Bests
 Dongjoon.

>>>
>>>
>
> --
> 
>

>>


Automatic PR labeling

2020-03-24 Thread Nicholas Chammas
Public Service Announcement: There is a GitHub action that lets you
automatically label PRs based on what paths they modify.

https://github.com/actions/labeler

If we set this up, perhaps down the line we can update the PR dashboard and
PR merge script to use the tags.

cc @Dongjoon Hyun, who may be interested in this.

Nick


Re: [DISCUSS] Resolve ambiguous parser rule between two "create table"s

2020-03-20 Thread Nicholas Chammas
On Thu, Mar 19, 2020 at 3:46 AM Wenchen Fan  wrote:

> 2. PARTITIONED BY colTypeList: I think we can support it in the unified
> syntax. Just make sure it doesn't appear together with PARTITIONED BY
> transformList.
>

Another side note: Perhaps as part of (or after) unifying the CREATE TABLE
syntax, we can also update Catalog.createTable() to support creating
partitioned tables.
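
(For reference, a quick sketch of what the current Python Catalog.createTable
call looks like; there is no partition argument today, which is the gap this
side note refers to. Table name and schema below are illustrative.)

```
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, IntegerType, StringType

spark = SparkSession.builder.getOrCreate()

schema = StructType([
    StructField("id", IntegerType(), True),
    StructField("ds", StringType(), True),
])

# Today there is no way to pass partition columns here; the suggestion above
# is to extend this API so the table can be created partitioned by "ds".
spark.catalog.createTable(
    "example_table",
    source="parquet",
    schema=schema,
)
```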


Re: [DISCUSS] Resolve ambiguous parser rule between two "create table"s

2020-03-18 Thread Nicholas Chammas
Side comment: The current docs for CREATE TABLE add to the confusion by
describing the Hive-compatible command as "CREATE TABLE USING HIVE FORMAT",
but neither "USING" nor "HIVE FORMAT" are actually part of the syntax.
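
For concreteness, a minimal sketch of the two CREATE TABLE forms this thread
keeps contrasting (table names are illustrative, and the second form assumes
a Hive-enabled session):

```
from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# Spark-native syntax: PARTITIONED BY references columns already declared in
# the column list, with no types.
spark.sql("""
    CREATE TABLE demo_native (id INT, ds STRING)
    USING parquet
    PARTITIONED BY (ds)
""")

# Hive-compatible syntax: PARTITIONED BY introduces new partition columns and
# therefore requires types.
spark.sql("""
    CREATE TABLE demo_hive (id INT)
    PARTITIONED BY (ds STRING)
    STORED AS parquet
""")
```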

On Wed, Mar 18, 2020 at 8:31 PM Ryan Blue  wrote:

> Jungtaek, it sounds like you consider the two rules to be separate
> syntaxes with their own consistency rules. For example, if I am using the
> Hive syntax rule, then the PARTITIONED BY clause adds new (partition)
> columns and requires types for those columns; if I’m using the Spark syntax
> rule with USING then PARTITIONED BY must reference existing columns and
> cannot include types.
>
> I agree that this is confusing to users! We should fix it, but I don’t
> think the right solution is to continue to have two rules with divergent
> syntax.
>
> This is confusing to users because they don’t know anything about separate
> parser rules. All the user sees is that sometimes PARTITION BY requires
> types and sometimes it doesn’t. Yes, we could add a keyword, HIVE, to
> signal that the syntax is borrowed from Hive for that case, but that
> actually breaks queries that run in Hive.
>
> I think the right solution is to unify the two syntaxes. I don’t think
> they are so different that it isn’t possible. Here are the differences I
> see:
>
>- Only in Hive:
>   - EXTERNAL
>   - skewSpec: SKEWED BY ...
>   - rowFormat: ROW FORMAT DELIMITED ..., ROW FORMAT SERDE ...
>   - createFileFormat: STORED AS ...
>- Only in Spark:
>   - OPTIONS property list
>- Different syntax/interpretation:
>   - PARTITIONED BY transformList / PARTITIONED BY colTypeList
>
> For the clauses that are supported in one but not the other, we can add
> them to a unified rule as optional clauses. The AST builder would then
> validate what makes sense or not (e.g., stored as with using or row format
> delimited) and finally pass the remaining data on using the
> CreateTableStatement. That statement would be handled like we do for the
> Spark rule today, but with extra metadata to pass along. This is also a
> step toward being able to put Hive behind the DSv2 API because we’d be able
> to pass all of the Hive metadata clauses to the v2 catalog.
>
> The only difficult part is handling PARTITIONED BY. But in that case, we
> can use two different syntaxes from the same CREATE TABLE rule. If types
> are included, we use the Hive PARTITIONED BY syntax and convert in the
> AST builder to normalize to a single representation.
>
> What do you both think? This would make the behavior more clear and take a
> step toward getting rid of Hive-specific code.
>
> On Wed, Mar 18, 2020 at 4:45 PM Jungtaek Lim 
> wrote:
>
>> I'm trying to understand the reason you have been suggesting to keep the
>> real thing unchanged but change doc instead. Could you please elaborate
>> why? End users would blame us when they hit the case their query doesn't
>> work as intended (1) and found the fact it's undocumented (2) and hard to
>> understand even from the Spark codebase (3).
>>
>> For me, addressing the root issue adopting your suggestion would be
>> "dropping the rule 2" and only supporting it with legacy config on. We
>> would say to end users, you need to enable the legacy config to leverage
>> Hive create table syntax, or just use beeline with Hive connected.
>>
>> But since we are even thinking about native syntax as a first class and
>> dropping Hive one implicitly (hide in doc) or explicitly, does it really
>> matter we require a marker (like "HIVE") in rule 2 and isolate it? It would
>> have even less confusion than Spark 2.x, since we will require end users to
>> fill the marker regarding Hive when creating Hive table, easier to classify
>> than "USING provider".
>>
>> If we think native syntax would cover many cases end users have been
>> creating Hive table in Spark (say, USING hive would simply work for them),
>> I'm OK to drop the rule 2 and lead end users to enable the legacy config if
>> really needed. If not, let's continue "fixing" the issue.
>>
>> (Another valid approach would be consolidating two rules into one, and
>> defining support of parameters per provider, e.g. EXTERNAL, STORED AS, ROW
>> FORMAT, etc. are only supported in Hive provider.)
>>
>>
>> On Wed, Mar 18, 2020 at 8:47 PM Wenchen Fan  wrote:
>>
>>> The fact that we have 2 CREATE TABLE syntax is already confusing many
>>> users. Shall we only document the native syntax? Then users don't need to
>>> worry about which rule their query fits and they don't need to spend a lot
>>> of time understanding the subtle difference between these 2 syntaxes.
>>>
>>> On Wed, Mar 18, 2020 at 7:01 PM Jungtaek Lim <
>>> 

Re-triggering failed GitHub workflows

2020-03-16 Thread Nicholas Chammas
Is there any way contributors can retrigger a failed GitHub workflow, like
we do with Jenkins? There's supposed to be a "Re-run all checks" button,
but I don't see it.

Do we need INFRA to grant permissions for that, perhaps?

Right now I'm doing it by adding empty commits:

```
git commit --allow-empty -m "re-trigger GitHub tests"
```

Nick


Re: Running Spark through a debugger

2020-03-12 Thread Nicholas Chammas
Finally revisiting this issue. I can now build Spark through IntelliJ (I
had to delete `.idea/` and reload the project, making sure to import the
Maven config), but am struggling to get the breakpoint/debugging to work.

I setup IntelliJ per the instructions under Debug Spark Remotely
<http://spark.apache.org/developer-tools.html>, but the debugger never
seems to connect to the JVM. If I switch the debugger mode from "Listen to
remote JVM" (which is recommended by the guide) to "Attach to remote JVM",
I can get the debugger to talk to the JVM, but my breakpoint still doesn't
fire.

So to elaborate, I have a) a breakpoint set inside CatalogSuite.scala, and
b) the IntelliJ debugger running in "Listen" mode. Then I run this:

```
$ ./build/sbt
> set javaOptions in Test +=
"-agentlib:jdwp=transport=dt_socket,server=n,address=192.168.1.146:5005,suspend=y,onthrow=,onuncaught="
> testOnly org.apache.spark.sql.internal.CatalogSuite
```

(I copied those Java options from the IntelliJ debug config window.)

But the breakpoint doesn't fire. Any ideas as to what I'm missing?

On Tue, Dec 17, 2019 at 12:36 AM Sean Owen  wrote:

> I just make a new test suite or something, set breakpoints, and
> execute it in IJ. That generally works fine. You may need to set the
> run configuration to have the right working dir (Spark project root),
> and set the right system property to say 'this is running in a test'
> in some cases. What are you having trouble with, does it build?
>
> On Mon, Dec 16, 2019 at 11:27 PM Nicholas Chammas
>  wrote:
> >
> > I normally stick to the Python parts of Spark, but I am interested in
> walking through the DSv2 code and understanding how it works. I tried
> following the "IDE Setup" section of the developer tools page, but quickly
> hit several problems loading the project.
> >
> > Can anyone point to a tutorial or guide on running Spark through a
> debugger, for folks like me who are not regular IntelliJ or Scala users?
> >
> > My goal is just to be able to run something like
> `spark.read.text(...).show()` and follow it step by step through the
> codebase.
> >
> > Nick
> >
>


Re: Auto-linking from PRs to Jira tickets

2020-03-10 Thread Nicholas Chammas
Could you point us to the ticket? I'd like to follow along.

On Tue, Mar 10, 2020 at 9:13 AM Alex Ott  wrote:

> For Zeppelin I've created recently the ASF INFRA Jira for that feature...
> Although maybe it should be done for all projects.
>
> Nicholas Chammas  at "Mon, 9 Mar 2020 15:27:30 -0400" wrote:
>  NC> https://github.blog/2019-10-14-introducing-autolink-references/
>
>  NC> GitHub has a feature for auto-linking from PRs to external tickets.
> It's only available for their paid plans, but perhaps Apache has some
>  NC> arrangement with them where we can get that feature.
>
>  NC> Since we include Jira ticket numbers in every PR title, it would be
> great if each PR auto-linked back to the relevant Jira tickets. (We
>  NC> already have auto-linking from Jira to PRs.)
>
>  NC> Has someone looked into this already, or should I file a ticket with
> INFRA and see what they say?
>
>  NC> Nick
>
>
>
> --
> With best wishes,Alex Ott
> http://alexott.net/
> Twitter: alexott_en (English), alexott (Russian)
>


Re: Auto-linking from PRs to Jira tickets

2020-03-09 Thread Nicholas Chammas
Right, what I'm talking about is linking in the other direction, from
GitHub to Jira.

i.e. you can type "SPARK-1234" in plain text on a PR, and GitHub will
automatically turn it into a link to the appropriate ticket on Jira.

On Mon, Mar 9, 2020 at 8:21 PM Holden Karau  wrote:

>
>
> On Mon, Mar 9, 2020 at 2:14 PM Nicholas Chammas <
> nicholas.cham...@gmail.com> wrote:
>
>> This is a feature of GitHub itself and would auto-link directly from the
>> PR back to Jira.
>>
>> I haven't looked at the PR dashboard in a while, but I believe you're
>> referencing a feature of the dashboard <https://spark-prs.appspot.com> that
>> people won't get unless they look at the dashboard itself.
>>
>> What GitHub is offering is an ability to auto-link any mention of a Jira
>> ticket anywhere in a PR discussion (and hopefully also in the PR title,
>> though I'm not sure) directly back to Jira.
>>
> so the dashboard has a bot which would update the JIRA tickets based on
> the PRs. It might be broken though.
>
>>
>> I suppose if you're in the habit of using the dashboard regularly it
>> won't make a big difference. I typically land on a PR via a notification in
>> GitHub or via email. If I want to lookup the referenced Jira ticket, I have
>> to copy it from the PR title and navigate to issues.apache.org and paste
>> it in.
>>
>> On Mon, Mar 9, 2020 at 4:46 PM Holden Karau  wrote:
>>
>>> I think we used to do this with the same bot that runs the PR dashboard,
>>> is it no longer working?
>>>
>>> On Mon, Mar 9, 2020 at 12:28 PM Nicholas Chammas <
>>> nicholas.cham...@gmail.com> wrote:
>>>
>>>> https://github.blog/2019-10-14-introducing-autolink-references/
>>>>
>>>> GitHub has a feature for auto-linking from PRs to external tickets.
>>>> It's only available for their paid plans, but perhaps Apache has some
>>>> arrangement with them where we can get that feature.
>>>>
>>>> Since we include Jira ticket numbers in every PR title, it would be
>>>> great if each PR auto-linked back to the relevant Jira tickets. (We already
>>>> have auto-linking from Jira to PRs.)
>>>>
>>>> Has someone looked into this already, or should I file a ticket with
>>>> INFRA and see what they say?
>>>>
>>>> Nick
>>>>
>>>> --
>>> Twitter: https://twitter.com/holdenkarau
>>> Books (Learning Spark, High Performance Spark, etc.):
>>> https://amzn.to/2MaRAG9  <https://amzn.to/2MaRAG9>
>>> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>>>
>> --
> Twitter: https://twitter.com/holdenkarau
> Books (Learning Spark, High Performance Spark, etc.):
> https://amzn.to/2MaRAG9  <https://amzn.to/2MaRAG9>
> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>


Re: Auto-linking from PRs to Jira tickets

2020-03-09 Thread Nicholas Chammas
This is a feature of GitHub itself and would auto-link directly from the PR
back to Jira.

I haven't looked at the PR dashboard in a while, but I believe you're
referencing a feature of the dashboard <https://spark-prs.appspot.com> that
people won't get unless they look at the dashboard itself.

What GitHub is offering is an ability to auto-link any mention of a Jira
ticket anywhere in a PR discussion (and hopefully also in the PR title,
though I'm not sure) directly back to Jira.

I suppose if you're in the habit of using the dashboard regularly it won't
make a big difference. I typically land on a PR via a notification in
GitHub or via email. If I want to lookup the referenced Jira ticket, I have
to copy it from the PR title and navigate to issues.apache.org and paste it
in.

On Mon, Mar 9, 2020 at 4:46 PM Holden Karau  wrote:

> I think we used to do this with the same bot that runs the PR dashboard,
> is it no longer working?
>
> On Mon, Mar 9, 2020 at 12:28 PM Nicholas Chammas <
> nicholas.cham...@gmail.com> wrote:
>
>> https://github.blog/2019-10-14-introducing-autolink-references/
>>
>> GitHub has a feature for auto-linking from PRs to external tickets. It's
>> only available for their paid plans, but perhaps Apache has some
>> arrangement with them where we can get that feature.
>>
>> Since we include Jira ticket numbers in every PR title, it would be great
>> if each PR auto-linked back to the relevant Jira tickets. (We already have
>> auto-linking from Jira to PRs.)
>>
>> Has someone looked into this already, or should I file a ticket with
>> INFRA and see what they say?
>>
>> Nick
>>
>> --
> Twitter: https://twitter.com/holdenkarau
> Books (Learning Spark, High Performance Spark, etc.):
> https://amzn.to/2MaRAG9  <https://amzn.to/2MaRAG9>
> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>


Auto-linking from PRs to Jira tickets

2020-03-09 Thread Nicholas Chammas
https://github.blog/2019-10-14-introducing-autolink-references/

GitHub has a feature for auto-linking from PRs to external tickets. It's
only available for their paid plans, but perhaps Apache has some
arrangement with them where we can get that feature.

Since we include Jira ticket numbers in every PR title, it would be great
if each PR auto-linked back to the relevant Jira tickets. (We already have
auto-linking from Jira to PRs.)

Has someone looked into this already, or should I file a ticket with INFRA
and see what they say?

Nick


Re: [DISCUSSION] Avoiding duplicate work

2020-02-21 Thread Nicholas Chammas
+1 to what Sean said.

On Fri, Feb 21, 2020 at 10:14 AM Sean Owen  wrote:

> We've avoided using Assignee because it implies that someone 'owns'
> resolving the issue, when we want to keep it collaborative, and many
> times in the past someone would ask to be assigned and then didn't
> follow through.
>
> You can comment on the JIRA to say "I'm working on this" but that has
> the same problem. Frequently people see that and don't work on it, and
> then the original person doesn't follow through either.
>
> The best practice is probably to write down your analysis of the
> problem and solution so far in a comment. That helps everyone and
> doesn't suggest others shouldn't work on it; we want them to, we want
> them to work together. That also shows some commitment to working on
> it.
>
>
> On Fri, Feb 21, 2020 at 9:11 AM younggyu Chun 
> wrote:
> >
> > what if both are looking at code and they don't make a merge request? I
> guess we can't still see what's going on because that Jira ticket won't
> show the linked PR.
> >
> > On Fri, 21 Feb 2020 at 09:58, Wenchen Fan  wrote:
> >>
> >> The JIRA ticket will show the linked PR if there are any, which
> indicates that someone is working on it if the PR is active. Maybe the bot
> should also leave a comment on the JIRA ticket to make it clearer?
> >>
> >> On Fri, Feb 21, 2020 at 10:54 PM younggyu Chun <
> younggyuchu...@gmail.com> wrote:
> >>>
> >>> Hi All,
> >>>
> >>> I would like to suggest to use "Assignee" functionality in the JIRA
> when we are working on a project. When we pick a ticket to work on we don't
> know who is doing that right now.
> >>>
> >>> Recently I spent my time to solve an issue and made a merge request
> but this was actually a duplicate work. The ticket I was working on doesn't
> have any clues that somebody was working.
> >>>
> >>> are there ways to avoid duplicate work that I don't know yet?
> >>>
> >>> Thank you,
> >>> Younggyu
> >>>
>
>
>


Re: More publicly documenting the options under spark.sql.*

2020-01-27 Thread Nicholas Chammas
I am! Thanks for the reference.

On Thu, Jan 16, 2020 at 9:53 PM Hyukjin Kwon  wrote:

> Nicholas, are you interested in taking a stab at this? You could refer
> https://github.com/apache/spark/commit/60472dbfd97acfd6c4420a13f9b32bc9d84219f3
>
> On Fri, Jan 17, 2020 at 8:48 AM Takeshi Yamamuro wrote:
>
>> The idea looks nice. I think web documents always help end users.
>>
>> Bests,
>> Takeshi
>>
>> On Fri, Jan 17, 2020 at 4:04 AM Shixiong(Ryan) Zhu <
>> shixi...@databricks.com> wrote:
>>
>>> "spark.sql("set -v")" returns a Dataset that has all non-internal SQL
>>> configurations. Should be pretty easy to automatically generate a SQL
>>> configuration page.
>>>
>>> Best Regards,
>>> Ryan
>>>
>>>
>>> On Wed, Jan 15, 2020 at 5:47 AM Hyukjin Kwon 
>>> wrote:
>>>
>>>> I think automatically creating a configuration page isn't a bad idea
>>>> because I think we deprecate and remove configurations which are not
>>>> created via .internal() in SQLConf anyway.
>>>>
>>>> I already tried this automatic generation from the codes at SQL
>>>> built-in functions and I'm pretty sure we can do the similar thing for
>>>> configurations as well.
>>>>
>>>> We could perhaps mimic what hadoop does
>>>> https://hadoop.apache.org/docs/r2.8.0/hadoop-project-dist/hadoop-common/core-default.xml
>>>>
>>>> On Wed, 15 Jan 2020, 10:46 Sean Owen,  wrote:
>>>>
>>>>> Some of it is intentionally undocumented, as far as I know, as an
>>>>> experimental option that may change, or legacy, or safety valve flag.
>>>>> Certainly anything that's marked an internal conf. (That does raise
>>>>> the question of who it's for, if you have to read source to find it.)
>>>>>
>>>>> I don't know if we need to overhaul the conf system, but there may
>>>>> indeed be some confs that could legitimately be documented. I don't
>>>>> know which.
>>>>>
>>>>> On Tue, Jan 14, 2020 at 7:32 PM Nicholas Chammas
>>>>>  wrote:
>>>>> >
>>>>> > I filed SPARK-30510 thinking that we had forgotten to document an
>>>>> option, but it turns out that there's a whole bunch of stuff under
>>>>> SQLConf.scala that has no public documentation under
>>>>> http://spark.apache.org/docs.
>>>>> >
>>>>> > Would it be appropriate to somehow automatically generate a
>>>>> documentation page from SQLConf.scala, as Hyukjin suggested on that 
>>>>> ticket?
>>>>> >
>>>>> > Another thought that comes to mind is moving the config definitions
>>>>> out of Scala and into a data format like YAML or JSON, and then sourcing
>>>>> that both for SQLConf as well as for whatever documentation page we want 
>>>>> to
>>>>> generate. What do you think of that idea?
>>>>> >
>>>>> > Nick
>>>>> >
>>>>>
>>>>>
>>>>>
>>
>> --
>> ---
>> Takeshi Yamamuro
>>
>


Re: Closing stale PRs with a GitHub Action

2020-01-27 Thread Nicholas Chammas
A brief update here: At the start of December when I started this thread we
had almost 500 open PRs. Now that the Stale workflow has had time to catch
up, we're down to ~280 open PRs.

More impressive than the number of stale PRs that got closed
<https://github.com/apache/spark/pulls?q=is%3Apr+label%3AStale+is%3Aclosed>
is how many PRs are active with relatively recent activity. It's a
testament to how active this project is.

On Sun, Dec 15, 2019 at 11:16 AM Nicholas Chammas <
nicholas.cham...@gmail.com> wrote:

> Just an FYI to everyone, we’ve merged in an Action to close stale PRs:
> https://github.com/apache/spark/pull/26877
>
> On Sun, Dec 8, 2019 at 9:49 AM Hyukjin Kwon wrote:
>
>> It doesn't need to exactly follow the conditions I used before as long as
>> Github Actions can provide other good options or conditions.
>> I just wanted to make sure the condition is reasonable.
>>
>> On Sat, Dec 7, 2019 at 11:23 AM Hyukjin Kwon wrote:
>>
>>> lol how did you know I'm going to read this email Sean?
>>>
>>> When I manually identified the stale PRs, I used this conditions below:
>>>
>>> 1. Author's inactivity over a year. If the PRs were simply waiting for a
>>> review, I excluded it from stale PR list.
>>> 2. Ping one time and see if there are any updates within 3 days.
>>> 3. If it meets both conditions above, they were considered as stale PRs.
>>>
>>> Yeah, I agree with it. But I think the conditions of stale PRs matter.
>>> What kind of conditions and actions the Github Actions support, and
>>> which of them do you plan to add?
>>>
>>> I didn't like to close and automate the stale PRs but I think it's time
>>> to consider. But I think the conditions have to be pretty reasonable
>>> so that we give a proper reason to the author and/or don't happen to
>>> close some good and worthy PRs.
>>>
>>>
>>> On Sat, Dec 7, 2019 at 3:23 AM Sean Owen wrote:
>>>
>>>> We used to not be able to close PRs directly, but now we can, so I
>>>> assume this is as fine a way of doing so, if we want to. I don't think
>>>> there's a policy against it or anything.
>>>> Hyukjin how have you managed this one in the past?
>>>> I don't mind it being automated if the idle time is long and it posts
>>>> some friendly message about reopening if there is a material change in the
>>>> proposed PR, the problem, or interest in merging it.
>>>>
>>>> On Fri, Dec 6, 2019 at 11:20 AM Nicholas Chammas <
>>>> nicholas.cham...@gmail.com> wrote:
>>>>
>>>>> That's true, we do use Actions today. I wonder if Apache Infra allows
>>>>> Actions to close PRs vs. just updating commit statuses. I only ask because
>>>>> I remember permissions were an issue in the past when discussing tooling
>>>>> like this.
>>>>>
>>>>> In any case, I'd be happy to submit a PR adding this in if there are
>>>>> no concerns. We can hash out the details on the PR.
>>>>>
>>>>> On Fri, Dec 6, 2019 at 11:08 AM Sean Owen  wrote:
>>>>>
>>>>>> I think we can add Actions, right? they're used for the newer tests
>>>>>> in Github?
>>>>>> I'm OK closing PRs inactive for a 'long time', where that's maybe
>>>>>> 6-12 months or something. It's standard practice and doesn't mean it 
>>>>>> can't
>>>>>> be reopened.
>>>>>> Often the related JIRA should be closed as well but we have done that
>>>>>> separately with bulk-close in the past.
>>>>>>
>>>>>> On Thu, Dec 5, 2019 at 3:24 PM Nicholas Chammas <
>>>>>> nicholas.cham...@gmail.com> wrote:
>>>>>>
>>>>>>> It’s that topic again. 
>>>>>>>
>>>>>>> We have almost 500 open PRs. A good chunk of them are more than a
>>>>>>> year old. The oldest open PR dates to summer 2015.
>>>>>>>
>>>>>>>
>>>>>>> https://github.com/apache/spark/pulls?q=is%3Apr+is%3Aopen+sort%3Acreated-asc
>>>>>>>
>>>>>>> GitHub has an Action for closing stale PRs.
>>>>>>>
>>>>>>> https://github.com/marketplace/actions/close-stale-issues
>>>>>>>
>>>>>>> What do folks think about deploying it? Does Apache Infra give us
>>>>>>> the ability to even deploy a tool like this?
>>>>>>>
>>>>>>> Nick
>>>>>>>
>>>>>>


Re: More publicly documenting the options under spark.sql.*

2020-01-15 Thread Nicholas Chammas
So do we want to repurpose
SPARK-30510 as an SQL config refactor?

Alternatively, what’s the smallest step forward I can take to publicly
document partitionOverwriteMode (which was my impetus for looking into this
in the first place)?
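
(For anyone following along, partitionOverwriteMode is the
spark.sql.sources.partitionOverwriteMode setting that controls dynamic
partition overwrite on insert. A minimal usage sketch, with an illustrative
path and partition column:)

```
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# "dynamic" replaces only the partitions present in the incoming data;
# the default ("static") first truncates every partition matching the write.
spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")

df = spark.createDataFrame([(1, "2020-01-14"), (2, "2020-01-15")], ["id", "ds"])
(df.write
   .mode("overwrite")
   .partitionBy("ds")
   .parquet("/tmp/example_table"))
```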

On Wed, Jan 15, 2020 at 8:49 AM Hyukjin Kwon wrote:

> Resending to the dev list for archive purpose:
>
> I think automatically creating a configuration page isn't a bad idea
> because I think we deprecate and remove configurations which are not
> created via .internal() in SQLConf anyway.
>
> I already tried this automatic generation from the codes at SQL built-in
> functions and I'm pretty sure we can do the similar thing for
> configurations as well.
>
> We could perhaps mimic what hadoop does
> https://hadoop.apache.org/docs/r2.8.0/hadoop-project-dist/hadoop-common/core-default.xml
>
> On Wed, 15 Jan 2020, 22:46 Hyukjin Kwon,  wrote:
>
>> I think automatically creating a configuration page isn't a bad idea
>> because I think we deprecate and remove configurations which are not
>> created via .internal() in SQLConf anyway.
>>
>> I already tried this automatic generation from the codes at SQL built-in
>> functions and I'm pretty sure we can do the similar thing for
>> configurations as well.
>>
>> We could perhaps mimic what hadoop does
>> https://hadoop.apache.org/docs/r2.8.0/hadoop-project-dist/hadoop-common/core-default.xml
>>
>> On Wed, 15 Jan 2020, 10:46 Sean Owen,  wrote:
>>
>>> Some of it is intentionally undocumented, as far as I know, as an
>>> experimental option that may change, or legacy, or safety valve flag.
>>> Certainly anything that's marked an internal conf. (That does raise
>>> the question of who it's for, if you have to read source to find it.)
>>>
>>> I don't know if we need to overhaul the conf system, but there may
>>> indeed be some confs that could legitimately be documented. I don't
>>> know which.
>>>
>>> On Tue, Jan 14, 2020 at 7:32 PM Nicholas Chammas
>>>  wrote:
>>> >
>>> > I filed SPARK-30510 thinking that we had forgotten to document an
>>> option, but it turns out that there's a whole bunch of stuff under
>>> SQLConf.scala that has no public documentation under
>>> http://spark.apache.org/docs.
>>> >
>>> > Would it be appropriate to somehow automatically generate a
>>> documentation page from SQLConf.scala, as Hyukjin suggested on that ticket?
>>> >
>>> > Another thought that comes to mind is moving the config definitions
>>> out of Scala and into a data format like YAML or JSON, and then sourcing
>>> that both for SQLConf as well as for whatever documentation page we want to
>>> generate. What do you think of that idea?
>>> >
>>> > Nick
>>> >
>>>
>>>
>>>


More publicly documenting the options under spark.sql.*

2020-01-14 Thread Nicholas Chammas
I filed SPARK-30510 thinking that we had forgotten to document an option, but
it turns out that there's a whole bunch of stuff under SQLConf.scala that has
no public documentation under http://spark.apache.org/docs.

Would it be appropriate to somehow automatically generate a documentation
page from SQLConf.scala, as Hyukjin suggested on that ticket?

Another thought that comes to mind is moving the config definitions out of
Scala and into a data format like YAML or JSON, and then sourcing that both
for SQLConf as well as for whatever documentation page we want to generate.
What do you think of that idea?
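
As a rough illustration of the auto-generation idea, here is a minimal sketch
that dumps the non-internal SQL configs into a Markdown table at runtime via
SET -v (rather than parsing SQLConf.scala). The output path is arbitrary, and
the column names (key, value, meaning) are what recent Spark versions return,
but treat them as an assumption:

```
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# "SET -v" lists the non-internal SQL configs with their defaults and docs.
configs = spark.sql("SET -v").collect()

with open("/tmp/sql-configuration.md", "w") as f:
    f.write("| Property Name | Default | Meaning |\n")
    f.write("|---|---|---|\n")
    for row in sorted(configs, key=lambda r: r["key"]):
        meaning = (row["meaning"] or "").replace("\n", " ")
        f.write("| {} | {} | {} |\n".format(row["key"], row["value"], meaning))
```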

Nick


Running Spark through a debugger

2019-12-16 Thread Nicholas Chammas
I normally stick to the Python parts of Spark, but I am interested in
walking through the DSv2 code and understanding how it works. I tried
following the "IDE Setup" section of the developer tools
 page, but quickly hit
several problems loading the project.

Can anyone point to a tutorial or guide on running Spark through a
debugger, for folks like me who are not regular IntelliJ or Scala users?

My goal is just to be able to run something like
`spark.read.text(...).show()` and follow it step by step through the
codebase.

Nick


Re: Closing stale PRs with a GitHub Action

2019-12-15 Thread Nicholas Chammas
Just an FYI to everyone, we’ve merged in an Action to close stale PRs:
https://github.com/apache/spark/pull/26877

On Sun, Dec 8, 2019 at 9:49 AM Hyukjin Kwon wrote:

> It doesn't need to exactly follow the conditions I used before as long as
> Github Actions can provide other good options or conditions.
> I just wanted to make sure the condition is reasonable.
>
> On Sat, Dec 7, 2019 at 11:23 AM Hyukjin Kwon wrote:
>
>> lol how did you know I'm going to read this email Sean?
>>
>> When I manually identified the stale PRs, I used this conditions below:
>>
>> 1. Author's inactivity over a year. If the PRs were simply waiting for a
>> review, I excluded it from stale PR list.
>> 2. Ping one time and see if there are any updates within 3 days.
>> 3. If it meets both conditions above, they were considered as stale PRs.
>>
>> Yeah, I agree with it. But I think the conditions of stale PRs matter.
>> What kind of conditions and actions the Github Actions support, and which
>> of them do you plan to add?
>>
>> I didn't like to close and automate the stale PRs but I think it's time
>> to consider. But I think the conditions have to be pretty reasonable
>> so that we give a proper reason to the author and/or don't happen to
>> close some good and worthy PRs.
>>
>>
>> On Sat, Dec 7, 2019 at 3:23 AM Sean Owen wrote:
>>
>>> We used to not be able to close PRs directly, but now we can, so I
>>> assume this is as fine a way of doing so, if we want to. I don't think
>>> there's a policy against it or anything.
>>> Hyukjin how have you managed this one in the past?
>>> I don't mind it being automated if the idle time is long and it posts
>>> some friendly message about reopening if there is a material change in the
>>> proposed PR, the problem, or interest in merging it.
>>>
>>> On Fri, Dec 6, 2019 at 11:20 AM Nicholas Chammas <
>>> nicholas.cham...@gmail.com> wrote:
>>>
>>>> That's true, we do use Actions today. I wonder if Apache Infra allows
>>>> Actions to close PRs vs. just updating commit statuses. I only ask because
>>>> I remember permissions were an issue in the past when discussing tooling
>>>> like this.
>>>>
>>>> In any case, I'd be happy to submit a PR adding this in if there are no
>>>> concerns. We can hash out the details on the PR.
>>>>
>>>> On Fri, Dec 6, 2019 at 11:08 AM Sean Owen  wrote:
>>>>
>>>>> I think we can add Actions, right? they're used for the newer tests in
>>>>> Github?
>>>>> I'm OK closing PRs inactive for a 'long time', where that's maybe 6-12
>>>>> months or something. It's standard practice and doesn't mean it can't be
>>>>> reopened.
>>>>> Often the related JIRA should be closed as well but we have done that
>>>>> separately with bulk-close in the past.
>>>>>
>>>>> On Thu, Dec 5, 2019 at 3:24 PM Nicholas Chammas <
>>>>> nicholas.cham...@gmail.com> wrote:
>>>>>
>>>>>> It’s that topic again. 
>>>>>>
>>>>>> We have almost 500 open PRs. A good chunk of them are more than a
>>>>>> year old. The oldest open PR dates to summer 2015.
>>>>>>
>>>>>>
>>>>>> https://github.com/apache/spark/pulls?q=is%3Apr+is%3Aopen+sort%3Acreated-asc
>>>>>>
>>>>>> GitHub has an Action for closing stale PRs.
>>>>>>
>>>>>> https://github.com/marketplace/actions/close-stale-issues
>>>>>>
>>>>>> What do folks think about deploying it? Does Apache Infra give us the
>>>>>> ability to even deploy a tool like this?
>>>>>>
>>>>>> Nick
>>>>>>
>>>>>


R linter is broken

2019-12-13 Thread Nicholas Chammas
The R linter GitHub action seems to be busted. Looks like we need to update
some repository references?

Nick


Re: Closing stale PRs with a GitHub Action

2019-12-06 Thread Nicholas Chammas
That's true, we do use Actions today. I wonder if Apache Infra allows
Actions to close PRs vs. just updating commit statuses. I only ask because
I remember permissions were an issue in the past when discussing tooling
like this.

In any case, I'd be happy to submit a PR adding this in if there are no
concerns. We can hash out the details on the PR.

On Fri, Dec 6, 2019 at 11:08 AM Sean Owen  wrote:

> I think we can add Actions, right? they're used for the newer tests in
> Github?
> I'm OK closing PRs inactive for a 'long time', where that's maybe 6-12
> months or something. It's standard practice and doesn't mean it can't be
> reopened.
> Often the related JIRA should be closed as well but we have done that
> separately with bulk-close in the past.
>
> On Thu, Dec 5, 2019 at 3:24 PM Nicholas Chammas <
> nicholas.cham...@gmail.com> wrote:
>
>> It’s that topic again. 
>>
>> We have almost 500 open PRs. A good chunk of them are more than a year
>> old. The oldest open PR dates to summer 2015.
>>
>>
>> https://github.com/apache/spark/pulls?q=is%3Apr+is%3Aopen+sort%3Acreated-asc
>>
>> GitHub has an Action for closing stale PRs.
>>
>> https://github.com/marketplace/actions/close-stale-issues
>>
>> What do folks think about deploying it? Does Apache Infra give us the
>> ability to even deploy a tool like this?
>>
>> Nick
>>
>


Closing stale PRs with a GitHub Action

2019-12-05 Thread Nicholas Chammas
It’s that topic again. 

We have almost 500 open PRs. A good chunk of them are more than a year old.
The oldest open PR dates to summer 2015.

https://github.com/apache/spark/pulls?q=is%3Apr+is%3Aopen+sort%3Acreated-asc

GitHub has an Action for closing stale PRs.

https://github.com/marketplace/actions/close-stale-issues

What do folks think about deploying it? Does Apache Infra give us the
ability to even deploy a tool like this?

Nick


Re: Auto-linking Jira tickets to their PRs

2019-12-03 Thread Nicholas Chammas
Hmm, looks like something weird is going on, since it seems to be working
here <https://issues.apache.org/jira/browse/SPARK-30091> and here
<https://issues.apache.org/jira/browse/SPARK-30113>. Perhaps it was
temporarily broken and is now working again.

On Tue, Dec 3, 2019 at 8:35 PM Hyukjin Kwon  wrote:

> I think it's broken .. cc Josh Rosen
>
> On Wed, Dec 4, 2019 at 10:25 AM Nicholas Chammas wrote:
>
>> We used to have a bot or something that automatically linked Jira tickets
>> to PRs that mentioned them in their title. I don't see that happening
>> anymore. <https://issues.apache.org/jira/browse/SPARK-29903>
>>
>> Did we intentionally remove this functionality, or is it temporarily
>> broken for some reason?
>>
>> Nick
>>
>>


Auto-linking Jira tickets to their PRs

2019-12-03 Thread Nicholas Chammas
We used to have a bot or something that automatically linked Jira tickets
to PRs that mentioned them in their title. I don't see that happening
anymore. 

Did we intentionally remove this functionality, or is it temporarily broken
for some reason?

Nick


Re: Can't build unidoc

2019-11-29 Thread Nicholas Chammas
That worked. Thanks Sean. Going forward, I will try that as a
troubleshooting step before posting on the dev list.

On Fri, Nov 29, 2019 at 1:04 PM Sean Owen  wrote:

> I'm not seeing that error for either command. Try blowing away your
> local .ivy / .m2 dir?
>
> On Fri, Nov 29, 2019 at 11:48 AM Nicholas Chammas
>  wrote:
> >
> > Howdy folks. Running `./build/sbt unidoc` on the latest master is giving
> me this trace:
> >
> > ```
> > [warn]  ::
> > [warn]  ::  UNRESOLVED DEPENDENCIES ::
> > [warn]  ::
> > [warn]  :: commons-collections#commons-collections;3.2.2:
> commons-collections#commons-collections;3.2.2!commo
> > ns-collections.pom(pom.original) origin location must be absolute:
> file:/Users/myusername/.m2/repository
> >
> /commons-collections/commons-collections/3.2.2/commons-collections-3.2.2.pom
> > [warn]  ::
> > [warn]
> > [warn]  Note: Unresolved dependencies path:
> > [warn]  commons-collections:commons-collections:3.2.2
> > [warn]+- commons-beanutils:commons-beanutils:1.9.4
> > [warn]+- com.puppycrawl.tools:checkstyle:8.25
> (/Users/myusername/Projects/nchammas/spark/pro
> > ject/plugins.sbt#L21-22)
> > [warn]+- default:spark-build:0.1-SNAPSHOT
> (scalaVersion=2.10, sbtVersion=0.13)
> > sbt.ResolveException: unresolved dependency:
> commons-collections#commons-collections;3.2.2: commons-collectio
> > ns#commons-collections;3.2.2!commons-collections.pom(pom.original)
> origin location must be absolute: file:/Us
> >
> ers/myusername/.m2/repository/commons-collections/commons-collections/3.2.2/commons-collections-3.2.2.po
> > m
> > at sbt.IvyActions$.sbt$IvyActions$$resolve(IvyActions.scala:320)
> > at
> sbt.IvyActions$$anonfun$updateEither$1.apply(IvyActions.scala:191)
> > at
> sbt.IvyActions$$anonfun$updateEither$1.apply(IvyActions.scala:168)
> > at sbt.IvySbt$Module$$anonfun$withModule$1.apply(Ivy.scala:156)
> > at sbt.IvySbt$Module$$anonfun$withModule$1.apply(Ivy.scala:156)
> > at sbt.IvySbt$$anonfun$withIvy$1.apply(Ivy.scala:133)
> > at sbt.IvySbt.sbt$IvySbt$$action$1(Ivy.scala:57)
> > at sbt.IvySbt$$anon$4.call(Ivy.scala:65)
> > at xsbt.boot.Locks$GlobalLock.withChannel$1(Locks.scala:93)
> > at
> xsbt.boot.Locks$GlobalLock.xsbt$boot$Locks$GlobalLock$$withChannelRetries$1(Locks.scala:78)
> > at
> xsbt.boot.Locks$GlobalLock$$anonfun$withFileLock$1.apply(Locks.scala:97)
> > at xsbt.boot.Using$.withResource(Using.scala:10)
> > at xsbt.boot.Using$.apply(Using.scala:9)
> > at
> xsbt.boot.Locks$GlobalLock.ignoringDeadlockAvoided(Locks.scala:58)
> > at xsbt.boot.Locks$GlobalLock.withLock(Locks.scala:48)
> > at xsbt.boot.Locks$.apply0(Locks.scala:31)
> > at xsbt.boot.Locks$.apply(Locks.scala:28)
> > at sbt.IvySbt.withDefaultLogger(Ivy.scala:65)
> > at sbt.IvySbt.withIvy(Ivy.scala:128)
> > at sbt.IvySbt.withIvy(Ivy.scala:125)
> > at sbt.IvySbt$Module.withModule(Ivy.scala:156)
> > at sbt.IvyActions$.updateEither(IvyActions.scala:168)
> > at
> sbt.Classpaths$$anonfun$sbt$Classpaths$$work$1$1.apply(Defaults.scala:1555)
> > at
> sbt.Classpaths$$anonfun$sbt$Classpaths$$work$1$1.apply(Defaults.scala:1551)
> > at
> sbt.Classpaths$$anonfun$doWork$1$1$$anonfun$122.apply(Defaults.scala:1586)
> > at
> sbt.Classpaths$$anonfun$doWork$1$1$$anonfun$122.apply(Defaults.scala:1584)
> > at sbt.Tracked$$anonfun$lastOutput$1.apply(Tracked.scala:37)
> > at sbt.Classpaths$$anonfun$doWork$1$1.apply(Defaults.scala:1589)
> > at sbt.Classpaths$$anonfun$doWork$1$1.apply(Defaults.scala:1583)
> > at sbt.Tracked$$anonfun$inputChanged$1.apply(Tracked.scala:60)
> > at sbt.Classpaths$.cachedUpdate(Defaults.scala:1606)
> > at
> sbt.Classpaths$$anonfun$updateTask$1.apply(Defaults.scala:1533)
> > at
> sbt.Classpaths$$anonfun$updateTask$1.apply(Defaults.scala:1485)
> > at scala.Function1$$anonfun$compose$1.apply(Function1.scala:47)
> > at
> sbt.$tilde$greater$$anonfun$$u2219$1.apply(TypeFunctions.scala:40)
> > at sbt.std.Transform$$anon$4.work(System.scala:63)
> > at
> sbt.Execute$$anonfun$submit$1$$anonfun$apply$1.apply(Execute.scala:228)
> > at
> sbt.Execute$$anonfun$submit$1$$anonfun$ap

Can't build unidoc

2019-11-29 Thread Nicholas Chammas
Howdy folks. Running `./build/sbt unidoc` on the latest master is giving me
this trace:

```
[warn]  ::
[warn]  ::  UNRESOLVED DEPENDENCIES ::
[warn]  ::
[warn]  :: commons-collections#commons-collections;3.2.2:
commons-collections#commons-collections;3.2.2!commo
ns-collections.pom(pom.original) origin location must be absolute:
file:/Users/myusername/.m2/repository
/commons-collections/commons-collections/3.2.2/commons-collections-3.2.2.pom
[warn]  ::
[warn]
[warn]  Note: Unresolved dependencies path:
[warn]  commons-collections:commons-collections:3.2.2
[warn]+- commons-beanutils:commons-beanutils:1.9.4
[warn]+- com.puppycrawl.tools:checkstyle:8.25
(/Users/myusername/Projects/nchammas/spark/pro
ject/plugins.sbt#L21-22)
[warn]+- default:spark-build:0.1-SNAPSHOT (scalaVersion=2.10,
sbtVersion=0.13)
sbt.ResolveException: unresolved dependency:
commons-collections#commons-collections;3.2.2: commons-collectio
ns#commons-collections;3.2.2!commons-collections.pom(pom.original) origin
location must be absolute: file:/Us
ers/myusername/.m2/repository/commons-collections/commons-collections/3.2.2/commons-collections-3.2.2.po
m
at sbt.IvyActions$.sbt$IvyActions$$resolve(IvyActions.scala:320)
at
sbt.IvyActions$$anonfun$updateEither$1.apply(IvyActions.scala:191)
at
sbt.IvyActions$$anonfun$updateEither$1.apply(IvyActions.scala:168)
at sbt.IvySbt$Module$$anonfun$withModule$1.apply(Ivy.scala:156)
at sbt.IvySbt$Module$$anonfun$withModule$1.apply(Ivy.scala:156)
at sbt.IvySbt$$anonfun$withIvy$1.apply(Ivy.scala:133)
at sbt.IvySbt.sbt$IvySbt$$action$1(Ivy.scala:57)
at sbt.IvySbt$$anon$4.call(Ivy.scala:65)
at xsbt.boot.Locks$GlobalLock.withChannel$1(Locks.scala:93)
at
xsbt.boot.Locks$GlobalLock.xsbt$boot$Locks$GlobalLock$$withChannelRetries$1(Locks.scala:78)
at
xsbt.boot.Locks$GlobalLock$$anonfun$withFileLock$1.apply(Locks.scala:97)
at xsbt.boot.Using$.withResource(Using.scala:10)
at xsbt.boot.Using$.apply(Using.scala:9)
at
xsbt.boot.Locks$GlobalLock.ignoringDeadlockAvoided(Locks.scala:58)
at xsbt.boot.Locks$GlobalLock.withLock(Locks.scala:48)
at xsbt.boot.Locks$.apply0(Locks.scala:31)
at xsbt.boot.Locks$.apply(Locks.scala:28)
at sbt.IvySbt.withDefaultLogger(Ivy.scala:65)
at sbt.IvySbt.withIvy(Ivy.scala:128)
at sbt.IvySbt.withIvy(Ivy.scala:125)
at sbt.IvySbt$Module.withModule(Ivy.scala:156)
at sbt.IvyActions$.updateEither(IvyActions.scala:168)
at
sbt.Classpaths$$anonfun$sbt$Classpaths$$work$1$1.apply(Defaults.scala:1555)
at
sbt.Classpaths$$anonfun$sbt$Classpaths$$work$1$1.apply(Defaults.scala:1551)
at
sbt.Classpaths$$anonfun$doWork$1$1$$anonfun$122.apply(Defaults.scala:1586)
at
sbt.Classpaths$$anonfun$doWork$1$1$$anonfun$122.apply(Defaults.scala:1584)
at sbt.Tracked$$anonfun$lastOutput$1.apply(Tracked.scala:37)
at sbt.Classpaths$$anonfun$doWork$1$1.apply(Defaults.scala:1589)
at sbt.Classpaths$$anonfun$doWork$1$1.apply(Defaults.scala:1583)
at sbt.Tracked$$anonfun$inputChanged$1.apply(Tracked.scala:60)
at sbt.Classpaths$.cachedUpdate(Defaults.scala:1606)
at sbt.Classpaths$$anonfun$updateTask$1.apply(Defaults.scala:1533)
at sbt.Classpaths$$anonfun$updateTask$1.apply(Defaults.scala:1485)
at scala.Function1$$anonfun$compose$1.apply(Function1.scala:47)
at
sbt.$tilde$greater$$anonfun$$u2219$1.apply(TypeFunctions.scala:40)
at sbt.std.Transform$$anon$4.work(System.scala:63)
at
sbt.Execute$$anonfun$submit$1$$anonfun$apply$1.apply(Execute.scala:228)
at
sbt.Execute$$anonfun$submit$1$$anonfun$apply$1.apply(Execute.scala:228)
at sbt.ErrorHandling$.wideConvert(ErrorHandling.scala:17)
at sbt.Execute.work(Execute.scala:237)
at sbt.Execute$$anonfun$submit$1.apply(Execute.scala:228)
at sbt.Execute$$anonfun$submit$1.apply(Execute.scala:228)
at
sbt.ConcurrentRestrictions$$anon$4$$anonfun$1.apply(ConcurrentRestrictions.scala:159)
at sbt.CompletionService$$anon$2.call(CompletionService.scala:28)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
[error] (*:update) sbt.ResolveException: unresolved dependency:
commons-collections#commons-collections;3.2.2:

Re: Use Hadoop-3.2 as a default Hadoop profile in 3.0.0?

2019-11-19 Thread Nicholas Chammas
> I don't think the default Hadoop version matters except for the
spark-hadoop-cloud module, which is only meaningful under the hadoop-3.2
profile.

What do you mean by "only meaningful under the hadoop-3.2 profile"?

On Tue, Nov 19, 2019 at 5:40 PM Cheng Lian  wrote:

> Hey Steve,
>
> In terms of Maven artifact, I don't think the default Hadoop version
> matters except for the spark-hadoop-cloud module, which is only meaningful
> under the hadoop-3.2 profile. All  the other spark-* artifacts published to
> Maven central are Hadoop-version-neutral.
>
> Another issue about switching the default Hadoop version to 3.2 is PySpark
> distribution. Right now, we only publish PySpark artifacts prebuilt with
> Hadoop 2.x to PyPI. I'm not sure whether bumping the Hadoop dependency to
> 3.2 is feasible for PySpark users. Or maybe we should publish PySpark
> prebuilt with both Hadoop 2.x and 3.x. I'm open to suggestions on this one.
>
> Again, as long as Hive 2.3 and Hadoop 3.2 upgrade can be decoupled via the
> proposed hive-2.3 profile, I personally don't have a preference over having
> Hadoop 2.7 or 3.2 as the default Hadoop version. But just for minimizing
> the release management work, in case we decided to publish other spark-*
> Maven artifacts from a Hadoop 2.7 build, we can still special case
> spark-hadoop-cloud and publish it using a hadoop-3.2 build.
>
> On Mon, Nov 18, 2019 at 8:39 PM Dongjoon Hyun 
> wrote:
>
>> I also agree with Steve and Felix.
>>
>> Let's have another thread to discuss Hive issue
>>
>> because this thread was originally for `hadoop` version.
>>
>> And, now we can have `hive-2.3` profile for both `hadoop-2.7` and
>> `hadoop-3.0` versions.
>>
>> We don't need to mix both.
>>
>> Bests,
>> Dongjoon.
>>
>>
>> On Mon, Nov 18, 2019 at 8:19 PM Felix Cheung 
>> wrote:
>>
>>> 1000% with Steve, the org.spark-project hive 1.2 will need a solution.
>>> It is old and rather buggy; and It’s been *years*
>>>
>>> I think we should decouple hive change from everything else if people
>>> are concerned?
>>>
>>> --
>>> *From:* Steve Loughran 
>>> *Sent:* Sunday, November 17, 2019 9:22:09 AM
>>> *To:* Cheng Lian 
>>> *Cc:* Sean Owen ; Wenchen Fan ;
>>> Dongjoon Hyun ; dev ;
>>> Yuming Wang 
>>> *Subject:* Re: Use Hadoop-3.2 as a default Hadoop profile in 3.0.0?
>>>
>>> Can I take this moment to remind everyone that the version of hive which
>>> spark has historically bundled (the org.spark-project one) is an orphan
>>> project put together to deal with Hive's shading issues and a source of
>>> unhappiness in the Hive project. What ever get shipped should do its best
>>> to avoid including that file.
>>>
>>> Postponing a switch to hadoop 3.x after spark 3.0 is probably the safest
>>> move from a risk minimisation perspective. If something has broken then it
>>> is you can start with the assumption that it is in the o.a.s packages
>>> without having to debug o.a.hadoop and o.a.hive first. There is a cost: if
>>> there are problems with the hadoop / hive dependencies those teams will
>>> inevitably ignore filed bug reports for the same reason spark team will
>>> probably because 1.6-related JIRAs as WONTFIX. WONTFIX responses for the
>>> Hadoop 2.x line include any compatibility issues with Java 9+. Do bear that
>>> in mind. It's not been tested, it has dependencies on artifacts we know are
>>> incompatible, and as far as the Hadoop project is concerned: people should
>>> move to branch 3 if they want to run on a modern version of Java
>>>
>>> It would be really really good if the published spark maven artefacts
>>> (a) included the spark-hadoop-cloud JAR and (b) were dependent upon hadoop
>>> 3.x. That way people doing things with their own projects will get
>>> up-to-date dependencies and don't get WONTFIX responses themselves.
>>>
>>> -Steve
>>>
>>> PS: Discussion on hadoop-dev @ making Hadoop 2.10 the official "last
>>> ever" branch-2 release and then declare its predecessors EOL; 2.10 will be
>>> the transition release.
>>>
>>> On Sun, Nov 17, 2019 at 1:50 AM Cheng Lian 
>>> wrote:
>>>
>>> Dongjoon, I didn't follow the original Hive 2.3 discussion closely. I
>>> thought the original proposal was to replace Hive 1.2 with Hive 2.3, which
>>> seemed risky, and therefore we only introduced Hive 2.3 under the
>>> hadoop-3.2 profile without removing Hive 1.2. But maybe I'm totally wrong
>>> here...
>>>
>>> Sean, Yuming's PR https://github.com/apache/spark/pull/26533 showed
>>> that Hadoop 2 + Hive 2 + JDK 11 looks promising. My major motivation is not
>>> about demand, but risk control: coupling Hive 2.3, Hadoop 3.2, and JDK 11
>>> upgrade together looks too risky.
>>>
>>> On Sat, Nov 16, 2019 at 4:03 AM Sean Owen  wrote:
>>>
>>> I'd prefer simply not making Hadoop 3 the default until 3.1+, rather
>>> than introduce yet another build combination. Does Hadoop 2 + Hive 2
>>> work and is there demand for it?
>>>
>>> On Sat, Nov 16, 2019 at 3:52 AM Wenchen Fan  wrote:
>>> >
>>> > Do we 

Re: [ANNOUNCE] Announcing Apache Spark 3.0.0-preview

2019-11-16 Thread Nicholas Chammas
> Data Source API with Catalog Supports

Where can we read more about this? The linked Nabble thread doesn't mention
the word "Catalog".

On Thu, Nov 7, 2019 at 5:53 PM Xingbo Jiang  wrote:

> Hi all,
>
> To enable wide-scale community testing of the upcoming Spark 3.0 release,
> the Apache Spark community has posted a preview release of Spark 3.0. This
> preview is *not a stable release in terms of either API or functionality*,
> but it is meant to give the community early access to try the code that
> will become Spark 3.0. If you would like to test the release, please
> download it, and send feedback using either the mailing lists or JIRA.
>
> There are a lot of exciting new features added to Spark 3.0, including
> Dynamic Partition Pruning, Adaptive Query Execution, Accelerator-aware
> Scheduling, Data Source API with Catalog Supports, Vectorization in SparkR,
> support of Hadoop 3/JDK 11/Scala 2.12, and many more. For a full list of
> major features and changes in Spark 3.0.0-preview, please check the thread(
> http://apache-spark-developers-list.1001551.n3.nabble.com/Spark-3-0-preview-release-feature-list-and-major-changes-td28050.html
> ).
>
> We'd like to thank our contributors and users for their contributions and
> early feedback to this release. This release would not have been possible
> without you.
>
> To download Spark 3.0.0-preview, head over to the download page:
> https://archive.apache.org/dist/spark/spark-3.0.0-preview
>
> Thanks,
>
> Xingbo
>
>


Re: Use Hadoop-3.2 as a default Hadoop profile in 3.0.0?

2019-11-03 Thread Nicholas Chammas
On Fri, Nov 1, 2019 at 8:41 AM Steve Loughran 
wrote:

> It would be really good if the spark distributions shipped with later
> versions of the hadoop artifacts.
>

I second this. If we need to keep a Hadoop 2.x profile around, why not make
it Hadoop 2.8 or something newer?

Koert Kuipers  wrote:

> given that latest hdp 2.x is still hadoop 2.7 bumping hadoop 2 profile to
> latest would probably be an issue for us.


When was the last time HDP 2.x bumped their minor version of Hadoop? Do we
want to wait for them to bump to Hadoop 2.8 before we do the same?


Spark 3.0 and S3A

2019-10-28 Thread Nicholas Chammas
Howdy folks,

I have a question about what is happening with the 3.0 release in relation
to Hadoop and hadoop-aws.

Today, among other builds, we release a build of Spark built against Hadoop
2.7 and another one built without Hadoop. In Spark 3+, will we continue to
release Hadoop 2.7 builds as one of the primary downloads on the download
page? Or will we start building
Spark against a newer version of Hadoop?

The reason I ask is because successive versions of hadoop-aws have made
significant usability improvements to S3A. To get those, users need to
download the Hadoop-free build of Spark and then link
Spark to a version of Hadoop newer than 2.7. There are various dependency
and runtime issues with trying to pair Spark built against Hadoop 2.7 with
hadoop-aws 2.8 or newer.

If we start releasing builds of Spark built against Hadoop 3.2 (or another
recent version), users can get the latest S3A improvements via --packages
"org.apache.hadoop:hadoop-aws:3.2.1" without needing to download Hadoop
separately.
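
To make the use case concrete, here is a hedged sketch of what that would
allow straight from PySpark (the bucket, prefix, and credentials provider are
placeholders, and the hadoop-aws version must match the Hadoop version Spark
was built against):

```
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .config("spark.jars.packages", "org.apache.hadoop:hadoop-aws:3.2.1")
         .config("spark.hadoop.fs.s3a.aws.credentials.provider",
                 "com.amazonaws.auth.DefaultAWSCredentialsProviderChain")
         .getOrCreate())

# This only works when the bundled Hadoop jars and hadoop-aws line up; mixing
# Hadoop 2.7 with hadoop-aws 3.x is exactly the dependency headache above.
spark.read.text("s3a://some-bucket/some-prefix/").show()
```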

Nick


Re: DSv2 sync - 4 September 2019

2019-09-09 Thread Nicholas Chammas
Ah yes, on rereading the original email I see that the sync discussion was
different. Thanks for the clarification! I’ll file a JIRA about PERMISSIVE.

On Mon, Sep 9, 2019 at 6:05 AM Wenchen Fan wrote:

> Hi Nicholas,
>
> You are talking about a different thing. The PERMISSIVE mode is the
> failure mode for reading text-based data source (json, csv, etc.). It's not
> the general failure mode for Spark table insertion.
>
> I agree with you that the PERMISSIVE mode is hard to use. Feel free to
> open a JIRA ticket if you have some better ideas.
>
> Thanks,
> Wenchen
>
> On Mon, Sep 9, 2019 at 12:46 AM Nicholas Chammas <
> nicholas.cham...@gmail.com> wrote:
>
>> A quick question about failure modes, as a casual observer of the DSv2
>> effort:
>>
>> I was considering filing a JIRA ticket about enhancing the
>> DataFrameReader to include the failure *reason* in addition to the
>> corrupt record when the mode is PERMISSIVE. So if you are loading a CSV,
>> for example, and a value cannot be automatically cast to the type you
>> specify in the schema, you'll get the corrupt record in the column
>> configured by columnNameOfCorruptRecord, but you'll also get some detail
>> about what exactly made the record corrupt, perhaps in a new column
>> specified by something like columnNameOfCorruptReason.
>>
>> Is this an enhancement that would be possible in DSv2?
>>
>> On Fri, Sep 6, 2019 at 6:28 PM Ryan Blue 
>> wrote:
>>
>>> Here are my notes from the latest sync. Feel free to reply with
>>> clarifications if I’ve missed anything.
>>>
>>> *Attendees*:
>>>
>>> Ryan Blue
>>> John Zhuge
>>> Russell Spitzer
>>> Matt Cheah
>>> Gengliang Wang
>>> Priyanka Gomatam
>>> Holden Karau
>>>
>>> *Topics*:
>>>
>>>- DataFrameWriterV2 insert vs append (recap)
>>>- ANSI and strict modes for inserting casts
>>>- Separating identifier resolution from table lookup
>>>- Open PRs
>>>   - SHOW NAMESPACES - https://github.com/apache/spark/pull/25601
>>>   - DataFrameWriterV2 - https://github.com/apache/spark/pull/25681
>>>   - TableProvider API update -
>>>   https://github.com/apache/spark/pull/25651
>>>   - UPDATE - https://github.com/apache/spark/pull/25626
>>>
>>> *Discussion*:
>>>
>>>- DataFrameWriterV2 insert vs append discussion recapped the
>>>agreement from last sync
>>>- ANSI and strict modes for inserting casts:
>>>   - Russell: Failure modes are important. ANSI behavior is to fail
>>>   at runtime, not analysis time. If a cast is allowed, but doesn’t 
>>> throw an
>>>   exception at runtime then this can’t be considered ANSI behavior.
>>>   - Gengliang: ANSI adds the cast
>>>   - Matt: Sounds like there are two conflicting views of the world.
>>>   Is the default ANSI behavior to insert a cast that may produce NULL 
>>> or to
>>>   fail at runtime?
>>>   - Ryan: So analysis and runtime behaviors can’t be separate?
>>>   - Matt: Analysis behavior is influenced by behavior at runtime.
>>>   Maybe the vote should cover both?
>>>   - Russell: (linked to the standard) There are 3 steps: if numeric
>>>   and same type, use the data value. If the value can be rounded or
>>>   truncated, round or truncate. Otherwise, throw an exception that a 
>>> value
>>>   can’t be cast. These are runtime requirements.
>>>   - Ryan: Another consideration is that we can make Spark more
>>>   permissive, but can’t make Spark more strict in future releases.
>>>   - Matt: v1 silently corrupts data
>>>   - Russell: ANSI is fine, as long as the runtime matches (is
>>>   ANSI). Don’t tell people it’s ANSI and not do ANSI completely.
>>>   - Gengliang: people are concerned about long-running jobs failing
>>>   at the end
>>>   - Ryan: That’s okay because they can change the defaults: use
>>>   strict analysis-time validation, or allow casts to produce NULL 
>>> values.
>>>   - Matt: As long as this is well documented, it should be fine
>>>   - Ryan: Can we run tests to find out what exactly the behavior is?
>>>   - Gengliang: sqlfiddle.com
>>>   - Russell ran tests in MySQL and Postgres. Both threw runtime
>>>   failures.
>>>   - Matt: Let’s move on, but add the runtime behavior to the VOTE
>

Re: DSv2 sync - 4 September 2019

2019-09-08 Thread Nicholas Chammas
A quick question about failure modes, as a casual observer of the DSv2
effort:

I was considering filing a JIRA ticket about enhancing the DataFrameReader
to include the failure *reason* in addition to the corrupt record when the
mode is PERMISSIVE. So if you are loading a CSV, for example, and a value
cannot be automatically cast to the type you specify in the schema, you'll
get the corrupt record in the column configured by
columnNameOfCorruptRecord, but you'll also get some detail about what
exactly made the record corrupt, perhaps in a new column specified by
something like columnNameOfCorruptReason.

Is this an enhancement that would be possible in DSv2?
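
For context, a minimal sketch of how PERMISSIVE mode surfaces corrupt records
today (the schema, path, and corrupt-record column name are illustrative); the
enhancement above would add a second column explaining why each record was
rejected:

```
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, IntegerType, StringType

spark = SparkSession.builder.getOrCreate()

schema = StructType([
    StructField("id", IntegerType(), True),
    StructField("name", StringType(), True),
    # Rows that fail to parse land here verbatim, with the typed columns null.
    StructField("_corrupt_record", StringType(), True),
])

df = (spark.read
      .schema(schema)
      .option("mode", "PERMISSIVE")
      .option("columnNameOfCorruptRecord", "_corrupt_record")
      .csv("/tmp/example.csv"))

# Today you can see which rows were corrupt, but not why they were rejected.
df.show(truncate=False)
```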

On Fri, Sep 6, 2019 at 6:28 PM Ryan Blue  wrote:

> Here are my notes from the latest sync. Feel free to reply with
> clarifications if I’ve missed anything.
>
> *Attendees*:
>
> Ryan Blue
> John Zhuge
> Russell Spitzer
> Matt Cheah
> Gengliang Wang
> Priyanka Gomatam
> Holden Karau
>
> *Topics*:
>
>- DataFrameWriterV2 insert vs append (recap)
>- ANSI and strict modes for inserting casts
>- Separating identifier resolution from table lookup
>- Open PRs
>   - SHOW NAMESPACES - https://github.com/apache/spark/pull/25601
>   - DataFrameWriterV2 - https://github.com/apache/spark/pull/25681
>   - TableProvider API update -
>   https://github.com/apache/spark/pull/25651
>   - UPDATE - https://github.com/apache/spark/pull/25626
>
> *Discussion*:
>
>- DataFrameWriterV2 insert vs append discussion recapped the agreement
>from last sync
>- ANSI and strict modes for inserting casts:
>   - Russell: Failure modes are important. ANSI behavior is to fail at
>   runtime, not analysis time. If a cast is allowed, but doesn’t throw an
>   exception at runtime then this can’t be considered ANSI behavior.
>   - Gengliang: ANSI adds the cast
>   - Matt: Sounds like there are two conflicting views of the world.
>   Is the default ANSI behavior to insert a cast that may produce NULL or 
> to
>   fail at runtime?
>   - Ryan: So analysis and runtime behaviors can’t be separate?
>   - Matt: Analysis behavior is influenced by behavior at runtime.
>   Maybe the vote should cover both?
>   - Russell: (linked to the standard) There are 3 steps: if numeric
>   and same type, use the data value. If the value can be rounded or
>   truncated, round or truncate. Otherwise, throw an exception that a value
>   can’t be cast. These are runtime requirements.
>   - Ryan: Another consideration is that we can make Spark more
>   permissive, but can’t make Spark more strict in future releases.
>   - Matt: v1 silently corrupts data
>   - Russell: ANSI is fine, as long as the runtime matches (is ANSI).
>   Don’t tell people it’s ANSI and not do ANSI completely.
>   - Gengliang: people are concerned about long-running jobs failing
>   at the end
>   - Ryan: That’s okay because they can change the defaults: use
>   strict analysis-time validation, or allow casts to produce NULL values.
>   - Matt: As long as this is well documented, it should be fine
>   - Ryan: Can we run tests to find out what exactly the behavior is?
>   - Gengliang: sqlfiddle.com
>   - Russell ran tests in MySQL and Postgres. Both threw runtime
>   failures.
>   - Matt: Let’s move on, but add the runtime behavior to the VOTE
>- Identifier resolution and table lookup
>   - Ryan: recent changes merged identifier resolution and table
>   lookup together because identifiers owned by the session catalog need 
> to be
>   loaded to find out whether to use v1 or v2 plans. I think this should be
>   separated so that identifier resolution happens independently to ensure
>   that the two separate tasks don’t end up getting done at the same time 
> and
>   over-complicating the analyzer.
>- SHOW NAMESPACES - Ready for final review
>- DataFrameWriterV2:
>   - Ryan: Tests failed after passing on the PR. Anyone know why that
>   would happen?
>   - Gengliang: tests failed in maven
>   - Holden: PR validation runs SBT tests
>- TableProvider API update: skipped because Wenchen didn’t make it
>- UPDATE support PR
>   - Ryan: There is a PR to add a SQL UPDATE command, but it delegates
>   entirely to the data source, which seems strange.
>   - Matt: What is Spark’s purpose here? Why would Spark parse a SQL
>   statement only to pass it entirely to another engine?
>   - Ryan: It does make sense to do this. If Spark eventually supports
>   MERGE INTO and other row-level operations, then it makes sense to push 
> down
>   the operation to some sources, like JDBC. I just find it backward to add
>   the pushdown API before adding an implementation that handles this 
> inside
>   Spark — pushdown is usually an optimization.
>   - Russell: Would this be safe? Spark retries lots of 
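
A minimal sketch of the cast-on-insert behavior debated above, assuming the
spark.sql.storeAssignmentPolicy setting that Spark 3.x exposes (the table name
and values are illustrative):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# ANSI store assignment: the bigint -> int cast is still inserted at analysis
# time, but an out-of-range value fails at runtime instead of being silently
# mangled the way the legacy v1 cast allows.
spark.conf.set("spark.sql.storeAssignmentPolicy", "ANSI")

spark.sql("CREATE TABLE target_ints (i INT) USING parquet")
spark.sql("INSERT INTO target_ints SELECT CAST(1 AS BIGINT)")  # fits, stored as 1

try:
    spark.sql("INSERT INTO target_ints SELECT CAST(3000000000 AS BIGINT)")  # overflows INT
except Exception as err:
    print("runtime failure, as ANSI requires:", err)

# Setting the policy to LEGACY instead restores the permissive v1 behavior.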

Providing a namespace for third-party configurations

2019-08-30 Thread Nicholas Chammas
I discovered today that EMR provides its own optimizations for Spark.
Some of these optimizations are controlled by configuration settings with
names like `spark.sql.dynamicPartitionPruning.enabled` or
`spark.sql.optimizer.flattenScalarSubqueriesWithAggregates.enabled`. As far
as I can tell ,
these are EMR-specific configurations.

Does this create a potential problem, since it's possible that future
Apache Spark configuration settings may end up colliding with these names
selected by EMR?

Should we document some sort of third-party configuration namespace pattern
and encourage third parties to scope their custom configurations to that
area? e.g. Something like `spark.external.[vendor].[whatever]`.
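
A short sketch of how the proposed convention would look in practice (the
vendor-scoped key below is purely hypothetical and does not exist in Apache
Spark or EMR):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# A vendor would scope its settings under the proposed namespace...
spark.conf.set("spark.external.emr.flattenScalarSubqueries.enabled", "true")

# ...and read them back at runtime, with no risk of colliding with
# configuration names that Apache Spark itself may introduce later.
print(spark.conf.get("spark.external.emr.flattenScalarSubqueries.enabled"))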

Nick


Re: Recognizing non-code contributions

2019-08-05 Thread Nicholas Chammas
On Mon, Aug 5, 2019 at 9:55 AM Sean Owen  wrote:

> On Mon, Aug 5, 2019 at 3:50 AM Myrle Krantz  wrote:
> > So... events coordinators?  I'd still make them committers.  I guess I'm
> still struggling to understand what problem making people VIP's without
> giving them committership is trying to solve.
>
> We may just agree to disagree, which is fine, but I think the argument
> is clear enough: such a person has zero need for the commit bit.
> Turning it around, what are we trying to accomplish by giving said
> person a commit bit? I know people say there's no harm, but I think
> there is at least _some_ downside. We're widening access to change
> software artifacts, the main thing that we put ASF process and checks
> around for liability reasons. I know the point is trust, and said
> person is likely to understand to never use the commit bit, but it
> brings us back to the same place. I don't wish to convince anyone else
> of my stance, though I do find it more logical, just that it's
> reasonable within The Apache Way.
>

+1 to this.


Python API for mapGroupsWithState

2019-08-02 Thread Nicholas Chammas
Can someone succinctly describe the challenge in adding the
`mapGroupsWithState()` API to PySpark?

I was hoping for some suboptimal but nonetheless working solution to be
available in Python, as there is with Python UDFs for example, but that
doesn't seem to be the case. The JIRA ticket for arbitrary stateful operations
in Structured Streaming 
doesn't give any indication that a Python version of the API is coming.

Is this something that will likely be added in the near future, or is it a
major undertaking? Can someone briefly describe the problem?

Nick


Re: Suggestion on Join Approach with Spark

2019-05-15 Thread Nicholas Chammas
This kind of question is for the User list, or for something like Stack
Overflow. It's not on topic here.

The dev list (i.e. this list) is for discussions about the development of
Spark itself.

On Wed, May 15, 2019 at 1:50 PM Chetan Khatri 
wrote:

> Any one help me, I am confused. :(
>
> On Wed, May 15, 2019 at 7:28 PM Chetan Khatri 
> wrote:
>
>> Hello Spark Developers,
>>
>> I have a question on Spark Join I am doing.
>>
>> I have a full load data from RDBMS and storing at HDFS let's say,
>>
>> val historyDF = spark.read.parquet("/home/test/transaction-line-item")
>>
>> and I am getting changed data at seperate hdfs path,let's say;
>>
>> val deltaDF = spark.read.parquet("/home/test/transaction-line-item-delta")
>>
>> Now I would like to take rows from deltaDF and ignore only those records 
>> from historyDF, and write to some MySQL table.
>>
>> Once I am done with writing to the MySQL table, I would like to update 
>> /home/test/transaction-line-item as overwrite. Now I can't just
>>
>> overwrite because of lazy evaluation and the DAG structure, unless I write to 
>> somewhere else and then write back as overwrite.
>>
>> val syncDataDF = historyDF.join(deltaDF.select("TRANSACTION_BY_LINE_ID", 
>> "sys_change_column"), Seq("TRANSACTION_BY_LINE_ID"),
>>   "left_outer").filter(deltaDF.col("sys_change_column").isNull)
>> .drop(deltaDF.col("sys_change_column"))
>>
>> val mergedDataDF = syncDataDF.union(deltaDF)
>>
>> I believe that without doing a union, this can be done with only a join. Please 
>> suggest the best approach.
>>
>> I can't write mergedDataDF back to the path of historyDF, because I am only 
>> reading from there. What I am doing is to write to a temp
>>
>> path and then read from there and write back, which is a bad idea. I need a 
>> suggestion here...
>>
>>
>> mergedDataDF.write.mode(SaveMode.Overwrite).parquet("home/test/transaction-line-item-temp/")
>> val tempMergedDF = 
>> spark.read.parquet("home/test/transaction-line-item-temp/")
>> tempMergedDF.write.mode(SaveMode.Overwrite).parquet("/home/test/transaction-line-item")
>>
>>
>> Please suggest me best approach.
>>
>>
>> Thanks
>>
>>
>>
>>


Re: [PySpark] Revisiting PySpark type annotations

2019-01-25 Thread Nicholas Chammas
I think the annotations are compatible with Python 2 since Maciej
implemented them via stub files, which Python 2
simply ignores. Folks using mypy to check types
will get the benefit whether they're on Python 2 or 3, since mypy works
with both.
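
For illustration, a tiny hypothetical fragment of such a stub (a column.pyi
file shipped next to column.py; not the actual pyspark-stubs content):

from typing import Union

class Column:
    # Stub files carry signatures only; the interpreter never imports them,
    # while mypy reads them to type-check user code on Python 2 or 3.
    def alias(self, *alias: str) -> "Column": ...
    def cast(self, dataType: str) -> "Column": ...
    def __and__(self, other: Union["Column", bool]) -> "Column": ...
    def __invert__(self) -> "Column": ...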

On Fri, Jan 25, 2019 at 12:27 PM Reynold Xin  wrote:

> If we can make the annotation compatible with Python 2, why don’t we add
> type annotation to make life easier for users of Python 3 (with type)?
>
> On Fri, Jan 25, 2019 at 7:53 AM Maciej Szymkiewicz 
> wrote:
>
>>
>> Hello everyone,
>>
>> I'd like to revisit the topic of adding PySpark type annotations in 3.0.
>> It has been discussed before (
>> http://apache-spark-developers-list.1001551.n3.nabble.com/Python-friendly-API-for-Spark-3-0-td25016.html
>> and
>> http://apache-spark-developers-list.1001551.n3.nabble.com/PYTHON-PySpark-typing-hints-td21560.html)
>> and is tracked by SPARK-17333 (
>> https://issues.apache.org/jira/browse/SPARK-17333). Is there any
>> consensus here?
>>
>> In the spirit of full disclosure I am trying to decide if, and if yes to
>> what extent, migrate my stub package (
>> https://github.com/zero323/pyspark-stubs) to 3.0 and beyond. Maintaining
>> such package is relatively time consuming (not being active PySpark user
>> anymore, it is the least priority for me at the moment) and if there any
>> official plans to make it obsolete, it would be a valuable information for
>> me.
>>
>> If there are no plans to add native annotations to PySpark, I'd like to
>> use this opportunity to ask PySpark commiters, to drop by and open issue (
>> https://github.com/zero323/pyspark-stubs/issues)  when new methods are
>> introduced, or there are changes in the existing API (PR's are of course
>> welcomed as well). Thanks in advance.
>>
>> --
>> Best,
>> Maciej
>>
>>
>>


Re: Ask for reviewing on Structured Streaming PRs

2019-01-14 Thread Nicholas Chammas
OK, good to know, and that all makes sense. Thanks for clearing up my
concern.

One of the great things about Spark is, as you pointed out, that improvements
to core components benefit multiple features at once.

On Mon, Jan 14, 2019 at 8:36 PM Reynold Xin  wrote:

> BTW the largest change to SS right now is probably the entire data source
> API v2 effort, which aims to unify streaming and batch from data source
> perspective, and provide a reliable, expressive source/sink API.
>
>
> On Mon, Jan 14, 2019 at 5:34 PM, Reynold Xin  wrote:
>
>> There are a few things to keep in mind:
>>
>> 1. Structured Streaming isn't an independent project. It actually (by
>> design) depends on all the rest of Spark SQL, and virtually all
>> improvements to Spark SQL benefit Structured Streaming.
>>
>> 2. The project as far as I can tell is relatively mature for core ETL and
>> incremental processing purpose. I interact with a lot of users using it
>> everyday. We can always expand the use cases and add more, but that also
>> adds maintenance burden. In any case, it'd be good to get some activity
>> here.
>>
>>
>>
>>
>> On Mon, Jan 14, 2019 at 5:11 PM, Nicholas Chammas <
>> nicholas.cham...@gmail.com> wrote:
>>
>>> As an observer, this thread is interesting and concerning. Is there an
>>> emerging consensus that Structured Streaming is somehow not relevant
>>> anymore? Or is it just that folks consider it "complete enough"?
>>>
>>> Structured Streaming was billed as the replacement to DStreams. If
>>> committers, generally speaking, have lost interest in Structured Streaming,
>>> does that mean the Apache Spark project is somehow no longer aiming to
>>> provide a "first-class" solution to the problem of stream processing?
>>>
>>> On Mon, Jan 14, 2019 at 3:43 PM Jungtaek Lim  wrote:
>>>
>>>> Cody, I guess I already addressed your comments in the PR (#22138). The
>>>> approach was changed to address your concern, and after that Gabor helped
>>>> to review the PR. Please take a look again when you have time to get into.
>>>>
>>>>> On Tue, Jan 15, 2019 at 1:01 AM, Cody Koeninger wrote:
>>>>
>>>>> I feel like I've already said my piece on
>>>>> https://github.com/apache/spark/pull/22138 let me know if you have
>>>>> more questions.
>>>>>
>>>>> As for SS in general, I don't have a production SS deployment, so I'm
>>>>> less comfortable with reviewing large changes to it.  But if no other
>>>>> committers are working on it...
>>>>>
>>>>> On Sun, Jan 13, 2019 at 5:19 PM Sean Owen  wrote:
>>>>> >
>>>>> > Yes you're preaching to the choir here. SS does seem somewhat
>>>>> > abandoned by those that have worked on it. I have also been at times
>>>>> > frustrated that some areas fall into this pattern.
>>>>> >
>>>>> > There isn't a way to make people work on it, and I personally am not
>>>>> > interested in it nor have a background in SS.
>>>>> >
>>>>> > I did leave some comments on your PR and will see if we can get
>>>>> > comfortable with merging it, as I presume you are pretty
>>>>> knowledgeable
>>>>> > about the change.
>>>>> >
>>>>> > On Sun, Jan 13, 2019 at 4:55 PM Jungtaek Lim 
>>>>> wrote:
>>>>> > >
>>>>> > > Sean, this is actually a fail-back on pinging committers. I know
>>>>> who can review and merge in SS area, and pinged to them, didn't work. Even
>>>>> there's a PR which approach was encouraged by committer and reviewed the
>>>>> first phase, and no review.
>>>>> > >
>>>>> > > That's not the first time I have faced the situation, and I used
>>>>> the fail-back approach at that time. (You can see there was no response
>>>>> even in the mail thread.) Not sure which approach worked.
>>>>> > >
>>>>> https://lists.apache.org/thread.html/c61f32249949b1ff1b265c1a7148c2ea7eda08891e3016fb24008561@%3Cdev.spark.apache.org%3E
>>>>> > >
>>>>> > > I've observed that only (critical) bugfixes are being reviewed and
>>>>> merged in time for SS area. For other stuffs like new features and
>>>>> improvements, both discussions and PRs were pretty less popular from
>>>&g

Re: Ask for reviewing on Structured Streaming PRs

2019-01-14 Thread Nicholas Chammas
As an observer, this thread is interesting and concerning. Is there an
emerging consensus that Structured Streaming is somehow not relevant
anymore? Or is it just that folks consider it "complete enough"?

Structured Streaming was billed as the replacement to DStreams. If
committers, generally speaking, have lost interest in Structured Streaming,
does that mean the Apache Spark project is somehow no longer aiming to
provide a "first-class" solution to the problem of stream processing?

On Mon, Jan 14, 2019 at 3:43 PM Jungtaek Lim  wrote:

> Cody, I guess I already addressed your comments in the PR (#22138). The
> approach was changed to address your concern, and after that Gabor helped
> to review the PR. Please take a look again when you have time to get into.
>
> On Tue, Jan 15, 2019 at 1:01 AM, Cody Koeninger wrote:
>
>> I feel like I've already said my piece on
>> https://github.com/apache/spark/pull/22138 let me know if you have
>> more questions.
>>
>> As for SS in general, I don't have a production SS deployment, so I'm
>> less comfortable with reviewing large changes to it.  But if no other
>> committers are working on it...
>>
>> On Sun, Jan 13, 2019 at 5:19 PM Sean Owen  wrote:
>> >
>> > Yes you're preaching to the choir here. SS does seem somewhat
>> > abandoned by those that have worked on it. I have also been at times
>> > frustrated that some areas fall into this pattern.
>> >
>> > There isn't a way to make people work on it, and I personally am not
>> > interested in it nor have a background in SS.
>> >
>> > I did leave some comments on your PR and will see if we can get
>> > comfortable with merging it, as I presume you are pretty knowledgeable
>> > about the change.
>> >
>> > On Sun, Jan 13, 2019 at 4:55 PM Jungtaek Lim  wrote:
>> > >
>> > > Sean, this is actually a fail-back on pinging committers. I know who
>> can review and merge in SS area, and pinged to them, didn't work. Even
>> there's a PR which approach was encouraged by committer and reviewed the
>> first phase, and no review.
>> > >
>> > > That's not the first time I have faced the situation, and I used the
>> fail-back approach at that time. (You can see there was no response even in
>> the mail thread.) Not sure which approach worked.
>> > >
>> https://lists.apache.org/thread.html/c61f32249949b1ff1b265c1a7148c2ea7eda08891e3016fb24008561@%3Cdev.spark.apache.org%3E
>> > >
>> > > I've observed that only (critical) bugfixes are being reviewed and
>> merged in time for SS area. For other stuffs like new features and
>> improvements, both discussions and PRs were pretty less popular from
>> committers: though there was even participation/approve from non-committer
>> community. I don't think SS is the thing to be turned into maintenance.
>> > >
>> > > I guess PMC members should try to resolve such situation, as it will
>> (slowly and quietly) make some issues like contributors leaving, module
>> stopped growing up, etc.. The problem will grow up like a snowball: getting
>> bigger and bigger. I don't mind if there's no interest on both contributors
>> and committers for such module, but SS is not. Maybe either other
>> committers who weren't familiar with should try to get familiar and cover
>> the area, or the area needs more committers.
>> > >
>> > > -Jungtaek Lim (HeartSaVioR)
>> > >
>> > > On Sun, Jan 13, 2019 at 11:37 PM, Sean Owen wrote:
>> > >>
>> > >> Jungtaek, the best strategy is to find who wrote the code you are
>> > >> modifying (use Github history or git blame) and ping them directly on
>> > >> the PR. I don't know this code well myself.
>> > >> It also helps if you can address why the functionality is important,
>> > >> and describe compatibility implications.
>> > >>
>> > >> Most PRs are not merged, note. Not commenting on this particular one,
>> > >> but it's not a 'bug' if it's not being merged.
>> > >>
>> > >> On Sun, Jan 13, 2019 at 12:29 AM Jungtaek Lim 
>> wrote:
>> > >> >
>> > >> > I'm sorry but let me remind this, as non-SS PRs are being reviewed
>> accordingly, whereas many of SS PRs (regardless of who create) are still
>> not reviewed and merged in time.
>> > >> >
>> > >> > On Thu, Jan 3, 2019 at 7:57 AM, Jungtaek Lim wrote:
>> > >> >>
>> > >> >> Spark devs, happy new year!
>> > >> >>
>> > >> >> I would like to remind this kindly, since there was actually no
>> review after initiating the thread.
>> > >> >>
>> > >> >> Thanks,
>> > >> >> Jungtaek Lim (HeartSaVioR)
>> > >> >>
>> > >> >> On Wed, Dec 12, 2018 at 11:12 PM, Vaclav Kosar wrote:
>> > >> >>>
>> > >> >>> I am also waiting for any finalization of my PR [3]. I seems
>> that SS PRs are not being reviewed much these days.
>> > >> >>>
>> > >> >>> [3] https://github.com/apache/spark/pull/21919
>> > >> >>>
>> > >> >>>
>> > >> >>> On 12. 12. 18 14:37, Dongjin Lee wrote:
>> > >> >>>
>> > >> >>> If it is possible, could you review my PR on Kafka's header
>> functionality[^1] also? It was added in Kafka 0.11.0.0 but still not
>> supported in Spark.
>> > >> >>>
>> > >> >>> Thanks,
>> > 

Re: Noisy spark-website notifications

2018-12-19 Thread Nicholas Chammas
I'd prefer it if we disabled all git notifications for spark-website. Folks
who want to stay on top of what's happening with the site can simply watch
the repo on GitHub <https://github.com/apache/spark-website>, no?

On Wed, Dec 19, 2018 at 10:00 PM Wenchen Fan  wrote:

> +1, at least it should only send one email when a PR is merged.
>
> On Thu, Dec 20, 2018 at 10:58 AM Nicholas Chammas <
> nicholas.cham...@gmail.com> wrote:
>
>> Can we somehow disable these new email alerts coming through for the
>> Spark website repo?
>>
>> On Wed, Dec 19, 2018 at 8:25 PM GitBox  wrote:
>>
>>> ueshin commented on a change in pull request #163: Announce the schedule
>>> of 2019 Spark+AI summit at SF
>>> URL:
>>> https://github.com/apache/spark-website/pull/163#discussion_r243130975
>>>
>>>
>>>
>>>  ##
>>>  File path: site/sitemap.xml
>>>  ##
>>>  @@ -139,657 +139,661 @@
>>>  
>>>  
>>>  
>>> -  https://spark.apache.org/releases/spark-release-2-4-0.html
>>> +  
>>> http://localhost:4000/news/spark-ai-summit-apr-2019-agenda-posted.html
>>> 
>>>
>>>  Review comment:
>>>Still remaining `localhost:4000` in this file.
>>>
>>> 
>>> This is an automated message from the Apache Git Service.
>>> To respond to the message, please log on GitHub and use the
>>> URL above to go to the specific comment.
>>>
>>> For queries about this service, please contact Infrastructure at:
>>> us...@infra.apache.org
>>>
>>>
>>> With regards,
>>> Apache Git Services
>>>
>>> -
>>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>>
>>>


Noisy spark-website notifications

2018-12-19 Thread Nicholas Chammas
Can we somehow disable these new email alerts coming through for the Spark
website repo?

On Wed, Dec 19, 2018 at 8:25 PM GitBox  wrote:

> ueshin commented on a change in pull request #163: Announce the schedule
> of 2019 Spark+AI summit at SF
> URL:
> https://github.com/apache/spark-website/pull/163#discussion_r243130975
>
>
>
>  ##
>  File path: site/sitemap.xml
>  ##
>  @@ -139,657 +139,661 @@
>  
>  
>  
> -  https://spark.apache.org/releases/spark-release-2-4-0.html
> +  
> http://localhost:4000/news/spark-ai-summit-apr-2019-agenda-posted.html
> 
>
>  Review comment:
>Still remaining `localhost:4000` in this file.
>
> 
> This is an automated message from the Apache Git Service.
> To respond to the message, please log on GitHub and use the
> URL above to go to the specific comment.
>
> For queries about this service, please contact Infrastructure at:
> us...@infra.apache.org
>
>
> With regards,
> Apache Git Services
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


Re: Documentation of boolean column operators missing?

2018-10-23 Thread Nicholas Chammas
On Tue, 23 Oct 2018 at 21:32, Sean Owen  wrote:
>
>> The comments say that it is not possible to overload 'and' and 'or',
>> which would have been more natural.
>>
Yes, unfortunately, Python does not allow you to override and, or, or not.
They are not implemented as “dunder” methods (e.g. __add__()) and they
implement special short-circuiting logic that’s not possible to reproduce
with a function call. I think we made the most practical choice in
overriding the bitwise operators.
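
A short illustration of the practical consequence:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(15,), (42,), (70,)], ["age"])

# `and` can't be overridden, so Python calls bool() on the Column and PySpark
# raises ValueError ("Cannot convert column into bool: please use '&' ..."):
# df.where(col("age") > 0 and col("age") < 65)

# The overloaded bitwise operators are the supported spelling; note the
# parentheses, since & and ~ bind more tightly than the comparisons:
df.where((col("age") > 0) & ~(col("age") > 65)).show()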

In any case, I’ll file a JIRA ticket about this, and maybe also submit a PR
to close it, adding documentation about PySpark column boolean operators to
the programming guide.

Nick


Re: Documentation of boolean column operators missing?

2018-10-23 Thread Nicholas Chammas
Also, to clarify something for folks who don't work with PySpark: The
boolean column operators in PySpark are completely different from those in
Scala, and non-obvious to boot (since they overload Python's _bitwise_
operators). So their apparent absence from the docs is surprising.

On Tue, Oct 23, 2018 at 3:02 PM Nicholas Chammas 
wrote:

> So it appears then that the equivalent operators for PySpark are
> completely missing from the docs, right? That’s surprising. And if there
> are column function equivalents for |, &, and ~, then I can’t find those
> either for PySpark. Indeed, I don’t think such a thing is possible in
> PySpark. (e.g. (col('age') > 0).and(...))
>
> I can file a ticket about this, but I’m just making sure I’m not missing
> something obvious.
>
> On Tue, Oct 23, 2018 at 2:50 PM Sean Owen  wrote:
>
>> Those should all be Column functions, really, and I see them at
>> http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.Column
>>
>> On Tue, Oct 23, 2018, 12:27 PM Nicholas Chammas <
>> nicholas.cham...@gmail.com> wrote:
>>
>>> I can’t seem to find any documentation of the &, |, and ~ operators for
>>> PySpark DataFrame columns. I assume that should be in our docs somewhere.
>>>
>>> Was it always missing? Am I just missing something obvious?
>>>
>>> Nick
>>>
>>


Re: Documentation of boolean column operators missing?

2018-10-23 Thread Nicholas Chammas
So it appears then that the equivalent operators for PySpark are completely
missing from the docs, right? That’s surprising. And if there are column
function equivalents for |, &, and ~, then I can’t find those either for
PySpark. Indeed, I don’t think such a thing is possible in PySpark.
(e.g. (col('age') > 0).and(...))

I can file a ticket about this, but I’m just making sure I’m not missing
something obvious.

On Tue, Oct 23, 2018 at 2:50 PM Sean Owen  wrote:

> Those should all be Column functions, really, and I see them at
> http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.Column
>
> On Tue, Oct 23, 2018, 12:27 PM Nicholas Chammas <
> nicholas.cham...@gmail.com> wrote:
>
>> I can’t seem to find any documentation of the &, |, and ~ operators for
>> PySpark DataFrame columns. I assume that should be in our docs somewhere.
>>
>> Was it always missing? Am I just missing something obvious?
>>
>> Nick
>>
>


Re: Documentation of boolean column operators missing?

2018-10-23 Thread Nicholas Chammas
Nope, that’s different. I’m talking about the operators on DataFrame
columns in PySpark, not SQL functions.

For example:

(df
.where(~col('is_exiled') & (col('age') > 60))
.show()
)


On Tue, Oct 23, 2018 at 1:48 PM Xiao Li  wrote:

> They are documented at the link below
>
> https://spark.apache.org/docs/2.3.0/api/sql/index.html
>
>
>
> On Tue, Oct 23, 2018 at 10:27 AM Nicholas Chammas <
> nicholas.cham...@gmail.com> wrote:
>
>> I can’t seem to find any documentation of the &, |, and ~ operators for
>> PySpark DataFrame columns. I assume that should be in our docs somewhere.
>>
>> Was it always missing? Am I just missing something obvious?
>>
>> Nick
>>
>
>
> --
> [image: Spark+AI Summit North America 2019]
> <http://t.sidekickopen24.com/s1t/c/5/f18dQhb0S7lM8dDMPbW2n0x6l2B9nMJN7t5X-FfhMynN2z8MDjQsyTKW56dzQQ1-_gV6102?t=https%3A%2F%2Fdatabricks.com%2Fsparkaisummit%2Fnorth-america=undefined=406b8c9a-b648-4923-9ed1-9a51ffe213fa>
>


Documentation of boolean column operators missing?

2018-10-23 Thread Nicholas Chammas
I can’t seem to find any documentation of the &, |, and ~ operators for
PySpark DataFrame columns. I assume that should be in our docs somewhere.

Was it always missing? Am I just missing something obvious?

Nick


Re: [VOTE] SPARK 2.4.0 (RC3)

2018-10-10 Thread Nicholas Chammas
FYI I believe we have an open correctness issue here:

https://issues.apache.org/jira/browse/SPARK-25150

However, it needs review by another person to confirm whether it is indeed
a correctness issue (and whether it still impacts this latest RC).

Nick

On Wed, Oct 10, 2018 at 3:14 PM, Jean Georges Perrin wrote:

> Awesome - thanks Dongjoon!
>
>
> On Oct 10, 2018, at 2:36 PM, Dongjoon Hyun 
> wrote:
>
> For now, you can see generated release notes. Official one will be posted
> on the website when the official 2.4.0 is out.
>
>
> https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12315420=12342385
>
> Bests,
> Dongjoon.
>
>
> On Wed, Oct 10, 2018 at 11:29 AM Jean Georges Perrin  wrote:
>
>> Hi,
>>
>> Sorry if it's stupid question, but where can I find the release notes of
>> 2.4.0?
>>
>> jg
>>
>> On Oct 10, 2018, at 2:00 PM, Imran Rashid 
>> wrote:
>>
>> Sorry I had messed up my testing earlier, so I only just discovered
>> https://issues.apache.org/jira/browse/SPARK-25704
>>
>> I don't think this is a release blocker, because it's not a regression and
>> there is a workaround, just fyi.
>>
>> On Wed, Oct 10, 2018 at 11:47 AM Wenchen Fan  wrote:
>>
>>> Please vote on releasing the following candidate as Apache Spark version
>>> 2.4.0.
>>>
>>> The vote is open until October 1 PST and passes if a majority +1 PMC
>>> votes are cast, with
>>> a minimum of 3 +1 votes.
>>>
>>> [ ] +1 Release this package as Apache Spark 2.4.0
>>> [ ] -1 Do not release this package because ...
>>>
>>> To learn more about Apache Spark, please see http://spark.apache.org/
>>>
>>> The tag to be voted on is v2.4.0-rc3 (commit
>>> 8e4a99bd201b9204fec52580f19ae70a229ed94e):
>>> https://github.com/apache/spark/tree/v2.4.0-rc3
>>>
>>> The release files, including signatures, digests, etc. can be found at:
>>> https://dist.apache.org/repos/dist/dev/spark/v2.4.0-rc3-bin/
>>>
>>> Signatures used for Spark RCs can be found in this file:
>>> https://dist.apache.org/repos/dist/dev/spark/KEYS
>>>
>>> The staging repository for this release can be found at:
>>> https://repository.apache.org/content/repositories/orgapachespark-1289
>>>
>>> The documentation corresponding to this release can be found at:
>>> https://dist.apache.org/repos/dist/dev/spark/v2.4.0-rc3-docs/
>>>
>>> The list of bug fixes going into 2.4.0 can be found at the following URL:
>>> https://issues.apache.org/jira/projects/SPARK/versions/12342385
>>>
>>> FAQ
>>>
>>> =
>>> How can I help test this release?
>>> =
>>>
>>> If you are a Spark user, you can help us test this release by taking
>>> an existing Spark workload and running on this release candidate, then
>>> reporting any regressions.
>>>
>>> If you're working in PySpark you can set up a virtual env and install
>>> the current RC and see if anything important breaks, in the Java/Scala
>>> you can add the staging repository to your projects resolvers and test
>>> with the RC (make sure to clean up the artifact cache before/after so
>>> you don't end up building with an out-of-date RC going forward).
>>>
>>> ===
>>> What should happen to JIRA tickets still targeting 2.4.0?
>>> ===
>>>
>>> The current list of open tickets targeted at 2.4.0 can be found at:
>>> https://issues.apache.org/jira/projects/SPARK and search for "Target
>>> Version/s" = 2.4.0
>>>
>>> Committers should look at those and triage. Extremely important bug
>>> fixes, documentation, and API tweaks that impact compatibility should
>>> be worked on immediately. Everything else please retarget to an
>>> appropriate release.
>>>
>>> ==
>>> But my bug isn't fixed?
>>> ==
>>>
>>> In order to make timely releases, we will typically not hold the
>>> release unless the bug in question is a regression from the previous
>>> release. That being said, if there is something which is a regression
>>> that has not been correctly targeted please ping me or a committer to
>>> help target the issue.
>>>
>>
>>
>


Re: [VOTE] SPARK 2.4.0 (RC1)

2018-09-17 Thread Nicholas Chammas
I believe -1 votes are merited only for correctness bugs and regressions
since the previous release.

Does SPARK-23200 count as either?

On Mon, Sep 17, 2018 at 9:40 AM, Stavros Kontopoulos <stavros.kontopou...@lightbend.com> wrote:

> -1
>
> I would like to see: https://github.com/apache/spark/pull/22392 in, as
> discussed here: https://issues.apache.org/jira/browse/SPARK-23200. It is
> important IMHO for streaming on K8s.
> I just started testing it btw.
>
> Also 2.12.7 (https://contributors.scala-lang.org/t/2-12-7-release/2301,
> https://github.com/scala/scala/milestone/73) is coming out (will be staged
> this week); do we want to build the beta 2.12 build against it?
>
> Stavros
>
> On Mon, Sep 17, 2018 at 8:00 AM, Wenchen Fan  wrote:
>
>> I confirmed that
>> https://repository.apache.org/content/repositories/orgapachespark-1285
>> is not accessible. I did it via ./dev/create-release/do-release-docker.sh
>> -d /my/work/dir -s publish , not sure what's going wrong. I didn't see
>> any error message during it.
>>
>> Any insights are appreciated! So that I can fix it in the next RC. Thanks!
>>
>> On Mon, Sep 17, 2018 at 11:31 AM Sean Owen  wrote:
>>
>>> I think one build is enough, but haven't thought it through. The
>>> Hadoop 2.6/2.7 builds are already nearly redundant. 2.12 is probably
>>> best advertised as a 'beta'. So maybe publish a no-hadoop build of it?
>>> Really, whatever's the easy thing to do.
>>> On Sun, Sep 16, 2018 at 10:28 PM Wenchen Fan 
>>> wrote:
>>> >
>>> > Ah I missed the Scala 2.12 build. Do you mean we should publish a
>>> Scala 2.12 build this time? Currently for Scala 2.11 we have 3 builds: with
>>> hadoop 2.7, with hadoop 2.6, without hadoop. Shall we do the same thing for
>>> Scala 2.12?
>>> >
>>> > On Mon, Sep 17, 2018 at 11:14 AM Sean Owen  wrote:
>>> >>
>>> >> A few preliminary notes:
>>> >>
>>> >> Wenchen for some weird reason when I hit your key in gpg --import, it
>>> >> asks for a passphrase. When I skip it, it's fine, gpg can still verify
>>> >> the signature. No issue there really.
>>> >>
>>> >> The staging repo gives a 404:
>>> >>
>>> https://repository.apache.org/content/repositories/orgapachespark-1285/
>>> >> 404 - Repository "orgapachespark-1285 (staging: open)"
>>> >> [id=orgapachespark-1285] exists but is not exposed.
>>> >>
>>> >> The (revamped) licenses are OK, though there are some minor glitches
>>> >> in the final release tarballs (my fault) : there's an extra directory,
>>> >> and the source release has both binary and source licenses. I'll fix
>>> >> that. Not strictly necessary to reject the release over those.
>>> >>
>>> >> Last, when I check the staging repo I'll get my answer, but, were you
>>> >> able to build 2.12 artifacts as well?
>>> >>
>>> >> On Sun, Sep 16, 2018 at 9:48 PM Wenchen Fan 
>>> wrote:
>>> >> >
>>> >> > Please vote on releasing the following candidate as Apache Spark
>>> version 2.4.0.
>>> >> >
>>> >> > The vote is open until September 20 PST and passes if a majority +1
>>> PMC votes are cast, with
>>> >> > a minimum of 3 +1 votes.
>>> >> >
>>> >> > [ ] +1 Release this package as Apache Spark 2.4.0
>>> >> > [ ] -1 Do not release this package because ...
>>> >> >
>>> >> > To learn more about Apache Spark, please see
>>> http://spark.apache.org/
>>> >> >
>>> >> > The tag to be voted on is v2.4.0-rc1 (commit
>>> 1220ab8a0738b5f67dc522df5e3e77ffc83d207a):
>>> >> > https://github.com/apache/spark/tree/v2.4.0-rc1
>>> >> >
>>> >> > The release files, including signatures, digests, etc. can be found
>>> at:
>>> >> > https://dist.apache.org/repos/dist/dev/spark/v2.4.0-rc1-bin/
>>> >> >
>>> >> > Signatures used for Spark RCs can be found in this file:
>>> >> > https://dist.apache.org/repos/dist/dev/spark/KEYS
>>> >> >
>>> >> > The staging repository for this release can be found at:
>>> >> >
>>> https://repository.apache.org/content/repositories/orgapachespark-1285/
>>> >> >
>>> >> > The documentation corresponding to this release can be found at:
>>> >> > https://dist.apache.org/repos/dist/dev/spark/v2.4.0-rc1-docs/
>>> >> >
>>> >> > The list of bug fixes going into 2.4.0 can be found at the
>>> following URL:
>>> >> > https://issues.apache.org/jira/projects/SPARK/versions/2.4.0
>>> >> >
>>> >> > FAQ
>>> >> >
>>> >> > =
>>> >> > How can I help test this release?
>>> >> > =
>>> >> >
>>> >> > If you are a Spark user, you can help us test this release by taking
>>> >> > an existing Spark workload and running on this release candidate,
>>> then
>>> >> > reporting any regressions.
>>> >> >
>>> >> > If you're working in PySpark you can set up a virtual env and
>>> install
>>> >> > the current RC and see if anything important breaks, in the
>>> Java/Scala
>>> >> > you can add the staging repository to your projects resolvers and
>>> test
>>> >> > with the RC (make sure to clean up the artifact cache before/after
>>> so
>>> >> > you don't end up building with an out-of-date RC going forward).
>>> >> >
>>> >> > 

Re: Should python-2 be supported in Spark 3.0?

2018-09-15 Thread Nicholas Chammas
As Reynold pointed out, we don't have to drop Python 2 support right off
the bat. We can just deprecate it with Spark 3.0, which would allow us to
actually drop it at a later 3.x release.

On Sat, Sep 15, 2018 at 2:09 PM Erik Erlandson  wrote:

> On a separate dev@spark thread, I raised a question of whether or not to
> support python 2 in Apache Spark, going forward into Spark 3.0.
>
> Python-2 is going EOL  at
> the end of 2019. The upcoming release of Spark 3.0 is an opportunity to
> make breaking changes to Spark's APIs, and so it is a good time to consider
> support for Python-2 on PySpark.
>
> Key advantages to dropping Python 2 are:
>
>- Support for PySpark becomes significantly easier.
>- Avoid having to support Python 2 until Spark 4.0, which is likely to
>imply supporting Python 2 for some time after it goes EOL.
>
> (Note that supporting python 2 after EOL means, among other things, that
> PySpark would be supporting a version of python that was no longer
> receiving security patches)
>
> The main disadvantage is that PySpark users who have legacy python-2 code
> would have to migrate their code to python 3 to take advantage of Spark 3.0
>
> This decision obviously has large implications for the Apache Spark
> community and we want to solicit community feedback.
>
>


Re: Python friendly API for Spark 3.0

2018-09-14 Thread Nicholas Chammas
Do we need to ditch Python 2 support to provide type hints? I don’t think
so.

Python lets you specify typing stubs that provide the same benefit without
forcing Python 3.
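
Besides stub files, PEP 484 type comments are another way to carry the hints
while remaining valid Python 2 syntax; a tiny sketch (the function is only
illustrative):

def records_per_partition(num_records, num_partitions):
    # type: (int, int) -> int
    # The comment above is read by mypy, while the file still parses and runs
    # unchanged under Python 2.
    return num_records // num_partitions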

On Fri, Sep 14, 2018 at 8:01 PM, Holden Karau wrote:

>
>
> On Fri, Sep 14, 2018, 3:26 PM Erik Erlandson  wrote:
>
>> To be clear, is this about "python-friendly API" or "friendly python API"
>> ?
>>
> Well what would you consider to be different between those two statements?
> I think it would be good to be a bit more explicit, but I don't think we
> should necessarily limit ourselves.
>
>>
>> On the python side, it might be nice to take advantage of static typing.
>> Requires python 3.6 but with python 2 going EOL, a spark-3.0 might be a
>> good opportunity to jump the python-3-only train.
>>
> I think we can make types sort of work without ditching 2 (the types only
> would work in 3 but it would still function in 2). Ditching 2 entirely
> would be a big thing to consider, I honestly hadn't been considering that
> but it could be from just spending so much time maintaining a 2/3 code
> base. I'd suggest reaching out to user@ before making that kind of
> change.
>
>>
>> On Fri, Sep 14, 2018 at 12:15 PM, Holden Karau 
>> wrote:
>>
>>> Since we're talking about Spark 3.0 in the near future (and since some
>>> recent conversation on a proposed change reminded me) I wanted to open up
>>> the floor and see if folks have any ideas on how we could make a more
>>> Python friendly API for 3.0? I'm planning on taking some time to look at
>>> other systems in the solution space and see what we might want to learn
>>> from them but I'd love to hear what other folks are thinking too.
>>>
>>> --
>>> Twitter: https://twitter.com/holdenkarau
>>> Books (Learning Spark, High Performance Spark, etc.):
>>> https://amzn.to/2MaRAG9  
>>> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>>>
>>
>>

