Re: Validate spark sql

2023-12-24 Thread Nicholas Chammas
This is a user-list question, not a dev-list question. Moving this conversation 
to the user list and BCC-ing the dev list.

Also, this statement

> We are not validating against table or column existence.

is not correct. When you call spark.sql(…), Spark will look up the table 
references and fail with TABLE_OR_VIEW_NOT_FOUND if it cannot find them.

Also, when you run DDL via spark.sql(…), Spark will actually run it. So 
spark.sql(“drop table my_table”) will actually drop my_table. It’s not a 
validation-only operation.
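
Both behaviors are easy to see against a throwaway local session (a minimal 
sketch; the table names are made up):

from pyspark.sql import SparkSession
from pyspark.sql.utils import AnalysisException

spark = SparkSession.builder.appName("resolution-check").getOrCreate()

try:
    # The statement parses, but analysis fails because no such table exists.
    spark.sql("SELECT id FROM table_that_does_not_exist")
except AnalysisException as e:
    print(e)  # expect a TABLE_OR_VIEW_NOT_FOUND error here

# DDL is executed, not merely checked:
spark.sql("CREATE TABLE my_table (id INT) USING parquet")
spark.sql("DROP TABLE my_table")  # my_table really is gone afterwards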

This question of validating SQL is already discussed on Stack Overflow. You 
may find some useful tips there.

Nick


> On Dec 24, 2023, at 4:52 AM, Mich Talebzadeh  
> wrote:
> 
>   
> Yes, you can validate the syntax of your PySpark SQL queries without 
> connecting to an actual dataset or running the queries on a cluster.
> PySpark provides a method for syntax validation without executing the query. 
> Something like below
>       ____              __
>      / __/__  ___ _____/ /__
>     _\ \/ _ \/ _ `/ __/  '_/
>    /__ / .__/\_,_/_/ /_/\_\   version 3.4.0
>       /_/
> 
> Using Python version 3.9.16 (main, Apr 24 2023 10:36:11)
> Spark context Web UI available at http://rhes75:4040 
> Spark context available as 'sc' (master = local[*], app id = 
> local-1703410019374).
> SparkSession available as 'spark'.
> >>> from pyspark.sql import SparkSession
> >>> spark = SparkSession.builder.appName("validate").getOrCreate()
> 23/12/24 09:28:02 WARN SparkSession: Using an existing Spark session; only 
> runtime SQL configurations will take effect.
> >>> sql = "SELECT * FROM  WHERE  = some value"
> >>> try:
> ...   spark.sql(sql)
> ...   print("is working")
> ... except Exception as e:
> ...   print(f"Syntax error: {e}")
> ...
> Syntax error:
> [PARSE_SYNTAX_ERROR] Syntax error at or near '<'.(line 1, pos 14)
> 
> == SQL ==
> SELECT * FROM  WHERE  = some value
> --^^^
> 
> Here we only check for syntax errors, not query semantics. We are not 
> validating against table or column existence.
> 
> This method is useful when you want to catch obvious syntax errors before 
> submitting your PySpark job to a cluster, especially when you don't have 
> access to the actual data.
> In summary:
> - This method validates syntax but will not catch semantic errors.
> - If you need more comprehensive validation, consider using a testing 
>   framework and a small dataset.
> - For complex queries, using a linter or code analysis tool can help identify 
>   potential issues.
> HTH
> 
> Mich Talebzadeh,
> Dad | Technologist | Solutions Architect | Engineer
> London
> United Kingdom
> 
>view my Linkedin profile 
> 
> 
>  https://en.everybodywiki.com/Mich_Talebzadeh
> 
>  
> Disclaimer: Use it at your own risk. Any and all responsibility for any loss, 
> damage or destruction of data or any other property which may arise from 
> relying on this email's technical content is explicitly disclaimed. The 
> author will in no case be liable for any monetary damages arising from such 
> loss, damage or destruction.
>  
> 
> 
> On Sun, 24 Dec 2023 at 07:57, ram manickam  > wrote:
>> Hello,
>> Is there a way to validate PySpark SQL for syntax errors only? I cannot 
>> connect to the actual data set to perform this validation. Any help 
>> would be appreciated.
>> 
>> 
>> Thanks
>> Ram



[jira] [Created] (SPARK-46449) Add ability to create databases via Catalog API

2023-12-18 Thread Nicholas Chammas (Jira)
Nicholas Chammas created SPARK-46449:


 Summary: Add ability to create databases via Catalog API
 Key: SPARK-46449
 URL: https://issues.apache.org/jira/browse/SPARK-46449
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.5.0
Reporter: Nicholas Chammas


As of Spark 3.5, the only way to create a database is via SQL. The Catalog API 
should offer an equivalent.

Perhaps something like:
{code:python}
spark.catalog.createDatabase(
name: str,
existsOk: bool = False,
comment: str = None,
location: str = None,
properties: dict = None,
)
{code}
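
For comparison, the only route available today goes through a SQL string (a 
minimal sketch):

{code:python}
spark.sql("CREATE DATABASE IF NOT EXISTS my_db COMMENT 'example' LOCATION '/tmp/my_db'")
spark.catalog.setCurrentDatabase("my_db")
{code}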



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-46437) Remove unnecessary cruft from SQL built-in functions docs

2023-12-17 Thread Nicholas Chammas (Jira)
Nicholas Chammas created SPARK-46437:


 Summary: Remove unnecessary cruft from SQL built-in functions docs
 Key: SPARK-46437
 URL: https://issues.apache.org/jira/browse/SPARK-46437
 Project: Spark
  Issue Type: Improvement
  Components: Documentation, SQL
Affects Versions: 3.5.0
Reporter: Nicholas Chammas






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



Guidance for filling out "Affects Version" on Jira

2023-12-17 Thread Nicholas Chammas
The Contributing guide  only 
mentions what to fill in for “Affects Version” for bugs. How about for 
improvements?

This question once caused some problems when I set “Affects Version” to the 
last released version, and that was interpreted as a request to backport an 
improvement, which was not my intention and caused a minor kerfuffle.

Could we provide some clarity on the Contributing guide on how to fill in this 
field for improvements, especially since “Affects Version” is required on all 
Jira tickets?

Nick



[jira] [Created] (SPARK-46395) Automatically generate SQL configuration tables for documentation

2023-12-13 Thread Nicholas Chammas (Jira)
Nicholas Chammas created SPARK-46395:


 Summary: Automatically generate SQL configuration tables for 
documentation
 Key: SPARK-46395
 URL: https://issues.apache.org/jira/browse/SPARK-46395
 Project: Spark
  Issue Type: Improvement
  Components: Documentation, SQL
Affects Versions: 3.5.0
Reporter: Nicholas Chammas






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



Re: When and how does Spark use metastore statistics?

2023-12-11 Thread Nicholas Chammas
Where exactly are you getting this information from?

As far as I can tell, spark.sql.cbo.enabled has defaulted to false since it was 
introduced 7 years ago 
<https://github.com/apache/spark/commit/ae83c211257c508989c703d54f2aeec8b2b5f14d#diff-9ed2b0b7829b91eafb43e040a15247c90384e42fea1046864199fbad77527bb5R649>.
 It has never been enabled by default.

And I cannot see mention of spark.sql.cbo.strategy anywhere at all in the code 
base.

So again, where is this information coming from? Please link directly to your 
source.
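
For reference, this is easy to check against a plain local session (a minimal 
sketch; the last line is expected to fail precisely because no such key is 
defined):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
print(spark.conf.get("spark.sql.cbo.enabled"))       # 'false' unless you have enabled it
print(spark.conf.get("spark.sql.adaptive.enabled"))  # 'true' by default on recent releases
spark.conf.get("spark.sql.cbo.strategy")             # expected to raise: no such config key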



> On Dec 11, 2023, at 5:45 PM, Mich Talebzadeh  
> wrote:
> 
> You are right. By default CBO is not enabled. Whilst the CBO was the default 
> optimizer in earlier versions of Spark, it has been replaced by the AQE in 
> recent releases.
> 
> spark.sql.cbo.strategy
> 
> As I understand, the spark.sql.cbo.strategy configuration property specifies 
> the optimizer strategy used by Spark SQL to generate query execution plans. 
> There are two main optimizer strategies available:
> CBO (Cost-Based Optimization): The default optimizer strategy, which analyzes 
> the query plan and estimates the execution costs associated with each 
> operation. It uses statistics to guide its decisions, selecting the plan with 
> the lowest estimated cost.
> 
> CBO-Like (Cost-Based Optimization-Like): A simplified optimizer strategy that 
> mimics some of the CBO's logic, but without the ability to estimate costs. 
> This strategy is faster than CBO for simple queries, but may not produce the 
> most efficient plan for complex queries.
> 
> The spark.sql.cbo.strategy property can be set to either CBO or CBO-Like. The 
> default value is AUTO, which means that Spark will automatically choose the 
> most appropriate strategy based on the complexity of the query and 
> availability of statistics.
> 
> 
> 
> Mich Talebzadeh,
> Distinguished Technologist, Solutions Architect & Engineer
> London
> United Kingdom
> 
>view my Linkedin profile 
> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
> 
>  https://en.everybodywiki.com/Mich_Talebzadeh
> 
>  
> Disclaimer: Use it at your own risk. Any and all responsibility for any loss, 
> damage or destruction of data or any other property which may arise from 
> relying on this email's technical content is explicitly disclaimed. The 
> author will in no case be liable for any monetary damages arising from such 
> loss, damage or destruction.
>  
> 
> 
> On Mon, 11 Dec 2023 at 17:11, Nicholas Chammas  <mailto:nicholas.cham...@gmail.com>> wrote:
>> 
>>> On Dec 11, 2023, at 6:40 AM, Mich Talebzadeh >> <mailto:mich.talebza...@gmail.com>> wrote:
>>> 
>>> By default, the CBO is enabled in Spark.
>> 
>> Note that this is not correct. AQE is enabled 
>> <https://github.com/apache/spark/blob/8235f1d56bf232bb713fe24ff6f2ffdaf49d2fcc/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala#L664-L669>
>>  by default, but CBO isn’t 
>> <https://github.com/apache/spark/blob/8235f1d56bf232bb713fe24ff6f2ffdaf49d2fcc/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala#L2694-L2699>.



Re: When and how does Spark use metastore statistics?

2023-12-11 Thread Nicholas Chammas

> On Dec 11, 2023, at 6:40 AM, Mich Talebzadeh  
> wrote:
> spark.sql.cbo.strategy: Set to AUTO to use the CBO as the default optimizer, 
> or NONE to disable it completely.
> 
Hmm, I’ve also never heard of this setting before and can’t seem to find it in 
the Spark docs or source code.

Re: When and how does Spark use metastore statistics?

2023-12-11 Thread Nicholas Chammas

> On Dec 11, 2023, at 6:40 AM, Mich Talebzadeh  
> wrote:
> 
> By default, the CBO is enabled in Spark.

Note that this is not correct. AQE is enabled 

 by default, but CBO isn’t 
.

[jira] [Commented] (SPARK-45599) Percentile can produce a wrong answer if -0.0 and 0.0 are mixed in the dataset

2023-12-10 Thread Nicholas Chammas (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-45599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17795162#comment-17795162
 ] 

Nicholas Chammas commented on SPARK-45599:
--

Per the [contributing guide|https://spark.apache.org/contributing.html], I 
suggest the {{correctness}} label instead of {{{}data-corruption{}}}.

> Percentile can produce a wrong answer if -0.0 and 0.0 are mixed in the dataset
> --
>
> Key: SPARK-45599
> URL: https://issues.apache.org/jira/browse/SPARK-45599
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.4.1, 1.6.3, 3.3.0, 3.2.3, 3.5.0
>Reporter: Robert Joseph Evans
>Priority: Critical
>  Labels: data-corruption
>
> I think this actually impacts all versions that have ever supported 
> percentile and it may impact other things because the bug is in OpenHashMap.
>  
> I am really surprised that we caught this bug because everything has to hit 
> just wrong to make it happen. in python/pyspark if you run
>  
> {code:python}
> from math import *
> from pyspark.sql.types import *
> data = [(1.779652973678931e+173,), (9.247723870123388e-295,), 
> (5.891823952773268e+98,), (inf,), (1.9042708096454302e+195,), 
> (-3.085825028509117e+74,), (-1.9569489404314425e+128,), 
> (2.0738138203216883e+201,), (inf,), (2.5212410617263588e-282,), 
> (-2.646144697462316e-35,), (-3.468683249247593e-196,), (nan,), (None,), 
> (nan,), (1.822129180806602e-245,), (5.211702553315461e-259,), (-1.0,), 
> (-5.682293414619055e+46,), (-4.585039307326895e+166,), 
> (-5.936844510098297e-82,), (-5234708055733.116,), (4920675036.053339,), 
> (None,), (4.4501477170144023e-308,), (2.176024662699802e-210,), 
> (-5.046677974902737e+132,), (-5.490780063080251e-09,), 
> (1.703824427218836e-55,), (-1.1961155424160076e+102,), 
> (1.4403274475565667e+41,), (None,), (5.4470705929955455e-86,), 
> (5.120795466142678e-215,), (-9.01991342808203e+282,), 
> (4.051866849943636e-254,), (-3588518231990.927,), (-1.8891559842111865e+63,), 
> (3.4543959813437507e-304,), (-7.590734560275502e-63,), 
> (9.376528689861087e+117,), (-2.1696969883753554e-292,), 
> (7.227411393136537e+206,), (-2.428999624265911e-293,), 
> (5.741383583382542e-14,), (-1.4882040107841963e+286,), 
> (2.1973064836362255e-159,), (0.028096279323357867,), 
> (8.475809563703283e-64,), (3.002803065141241e-139,), 
> (-1.1041009815645263e+203,), (1.8461539468514548e-225,), 
> (-5.620339412794757e-251,), (3.5103766991437114e-60,), 
> (2.4925669515657655e+165,), (3.217759099462207e+108,), 
> (-8.796717685143486e+203,), (2.037360925124577e+292,), 
> (-6.542279108216022e+206,), (-7.951172614280046e-74,), 
> (6.226527569272003e+152,), (-5.673977270111637e-84,), 
> (-1.0186016078084965e-281,), (1.7976931348623157e+308,), 
> (4.205809391029644e+137,), (-9.871721037428167e+119,), (None,), 
> (-1.6663254121185628e-256,), (1.0075153091760986e-236,), (-0.0,), (0.0,), 
> (1.7976931348623157e+308,), (4.3214483342777574e-117,), 
> (-7.973642629411105e-89,), (-1.1028137694801181e-297,), 
> (2.9000325280299273e-39,), (-1.077534929323113e-264,), 
> (-1.1847952892216515e+137,), (nan,), (7.849390806334983e+226,), 
> (-1.831402251805194e+65,), (-2.664533698035492e+203,), 
> (-2.2385155698231885e+285,), (-2.3016388448634844e-155,), 
> (-9.607772864590422e+217,), (3.437191836077251e+209,), 
> (1.9846569552093057e-137,), (-3.010452936419635e-233,), 
> (1.4309793775440402e-87,), (-2.9383643865423363e-103,), 
> (-4.696878567317712e-162,), (8.391630779050713e-135,), (nan,), 
> (-3.3885098786542755e-128,), (-4.5154178008513483e-122,), (nan,), (nan,), 
> (2.187766760184779e+306,), (7.679268835670585e+223,), 
> (6.3131466321042515e+153,), (1.779652973678931e+173,), 
> (9.247723870123388e-295,), (5.891823952773268e+98,), (inf,), 
> (1.9042708096454302e+195,), (-3.085825028509117e+74,), 
> (-1.9569489404314425e+128,), (2.0738138203216883e+201,), (inf,), 
> (2.5212410617263588e-282,), (-2.646144697462316e-35,), 
> (-3.468683249247593e-196,), (nan,), (None,), (nan,), 
> (1.822129180806602e-245,), (5.211702553315461e-259,), (-1.0,), 
> (-5.682293414619055e+46,), (-4.585039307326895e+166,), 
> (-5.936844510098297e-82,), (-5234708055733.116,), (4920675036.053339,), 
> (None,), (4.4501477170144023e-308,), (2.176024662699802e-210,), 
> (-5.046677974902737e+132,), (-5.490780063080251e-09,), 
> (1.703824427218836e-55,), (-1.1961155424160076e+102,), 
> (1.4403274475565667e+41,), (None,), (5.4470705929955455e-86,), 
> (5.120795466142678e-215,), (-9.01991342808203e+282,), 
> (4.051866849943636e-254,), (-358851823199

[jira] [Created] (SPARK-46357) Replace use of setConf with conf.set in docs

2023-12-10 Thread Nicholas Chammas (Jira)
Nicholas Chammas created SPARK-46357:


 Summary: Replace use of setConf with conf.set in docs
 Key: SPARK-46357
 URL: https://issues.apache.org/jira/browse/SPARK-46357
 Project: Spark
  Issue Type: Improvement
  Components: Documentation, SQL
Affects Versions: 3.5.0
Reporter: Nicholas Chammas
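
The direction of the change, roughly (a sketch, not a list of the affected doc 
pages):

{code:python}
# Preferred style for runtime SQL configs, which the docs would standardize on:
spark.conf.set("spark.sql.shuffle.partitions", "20")
spark.conf.get("spark.sql.shuffle.partitions")
{code}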






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



Re: When and how does Spark use metastore statistics?

2023-12-10 Thread Nicholas Chammas
I’ve done some reading and have a slightly better understanding of statistics 
now.

Every implementation of LeafNode.computeStats 
<https://github.com/apache/spark/blob/7cea52c96f5be1bc565a033bfd77370ab5527a35/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/LogicalPlan.scala#L210>
 offers its own way to get statistics:

LocalRelation 
<https://github.com/apache/spark/blob/8ff6b7a04cbaef9c552789ad5550ceab760cb078/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/LocalRelation.scala#L97>
 estimates the size of the relation directly from the row count.
HiveTableRelation 
<https://github.com/apache/spark/blob/8e95929ac4238d02dca379837ccf2fbc1cd1926d/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/interface.scala#L923-L929>
 pulls those statistics from the catalog or metastore.
DataSourceV2Relation 
<https://github.com/apache/spark/blob/5fec76dc8db2499b0a9d76231f9a250871d59658/sql/catalyst/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DataSourceV2Relation.scala#L100>
 delegates the job of computing statistics to the underlying data source.
There are a lot of details I’m still fuzzy on, but I think that’s the gist of 
things.

Would it make sense to add a paragraph or two to the SQL performance tuning 
page <https://spark.apache.org/docs/latest/sql-performance-tuning.html> 
covering statistics at a high level? Something that briefly explains:

- what statistics are and how Spark uses them to optimize plans
- the various ways Spark computes or loads statistics (catalog, data source, 
  runtime, etc.)
- how to gather catalog statistics (i.e. pointer to ANALYZE TABLE)
- how to check statistics on an object (i.e. DESCRIBE EXTENDED) and as part of 
  an optimized plan (i.e. .explain(mode="cost"))
- what the cost-based optimizer does and how to enable it
Would this be a welcome addition to the project’s documentation? I’m happy to 
work on this.
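
For concreteness, these are the commands such a section would point at (a 
sketch of things already mentioned above, with a made-up table and column 
name):

spark.sql("ANALYZE TABLE my_table COMPUTE STATISTICS FOR ALL COLUMNS")
spark.sql("DESCRIBE EXTENDED my_table").show(truncate=False)        # look for the Statistics row
spark.table("my_table").groupBy("some_col").count().explain(mode="cost")
spark.conf.set("spark.sql.cbo.enabled", "true")                     # CBO stays off unless you opt in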



> On Dec 5, 2023, at 12:12 PM, Nicholas Chammas  
> wrote:
> 
> I’m interested in improving some of the documentation relating to the table 
> and column statistics that get stored in the metastore, and how Spark uses 
> them.
> 
> But I’m not clear on a few things, so I’m writing to you with some questions.
> 
> 1. The documentation for spark.sql.autoBroadcastJoinThreshold 
> <https://spark.apache.org/docs/latest/sql-performance-tuning.html> implies 
> that it depends on table statistics to work, but it’s not clear. Is it 
> accurate to say that unless you have run ANALYZE on the tables participating 
> in a join, spark.sql.autoBroadcastJoinThreshold cannot impact the execution 
> plan?
> 
> 2. As a follow-on to the above question, the adaptive version of 
> autoBroadcastJoinThreshold, namely 
> spark.sql.adaptive.autoBroadcastJoinThreshold, may still kick in, because it 
> depends only on runtime statistics and not statistics in the metastore. Is 
> that correct? I am assuming that “runtime statistics” are gathered on the fly 
> by Spark, but I would like to mention this in the docs briefly somewhere.
> 
> 3. The documentation for spark.sql.inMemoryColumnarStorage.compressed 
> <https://spark.apache.org/docs/latest/sql-performance-tuning.html> mentions 
> “statistics”, but it’s not clear what kind of statistics we’re talking about. 
> Are those runtime statistics, metastore statistics (that depend on you 
> running ANALYZE), or something else?
> 
> 4. The documentation for ANALYZE TABLE 
> <https://spark.apache.org/docs/latest/sql-ref-syntax-aux-analyze-table.html> 
> states that the collected statistics help the optimizer "find a better query 
> execution plan”. I wish we could link to something from here with more 
> explanation. Currently, spark.sql.autoBroadcastJoinThreshold is the only 
> place where metastore statistics are explicitly referenced as impacting the 
> execution plan. Surely there must be other places, no? Would it be 
> appropriate to mention the cost-based optimizer framework 
> <https://issues.apache.org/jira/browse/SPARK-16026> somehow? It doesn’t 
> appear to have any public documentation outside of Jira.
> 
> Any pointers or information you can provide would be very helpful. Again, I 
> am interested in contributing some documentation improvements relating to 
> statistics, but there is a lot I’m not sure about.
> 
> Nick
> 



Re: Algolia search on website is broken

2023-12-10 Thread Nicholas Chammas
Pinging Gengliang and Xiao about this, per these docs 
<https://github.com/apache/spark-website/blob/0ceaaaf528ec1d0201e1eab1288f37cce607268b/release-process.md#update-the-configuration-of-algolia-crawler>.

It looks like to fix this problem you need access to the Algolia Crawler Admin 
Console.


> On Dec 5, 2023, at 11:28 AM, Nicholas Chammas  
> wrote:
> 
> Should I report this instead on Jira? Apologies if the dev list is not the 
> right place.
> 
> Search on the website appears to be broken. For example, here is a search for 
> “analyze”:
> 

> 
> And here is the same search using DDG 
> <https://duckduckgo.com/?q=site:https://spark.apache.org/docs/latest/+analyze=osx=web>.
> 
> Nick
> 



Re: SSH Tunneling issue with Apache Spark

2023-12-06 Thread Nicholas Chammas
PyMySQL has its own implementation 
<https://github.com/PyMySQL/PyMySQL/blob/f13f054abcc18b39855a760a84be0a517f0da658/pymysql/protocol.py>
 of the MySQL client-server protocol. It does not use JDBC.


> On Dec 6, 2023, at 10:43 PM, Venkatesan Muniappan 
>  wrote:
> 
> Thanks for the advice Nicholas. 
> 
> As mentioned in the original email, I have tried JDBC + SSH Tunnel using 
> pymysql and sshtunnel and it worked fine. The problem happens only with Spark.
> 
> Thanks,
> Venkat
> 
> 
> 
> On Wed, Dec 6, 2023 at 10:21 PM Nicholas Chammas  <mailto:nicholas.cham...@gmail.com>> wrote:
>> This is not a question for the dev list. Moving dev to bcc.
>> 
>> One thing I would try is to connect to this database using JDBC + SSH 
>> tunnel, but without Spark. That way you can focus on getting the JDBC 
>> connection to work without Spark complicating the picture for you.
>> 
>> 
>>> On Dec 5, 2023, at 8:12 PM, Venkatesan Muniappan 
>>> mailto:venkatesa...@noonacademy.com>> wrote:
>>> 
>>> Hi Team,
>>> 
>>> I am facing an issue with SSH Tunneling in Apache Spark. The behavior is 
>>> same as the one in this Stackoverflow question 
>>> <https://stackoverflow.com/questions/68278369/how-to-use-pyspark-to-read-a-mysql-database-using-a-ssh-tunnel>
>>>  but there are no answers there.
>>> 
>>> This is what I am trying:
>>> 
>>> 
>>> with SSHTunnelForwarder(
>>> (ssh_host, ssh_port),
>>> ssh_username=ssh_user,
>>> ssh_pkey=ssh_key_file,
>>> remote_bind_address=(sql_hostname, sql_port),
>>> local_bind_address=(local_host_ip_address, sql_port)) as tunnel:
>>> tunnel.local_bind_port
>>> b1_semester_df = spark.read \
>>> .format("jdbc") \
>>> .option("url", b2b_mysql_url.replace("<>", 
>>> str(tunnel.local_bind_port))) \
>>> .option("query", b1_semester_sql) \
>>> .option("database", 'b2b') \
>>> .option("password", b2b_mysql_password) \
>>> .option("driver", "com.mysql.cj.jdbc.Driver") \
>>> .load()
>>> b1_semester_df.count()
>>> 
>>> Here, the b1_semester_df is loaded but when I try count on the same Df it 
>>> fails saying this
>>> 
>>> 23/12/05 11:49:17 ERROR TaskSetManager: Task 0 in stage 2.0 failed 4 times; 
>>> aborting job
>>> Traceback (most recent call last):
>>>   File "", line 1, in 
>>>   File "/usr/lib/spark/python/pyspark/sql/dataframe.py", line 382, in show
>>> print(self._jdf.showString(n, 20, vertical))
>>>   File 
>>> "/usr/lib/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 
>>> 1257, in __call__
>>>   File "/usr/lib/spark/python/pyspark/sql/utils.py", line 63, in deco
>>> return f(*a, **kw)
>>>   File "/usr/lib/spark/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", 
>>> line 328, in get_return_value
>>> py4j.protocol.Py4JJavaError: An error occurred while calling 
>>> o284.showString.
>>> : org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 
>>> in stage 2.0 failed 4 times, most recent failure: Lost task 0.3 in stage 
>>> 2.0 (TID 11, ip-172-32-108-1.eu-central-1.compute.internal, executor 3): 
>>> com.mysql.cj.jdbc.exceptions.CommunicationsException: Communications link 
>>> failure
>>> 
>>> However, the same is working fine with pandas df. I have tried this below 
>>> and it worked.
>>> 
>>> 
>>> with SSHTunnelForwarder(
>>> (ssh_host, ssh_port),
>>> ssh_username=ssh_user,
>>> ssh_pkey=ssh_key_file,
>>> remote_bind_address=(sql_hostname, sql_port)) as tunnel:
>>> conn = pymysql.connect(host=local_host_ip_address, user=sql_username,
>>>passwd=sql_password, db=sql_main_database,
>>>port=tunnel.local_bind_port)
>>> df = pd.read_sql_query(b1_semester_sql, conn)
>>> spark.createDataFrame(df).createOrReplaceTempView("b1_semester")
>>> 
>>> So wanted to check what I am missing with my Spark usage. Please help.
>>> 
>>> Thanks,
>>> Venkat
>>> 
>> 



Re: SSH Tunneling issue with Apache Spark

2023-12-06 Thread Nicholas Chammas
This is not a question for the dev list. Moving dev to bcc.

One thing I would try is to connect to this database using JDBC + SSH tunnel, 
but without Spark. That way you can focus on getting the JDBC connection to 
work without Spark complicating the picture for you.
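
For example, something along these lines (a sketch reusing the variables from 
your snippet below; jaydebeapi is just one convenient way to drive a JDBC 
driver from Python, and the jar path is a placeholder):

import jaydebeapi
from sshtunnel import SSHTunnelForwarder

with SSHTunnelForwarder(
        (ssh_host, ssh_port),
        ssh_username=ssh_user,
        ssh_pkey=ssh_key_file,
        remote_bind_address=(sql_hostname, sql_port)) as tunnel:
    conn = jaydebeapi.connect(
        "com.mysql.cj.jdbc.Driver",
        f"jdbc:mysql://127.0.0.1:{tunnel.local_bind_port}/b2b",
        [sql_username, sql_password],
        "/path/to/mysql-connector-j.jar")
    curs = conn.cursor()
    curs.execute("SELECT 1")
    print(curs.fetchall())

If that works end to end, the driver, the tunnel, and the credentials are all 
fine, and you can narrow the problem down to the Spark side of things.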


> On Dec 5, 2023, at 8:12 PM, Venkatesan Muniappan 
>  wrote:
> 
> Hi Team,
> 
> I am facing an issue with SSH Tunneling in Apache Spark. The behavior is same 
> as the one in this Stackoverflow question 
> 
>  but there are no answers there.
> 
> This is what I am trying:
> 
> 
> with SSHTunnelForwarder(
> (ssh_host, ssh_port),
> ssh_username=ssh_user,
> ssh_pkey=ssh_key_file,
> remote_bind_address=(sql_hostname, sql_port),
> local_bind_address=(local_host_ip_address, sql_port)) as tunnel:
> tunnel.local_bind_port
> b1_semester_df = spark.read \
> .format("jdbc") \
> .option("url", b2b_mysql_url.replace("<>", 
> str(tunnel.local_bind_port))) \
> .option("query", b1_semester_sql) \
> .option("database", 'b2b') \
> .option("password", b2b_mysql_password) \
> .option("driver", "com.mysql.cj.jdbc.Driver") \
> .load()
> b1_semester_df.count()
> 
> Here, the b1_semester_df is loaded but when I try count on the same Df it 
> fails saying this
> 
> 23/12/05 11:49:17 ERROR TaskSetManager: Task 0 in stage 2.0 failed 4 times; 
> aborting job
> Traceback (most recent call last):
>   File "", line 1, in 
>   File "/usr/lib/spark/python/pyspark/sql/dataframe.py", line 382, in show
> print(self._jdf.showString(n, 20, vertical))
>   File "/usr/lib/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", 
> line 1257, in __call__
>   File "/usr/lib/spark/python/pyspark/sql/utils.py", line 63, in deco
> return f(*a, **kw)
>   File "/usr/lib/spark/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", line 
> 328, in get_return_value
> py4j.protocol.Py4JJavaError: An error occurred while calling o284.showString.
> : org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 
> in stage 2.0 failed 4 times, most recent failure: Lost task 0.3 in stage 2.0 
> (TID 11, ip-172-32-108-1.eu-central-1.compute.internal, executor 3): 
> com.mysql.cj.jdbc.exceptions.CommunicationsException: Communications link 
> failure
> 
> However, the same is working fine with pandas df. I have tried this below and 
> it worked.
> 
> 
> with SSHTunnelForwarder(
> (ssh_host, ssh_port),
> ssh_username=ssh_user,
> ssh_pkey=ssh_key_file,
> remote_bind_address=(sql_hostname, sql_port)) as tunnel:
> conn = pymysql.connect(host=local_host_ip_address, user=sql_username,
>passwd=sql_password, db=sql_main_database,
>port=tunnel.local_bind_port)
> df = pd.read_sql_query(b1_semester_sql, conn)
> spark.createDataFrame(df).createOrReplaceTempView("b1_semester")
> 
> So wanted to check what I am missing with my Spark usage. Please help.
> 
> Thanks,
> Venkat
> 



When and how does Spark use metastore statistics?

2023-12-05 Thread Nicholas Chammas
I’m interested in improving some of the documentation relating to the table and 
column statistics that get stored in the metastore, and how Spark uses them.

But I’m not clear on a few things, so I’m writing to you with some questions.

1. The documentation for spark.sql.autoBroadcastJoinThreshold 
 implies that 
it depends on table statistics to work, but it’s not clear. Is it accurate to 
say that unless you have run ANALYZE on the tables participating in a join, 
spark.sql.autoBroadcastJoinThreshold cannot impact the execution plan?

2. As a follow-on to the above question, the adaptive version of 
autoBroadcastJoinThreshold, namely 
spark.sql.adaptive.autoBroadcastJoinThreshold, may still kick in, because it 
depends only on runtime statistics and not statistics in the metastore. Is that 
correct? I am assuming that “runtime statistics” are gathered on the fly by 
Spark, but I would like to mention this in the docs briefly somewhere.

3. The documentation for spark.sql.inMemoryColumnarStorage.compressed 
 mentions 
“statistics”, but it’s not clear what kind of statistics we’re talking about. 
Are those runtime statistics, metastore statistics (that depend on you running 
ANALYZE), or something else?

4. The documentation for ANALYZE TABLE 
 
states that the collected statistics help the optimizer "find a better query 
execution plan”. I wish we could link to something from here with more 
explanation. Currently, spark.sql.autoBroadcastJoinThreshold is the only place 
where metastore statistics are explicitly referenced as impacting the execution 
plan. Surely there must be other places, no? Would it be appropriate to mention 
the cost-based optimizer framework 
 somehow? It doesn’t appear 
to have any public documentation outside of Jira.

Any pointers or information you can provide would be very helpful. Again, I am 
interested in contributing some documentation improvements relating to 
statistics, but there is a lot I’m not sure about.
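
For what it's worth, here is a sketch of the kind of experiment that could 
answer question 1 empirically (made-up table names; I'm not asserting what the 
outcome is):

spark.sql("CREATE TABLE dim USING parquet AS SELECT id, CAST(id AS STRING) AS name FROM range(1000)")
spark.sql("CREATE TABLE fact USING parquet AS SELECT id FROM range(1000000)")

spark.table("fact").join(spark.table("dim"), "id").explain()  # plan before any catalog stats exist
spark.sql("ANALYZE TABLE dim COMPUTE STATISTICS")
spark.table("fact").join(spark.table("dim"), "id").explain()  # plan once dim has a catalog size/row count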

Nick



Algolia search on website is broken

2023-12-05 Thread Nicholas Chammas
Should I report this instead on Jira? Apologies if the dev list is not the 
right place.

Search on the website appears to be broken. For example, here is a search for 
“analyze”:



And here is the same search using DDG 
.

Nick



[jira] [Commented] (SPARK-37571) decouple amplab jenkins from spark website, builds and tests

2023-12-05 Thread Nicholas Chammas (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37571?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17793347#comment-17793347
 ] 

Nicholas Chammas commented on SPARK-37571:
--

Since we've 
[retired|https://lists.apache.org/thread/5n59fs22rtytflbz4sz1pz32ozzfbkrx] the 
venerable Jenkins infrastructure, I suppose we can close this issue.

> decouple amplab jenkins from spark website, builds and tests
> 
>
> Key: SPARK-37571
> URL: https://issues.apache.org/jira/browse/SPARK-37571
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.3.0
>Reporter: Shane Knapp
>Assignee: Shane Knapp
>Priority: Major
> Attachments: audit.txt, spark-repo-to-be-audited.txt
>
>
> we will be turning off jenkins on dec 23rd, and we need to decouple the build 
> infra from jenkins, as well as remove any amplab jenkins-specific docs on the 
> website, scripts and infra setup.
> i'll be creating > 1 PRs for this.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-37647) Expose percentile function in Scala/Python APIs

2023-12-05 Thread Nicholas Chammas (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37647?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicholas Chammas resolved SPARK-37647.
--
Resolution: Fixed

It looks like this got added as part of Spark 3.5: 
[https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.functions.percentile.html]
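
For anyone landing here later, usage looks roughly like this in 3.5 (a minimal 
sketch):

{code:python}
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
spark.range(1, 11).agg(F.percentile("id", 0.5)).show()  # exact median, no expr() required
{code}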

> Expose percentile function in Scala/Python APIs
> ---
>
> Key: SPARK-37647
> URL: https://issues.apache.org/jira/browse/SPARK-37647
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, SQL
>Affects Versions: 3.2.0
>    Reporter: Nicholas Chammas
>Priority: Minor
>
> SQL offers a percentile function (exact, not approximate) that is not 
> available directly in the Scala or Python DataFrame APIs.
> While it is possible to invoke SQL functions from Scala or Python via 
> {{{}expr(){}}}, I think most users expect function parity across Scala, 
> Python, and SQL. 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-45390) Remove `distutils` usage

2023-11-17 Thread Nicholas Chammas (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-45390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17787268#comment-17787268
 ] 

Nicholas Chammas commented on SPARK-45390:
--

Ah, are you referring to [PySpark's Python 
dependencies|https://github.com/apache/spark/blob/4520f3b2da01badb506488b6ff2899babd8c709e/python/setup.py#L310-L330]
 not supporting Python 3.12?

> Remove `distutils` usage
> 
>
> Key: SPARK-45390
> URL: https://issues.apache.org/jira/browse/SPARK-45390
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 4.0.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> [PEP-632|https://peps.python.org/pep-0632] deprecated {{distutils}} module in 
> Python {{3.10}} and dropped in Python {{3.12}} in favor of {{packaging}} 
> package.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-45390) Remove `distutils` usage

2023-11-15 Thread Nicholas Chammas (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-45390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17786401#comment-17786401
 ] 

Nicholas Chammas commented on SPARK-45390:
--

{quote}We don't promise to support all future unreleased Python versions
{quote}
"all future unreleased versions" is a tall ask that no-one is making. :) 

The relevant circumstances here are that a) Python 3.12 is already out and the 
backwards-incompatible changes are known and [very 
limited|https://docs.python.org/3/whatsnew/3.12.html], and b) Spark 4.0 may be 
a disruptive change and so many people may remain on Spark 3.5 for longer than 
usual.

If we expect 3.5 -> 4.0 to be an easy migration, then backporting a fix like 
this to 3.5 is not as important.
{quote}we need much more validation because all Python package ecosystem should 
work there without any issues
{quote}
I'm not sure what you mean here.

Anyway, I suppose we could just wait and see. Maybe I'm wrong, but I suspect 
many users will find it surprising that Spark 3.5 doesn't work on Python 3.12, 
especially if this is the only (or close to the only) fix required.
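
For context, the substitution PEP 632 points to is typically something like 
this (a general sketch, not necessarily the exact change that was made in 
Spark):

{code:python}
# Removed in Python 3.12:
#   from distutils.version import LooseVersion
#   LooseVersion("3.12.0") >= LooseVersion("3.8")

# Replacement from the packaging project:
from packaging.version import Version

Version("3.12.0") >= Version("3.8")
{code}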

> Remove `distutils` usage
> 
>
> Key: SPARK-45390
> URL: https://issues.apache.org/jira/browse/SPARK-45390
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 4.0.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> [PEP-632|https://peps.python.org/pep-0632] deprecated {{distutils}} module in 
> Python {{3.10}} and dropped in Python {{3.12}} in favor of {{packaging}} 
> package.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



Are DataFrame rows ordered without an explicit ordering clause?

2023-09-18 Thread Nicholas Chammas
I’ve always considered DataFrames to be logically equivalent to SQL tables or 
queries.

In SQL, the result order of any query is implementation-dependent without an 
explicit ORDER BY clause. Technically, you could run `SELECT * FROM table;` 10 
times in a row and get 10 different orderings.

I thought the same applied to DataFrames, but the docstring for the recently 
added method DataFrame.offset 

 implies otherwise.

This example will work fine in practice, of course. But if DataFrames are 
technically unordered without an explicit ordering clause, then in theory a 
future implementation change may result in “Bob" being the “first” row in the 
DataFrame, rather than “Tom”. That would make the example incorrect.

Is that not the case?
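
For illustration, here is the kind of explicit ordering that would make such 
an example deterministic (a sketch with made-up data, assuming a version where 
DataFrame.offset is available):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("Tom", 23), ("Alice", 25), ("Bob", 27)], ["name", "age"])

# Without orderBy(), which row offset(1) skips would be an implementation detail.
df.orderBy("age").offset(1).show()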

Nick



[jira] [Commented] (SPARK-31001) Add ability to create a partitioned table via catalog.createTable()

2022-08-31 Thread Nicholas Chammas (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31001?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17598403#comment-17598403
 ] 

Nicholas Chammas commented on SPARK-31001:
--

Thanks for sharing these details. This is very helpful.

Yeah, this seems like an "unofficial" answer to the original problem. It is 
helpful nonetheless, but as you said it will take a separate effort to 
formalize and document this. I agree that a formal solution will probably not 
use an option named with leading underscores.
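
For reference, the SQL route mentioned in the description, which the Catalog 
API would ideally mirror (a minimal sketch):

{code:python}
spark.sql("""
    CREATE TABLE events (id BIGINT, ds STRING)
    USING parquet
    PARTITIONED BY (ds)
""")
{code}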

> Add ability to create a partitioned table via catalog.createTable()
> ---
>
> Key: SPARK-31001
> URL: https://issues.apache.org/jira/browse/SPARK-31001
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Nicholas Chammas
>Priority: Minor
>
> There doesn't appear to be a way to create a partitioned table using the 
> Catalog interface.
> In SQL, however, you can do this via {{{}CREATE TABLE ... PARTITIONED BY{}}}.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31001) Add ability to create a partitioned table via catalog.createTable()

2022-08-30 Thread Nicholas Chammas (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31001?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17598115#comment-17598115
 ] 

Nicholas Chammas commented on SPARK-31001:
--

What's {{{}__partition_columns{}}}? Is that something specific to Delta, or are 
you saying it's a hidden feature of Spark?

> Add ability to create a partitioned table via catalog.createTable()
> ---
>
> Key: SPARK-31001
> URL: https://issues.apache.org/jira/browse/SPARK-31001
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>    Reporter: Nicholas Chammas
>Priority: Minor
>
> There doesn't appear to be a way to create a partitioned table using the 
> Catalog interface.
> In SQL, however, you can do this via {{{}CREATE TABLE ... PARTITIONED BY{}}}.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



Allowing all Reader or Writer settings to be provided as options

2022-08-09 Thread Nicholas Chammas
Hello people,

I want to bring some attention to SPARK-39630 
 and ask if there are any 
design objections to the idea proposed there.

The gist of the proposal is that there are some reader or writer directives 
that cannot be supplied as options, like the format, write mode, or 
partitioning settings. Allowing those directives to be specified as options too 
means that it will become possible to fully represent a reader or writer as a 
map of options and reconstruct it from that.

This makes certain workflows more natural, especially when you are trying to 
manage reader or writer configurations declaratively.

Is there some design reason not to enable this, or is it just a matter of doing 
the work?

Feel free to comment either here or on the ticket.
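
For concreteness, these are the kinds of settings that today require dedicated 
method calls rather than options (a sketch; the path is made up):

# df is any existing DataFrame; format, write mode, and partitioning all go
# through dedicated calls rather than through .option()/.options().
(df.write
   .format("parquet")
   .mode("overwrite")
   .partitionBy("ds")
   .save("/tmp/events"))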

Nick



[jira] [Created] (SPARK-39630) Allow all Reader or Writer settings to be provided as options

2022-06-28 Thread Nicholas Chammas (Jira)
Nicholas Chammas created an issue

 Spark / SPARK-39630
 Allow all Reader or Writer settings to be provided as options

  Issue Type: Improvement
  Affects Versions: 3.3.0
  Assignee: Unassigned
  Components: SQL
  Created: 28/Jun/22 21:03
  Priority: Minor
  Reporter: Nicholas Chammas
 
 Almost all Reader or Writer settings can be provided via individual calls to `.option()` or by providing a map to `.options()`. There are notable exceptions, though, like: 
 
read/write format 
write mode 
write partitionBy, bucketBy, and sortBy 
 These settings must be provided via dedicated method calls. Why not make it so that all settings can be provided as options? Is there a design reason not to do this? Any given DataFrameReader or DataFrameWriter (along with the streaming equivalents) should be able to "export" all of its settings as a map of options, and then in turn be reconstituted entirely from that map of options. 

 

reader1 = spark.read.option("format", "parquet").option("path", "/data")
options = reader1.getOptions()
reader2 = spark.read.options(options)

# reader1 and reader2 are configured identically
data1 = reader1.load()
data2 = reader2.load()

[jira] [Created] (SPARK-39582) "Since " docs on array_agg are incorrect

2022-06-24 Thread Nicholas Chammas (Jira)
Nicholas Chammas created SPARK-39582:


 Summary: "Since " docs on array_agg are incorrect
 Key: SPARK-39582
 URL: https://issues.apache.org/jira/browse/SPARK-39582
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.3.0
Reporter: Nicholas Chammas


[https://spark.apache.org/docs/latest/api/sql/#array_agg]

The docs currently say "Since: 2.0.0", but `array_agg` was added in Spark 3.3.0.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37219) support AS OF syntax

2022-05-16 Thread Nicholas Chammas (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17537589#comment-17537589
 ] 

Nicholas Chammas commented on SPARK-37219:
--

This change will enable not just Delta, but also Iceberg to use the {{AS OF}} 
syntax, correct?

By the way, could an admin please delete the spam comments just above (and 
perhaps also ban the user if that's all they comment on here)?

> support AS OF syntax
> 
>
> Key: SPARK-37219
> URL: https://issues.apache.org/jira/browse/SPARK-37219
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Huaxin Gao
>Assignee: Huaxin Gao
>Priority: Major
> Fix For: 3.3.0
>
>
> https://docs.databricks.com/delta/quick-start.html#query-an-earlier-version-of-the-table-time-travel
> Delta Lake time travel allows user to query an older snapshot of a Delta 
> table. To query an older version of a table, user needs to specify a version 
> or timestamp in a SELECT statement using AS OF syntax as the follows
> SELECT * FROM default.people10m VERSION AS OF 0;
> SELECT * FROM default.people10m TIMESTAMP AS OF '2019-01-29 00:37:58';
> This ticket is opened to add AS OF syntax in Spark



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-31001) Add ability to create a partitioned table via catalog.createTable()

2022-05-10 Thread Nicholas Chammas (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31001?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicholas Chammas updated SPARK-31001:
-
Description: 
There doesn't appear to be a way to create a partitioned table using the 
Catalog interface.

In SQL, however, you can do this via {{{}CREATE TABLE ... PARTITIONED BY{}}}.

  was:There doesn't appear to be a way to create a partitioned table using the 
Catalog interface.


> Add ability to create a partitioned table via catalog.createTable()
> ---
>
> Key: SPARK-31001
> URL: https://issues.apache.org/jira/browse/SPARK-31001
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>    Reporter: Nicholas Chammas
>Priority: Minor
>
> There doesn't appear to be a way to create a partitioned table using the 
> Catalog interface.
> In SQL, however, you can do this via {{{}CREATE TABLE ... PARTITIONED BY{}}}.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-37222) Max iterations reached in Operator Optimization w/left_anti or left_semi join and nested structures

2022-04-26 Thread Nicholas Chammas (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37222?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17528233#comment-17528233
 ] 

Nicholas Chammas edited comment on SPARK-37222 at 4/26/22 3:44 PM:
---

I've found a helpful log setting that causes Spark to print out detailed 
information about how exactly a plan is transformed during optimization:
{code:java}
spark.conf.set("spark.sql.planChangeLog.level", "warn") {code}
Here's the log generated by enabling this setting and running Shawn's example: 
[^plan-log.log]

To confirm what Shawn noted in his comment above, it looks like the chain of 
events that results in a loop is as follows:
 # ColumnPruning
 # FoldablePropagation __
 # RemoveNoopOperators
 # PushDownLeftSemiAntiJoin
 # ColumnPruning
 # CollapseProject
 # __

What seems to be the problem is that ColumnPruning inserts some Project 
operators which are then removed successively by CollapseProject, 
RemoveNoopOperators, and PushDownLeftSemiAntiJoin.

These rules go back and forth, undoing each other's work, until 
{{spark.sql.optimizer.maxIterations}} is exhausted.


was (Author: nchammas):
I've found a helpful log setting that causes Spark to print out detailed 
information about how exactly a plan is transformed during optimization:
{code:java}
spark.conf.set("spark.sql.planChangeLog.level", "warn") {code}
Here's the log generated by enabling this setting and running Shawn's example: 
[^plan-log.log]

To confirm what Shawn noted in his comment above, it looks like the chain of 
events that results in a loop is as follows:
 # PushDownLeftSemiAntiJoin
 # ColumnPruning
 # CollapseProject
 # FoldablePropagation
 # RemoveNoopOperators
 # 

What seems to be the problem is that:
 * ColumnPruning inserts a couple of Project operators which are then removed 
by CollapseProject.
 * CollapseProject in turn pushes up the left anti-join which is then pushed 
down again by PushDownLeftSemiAntiJoin.

These three rules go back and forth, undoing each other's work, until 
{{spark.sql.optimizer.maxIterations}} is exhausted.

> Max iterations reached in Operator Optimization w/left_anti or left_semi join 
> and nested structures
> ---
>
> Key: SPARK-37222
> URL: https://issues.apache.org/jira/browse/SPARK-37222
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer
>Affects Versions: 3.1.2, 3.2.0, 3.2.1
> Environment: I've reproduced the error on Spark 3.1.2, 3.2.0, and 
> with the current branch-3.2 HEAD (git commit 966c90c0b5) as of November 5, 
> 2021.
> The problem does not occur with Spark 3.0.1.
>  
>Reporter: Shawn Smith
>Priority: Major
> Attachments: plan-log.log
>
>
> The query optimizer never reaches a fixed point when optimizing the query 
> below. This manifests as a warning:
> > WARN: Max iterations (100) reached for batch Operator Optimization before 
> > Inferring Filters, please set 'spark.sql.optimizer.maxIterations' to a 
> > larger value.
> But the suggested fix won't help. The actual problem is that the optimizer 
> fails to make progress on each iteration and gets stuck in a loop.
> In practice, Spark logs a warning but continues on and appears to execute the 
> query successfully, albeit perhaps sub-optimally.
> To reproduce, paste the following into the Spark shell. With Spark 3.1.2 and 
> 3.2.0 but not 3.0.1 it will throw an exception:
> {noformat}
> case class Nested(b: Boolean, n: Long)
> case class Table(id: String, nested: Nested)
> case class Identifier(id: String)
> locally {
>   System.setProperty("spark.testing", "true") // Fail instead of logging a 
> warning
>   val df = List.empty[Table].toDS.cache()
>   val ids = List.empty[Identifier].toDS.cache()
>   df.join(ids, Seq("id"), "left_anti") // also fails with "left_semi"
> .select('id, 'nested("n"))
> .explain()
> }
> {noformat}
> Looking at the query plan as the optimizer iterates in 
> {{RuleExecutor.execute()}}, here's an example of the plan after 49 iterations:
> {noformat}
> Project [id#2, _gen_alias_108#108L AS nested.n#28L]
> +- Join LeftAnti, (id#2 = id#18)
>:- Project [id#2, nested#3.n AS _gen_alias_108#108L]
>:  +- InMemoryRelation [id#2, nested#3], StorageLevel(disk, memory, 
> deserialized, 1 replicas)
>:+- LocalTableScan , [id#2, nested#3]
>+- InMemoryRelation [id#18], StorageLevel(disk, memory, deserialized, 1 
> replicas)
>  +- LocalTableScan , [id#18]
> {noformat}
> And here's the plan after one mor

[jira] [Commented] (SPARK-37222) Max iterations reached in Operator Optimization w/left_anti or left_semi join and nested structures

2022-04-26 Thread Nicholas Chammas (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37222?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17528233#comment-17528233
 ] 

Nicholas Chammas commented on SPARK-37222:
--

I've found a helpful log setting that causes Spark to print out detailed 
information about how exactly a plan is transformed during optimization:
{code:java}
spark.conf.set("spark.sql.planChangeLog.level", "warn") {code}
Here's the log generated by enabling this setting and running Shawn's example: 
[^plan-log.log]

To confirm what Shawn noted in his comment above, it looks like the chain of 
events that results in a loop is as follows:
 # PushDownLeftSemiAntiJoin
 # ColumnPruning
 # CollapseProject
 # FoldablePropagation
 # RemoveNoopOperators
 # 

What seems to be the problem is that:
 * ColumnPruning inserts a couple of Project operators which are then removed 
by CollapseProject.
 * CollapseProject in turn pushes up the left anti-join which is then pushed 
down again by PushDownLeftSemiAntiJoin.

These three rules go back and forth, undoing each other's work, until 
{{spark.sql.optimizer.maxIterations}} is exhausted.

> Max iterations reached in Operator Optimization w/left_anti or left_semi join 
> and nested structures
> ---
>
> Key: SPARK-37222
> URL: https://issues.apache.org/jira/browse/SPARK-37222
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer
>Affects Versions: 3.1.2, 3.2.0, 3.2.1
> Environment: I've reproduced the error on Spark 3.1.2, 3.2.0, and 
> with the current branch-3.2 HEAD (git commit 966c90c0b5) as of November 5, 
> 2021.
> The problem does not occur with Spark 3.0.1.
>  
>Reporter: Shawn Smith
>Priority: Major
> Attachments: plan-log.log
>
>
> The query optimizer never reaches a fixed point when optimizing the query 
> below. This manifests as a warning:
> > WARN: Max iterations (100) reached for batch Operator Optimization before 
> > Inferring Filters, please set 'spark.sql.optimizer.maxIterations' to a 
> > larger value.
> But the suggested fix won't help. The actual problem is that the optimizer 
> fails to make progress on each iteration and gets stuck in a loop.
> In practice, Spark logs a warning but continues on and appears to execute the 
> query successfully, albeit perhaps sub-optimally.
> To reproduce, paste the following into the Spark shell. With Spark 3.1.2 and 
> 3.2.0 but not 3.0.1 it will throw an exception:
> {noformat}
> case class Nested(b: Boolean, n: Long)
> case class Table(id: String, nested: Nested)
> case class Identifier(id: String)
> locally {
>   System.setProperty("spark.testing", "true") // Fail instead of logging a 
> warning
>   val df = List.empty[Table].toDS.cache()
>   val ids = List.empty[Identifier].toDS.cache()
>   df.join(ids, Seq("id"), "left_anti") // also fails with "left_semi"
> .select('id, 'nested("n"))
> .explain()
> }
> {noformat}
> Looking at the query plan as the optimizer iterates in 
> {{RuleExecutor.execute()}}, here's an example of the plan after 49 iterations:
> {noformat}
> Project [id#2, _gen_alias_108#108L AS nested.n#28L]
> +- Join LeftAnti, (id#2 = id#18)
>:- Project [id#2, nested#3.n AS _gen_alias_108#108L]
>:  +- InMemoryRelation [id#2, nested#3], StorageLevel(disk, memory, 
> deserialized, 1 replicas)
>:+- LocalTableScan , [id#2, nested#3]
>+- InMemoryRelation [id#18], StorageLevel(disk, memory, deserialized, 1 
> replicas)
>  +- LocalTableScan , [id#18]
> {noformat}
> And here's the plan after one more iteration. You can see that all that has 
> changed is new aliases for the column in the nested column: 
> "{{_gen_alias_108#108L}}" to "{{_gen_alias_109#109L}}".
> {noformat}
> Project [id#2, _gen_alias_109#109L AS nested.n#28L]
> +- Join LeftAnti, (id#2 = id#18)
>:- Project [id#2, nested#3.n AS _gen_alias_109#109L]
>:  +- InMemoryRelation [id#2, nested#3], StorageLevel(disk, memory, 
> deserialized, 1 replicas)
>:+- LocalTableScan , [id#2, nested#3]
>+- InMemoryRelation [id#18], StorageLevel(disk, memory, deserialized, 1 
> replicas)
>  +- LocalTableScan , [id#18]
> {noformat}
> The optimizer continues looping and tweaking the alias until it hits the max 
> iteration count and bails out.
> Here's an example that includes a stack trace:
> {noformat}
> $ bin/spark-shell
> Welcome to
>     __
>  / __/__  ___ _/ /__
> _\

[jira] [Updated] (SPARK-37222) Max iterations reached in Operator Optimization w/left_anti or left_semi join and nested structures

2022-04-26 Thread Nicholas Chammas (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37222?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicholas Chammas updated SPARK-37222:
-
Attachment: plan-log.log

> Max iterations reached in Operator Optimization w/left_anti or left_semi join 
> and nested structures
> ---
>
> Key: SPARK-37222
> URL: https://issues.apache.org/jira/browse/SPARK-37222
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer
>Affects Versions: 3.1.2, 3.2.0, 3.2.1
> Environment: I've reproduced the error on Spark 3.1.2, 3.2.0, and 
> with the current branch-3.2 HEAD (git commit 966c90c0b5) as of November 5, 
> 2021.
> The problem does not occur with Spark 3.0.1.
>  
>Reporter: Shawn Smith
>Priority: Major
> Attachments: plan-log.log
>
>
> The query optimizer never reaches a fixed point when optimizing the query 
> below. This manifests as a warning:
> > WARN: Max iterations (100) reached for batch Operator Optimization before 
> > Inferring Filters, please set 'spark.sql.optimizer.maxIterations' to a 
> > larger value.
> But the suggested fix won't help. The actual problem is that the optimizer 
> fails to make progress on each iteration and gets stuck in a loop.
> In practice, Spark logs a warning but continues on and appears to execute the 
> query successfully, albeit perhaps sub-optimally.
> To reproduce, paste the following into the Spark shell. With Spark 3.1.2 and 
> 3.2.0 but not 3.0.1 it will throw an exception:
> {noformat}
> case class Nested(b: Boolean, n: Long)
> case class Table(id: String, nested: Nested)
> case class Identifier(id: String)
> locally {
>   System.setProperty("spark.testing", "true") // Fail instead of logging a 
> warning
>   val df = List.empty[Table].toDS.cache()
>   val ids = List.empty[Identifier].toDS.cache()
>   df.join(ids, Seq("id"), "left_anti") // also fails with "left_semi"
> .select('id, 'nested("n"))
> .explain()
> }
> {noformat}
> Looking at the query plan as the optimizer iterates in 
> {{RuleExecutor.execute()}}, here's an example of the plan after 49 iterations:
> {noformat}
> Project [id#2, _gen_alias_108#108L AS nested.n#28L]
> +- Join LeftAnti, (id#2 = id#18)
>:- Project [id#2, nested#3.n AS _gen_alias_108#108L]
>:  +- InMemoryRelation [id#2, nested#3], StorageLevel(disk, memory, 
> deserialized, 1 replicas)
>:+- LocalTableScan , [id#2, nested#3]
>+- InMemoryRelation [id#18], StorageLevel(disk, memory, deserialized, 1 
> replicas)
>  +- LocalTableScan , [id#18]
> {noformat}
> And here's the plan after one more iteration. You can see that all that has 
> changed is new aliases for the column in the nested column: 
> "{{_gen_alias_108#108L}}" to "{{_gen_alias_109#109L}}".
> {noformat}
> Project [id#2, _gen_alias_109#109L AS nested.n#28L]
> +- Join LeftAnti, (id#2 = id#18)
>:- Project [id#2, nested#3.n AS _gen_alias_109#109L]
>:  +- InMemoryRelation [id#2, nested#3], StorageLevel(disk, memory, 
> deserialized, 1 replicas)
>:+- LocalTableScan , [id#2, nested#3]
>+- InMemoryRelation [id#18], StorageLevel(disk, memory, deserialized, 1 
> replicas)
>  +- LocalTableScan , [id#18]
> {noformat}
> The optimizer continues looping and tweaking the alias until it hits the max 
> iteration count and bails out.
> Here's an example that includes a stack trace:
> {noformat}
> $ bin/spark-shell
> Welcome to
>     __
>  / __/__  ___ _/ /__
> _\ \/ _ \/ _ `/ __/  '_/
>/___/ .__/\_,_/_/ /_/\_\   version 3.2.0
>   /_/
> Using Scala version 2.12.15 (OpenJDK 64-Bit Server VM, Java 11.0.12)
> Type in expressions to have them evaluated.
> Type :help for more information.
> scala> :paste
> // Entering paste mode (ctrl-D to finish)
> case class Nested(b: Boolean, n: Long)
> case class Table(id: String, nested: Nested)
> case class Identifier(id: String)
> locally {
>   System.setProperty("spark.testing", "true") // Fail instead of logging a 
> warning
>   val df = List.empty[Table].toDS.cache()
>   val ids = List.empty[Identifier].toDS.cache()
>   df.join(ids, Seq("id"), "left_anti") // also fails with "left_semi"
> .select('id, 'nested("n"))
> .explain()
> }
> // Exiting paste mode, now interpreting.
> ja

[jira] [Updated] (SPARK-37696) Optimizer exceeds max iterations

2022-04-25 Thread Nicholas Chammas (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37696?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicholas Chammas updated SPARK-37696:
-
Affects Version/s: 3.2.1

> Optimizer exceeds max iterations
> 
>
> Key: SPARK-37696
> URL: https://issues.apache.org/jira/browse/SPARK-37696
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.2.0, 3.2.1
>Reporter: Denis Tarima
>Priority: Minor
>
> A specific scenario causing Spark's failure in tests and a warning in 
> production:
> 21/12/20 06:45:24 WARN BaseSessionStateBuilder$$anon$2: Max iterations (100) 
> reached for batch Operator Optimization before Inferring Filters, please set 
> 'spark.sql.optimizer.maxIterations' to a larger value.
> 21/12/20 06:45:24 WARN BaseSessionStateBuilder$$anon$2: Max iterations (100) 
> reached for batch Operator Optimization after Inferring Filters, please set 
> 'spark.sql.optimizer.maxIterations' to a larger value.
>  
> To reproduce run the following commands in `spark-shell`:
> {{// define case class for a struct type in an array}}
> {{case class S(v: Int, v2: Int)}}
>  
> {{// prepare a table with an array of structs}}
> {{Seq((10, Seq(S(1, 2)))).toDF("i", "data").write.saveAsTable("tbl")}}
>  
> {{// select using SQL and join with a dataset using "left_anti"}}
> {{spark.sql("select i, data[size(data) - 1].v from 
> tbl").join(Seq(10).toDF("i"), Seq("i"), "left_anti").show()}}
>  
> The following conditions are required:
>  # Having additional `v2` field in `S`
>  # Using `{{{}data[size(data) - 1]{}}}` instead of `{{{}element_at(data, 
> -1){}}}`
>  # Using `{{{}left_anti{}}}` in join operation
>  
> The same behavior was observed in `master` branch and `3.1.1`.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37222) Max iterations reached in Operator Optimization w/left_anti or left_semi join and nested structures

2022-04-25 Thread Nicholas Chammas (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37222?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17527527#comment-17527527
 ] 

Nicholas Chammas commented on SPARK-37222:
--

Thanks for the detailed report, [~ssmith]. I am hitting this issue as well on 
Spark 3.2.1, and your minimal test case also reproduces the issue for me.

How did you break down the optimization into its individual steps like that? 
That was very helpful.

I was able to use your breakdown to work around the issue by excluding 
{{{}PushDownLeftSemiAntiJoin{}}}:
{code:java}
spark.conf.set(
  "spark.sql.optimizer.excludedRules",
  "org.apache.spark.sql.catalyst.optimizer.PushDownLeftSemiAntiJoin"
){code}
If I run that before running the problematic query (including your test case), 
it seems to work around the issue.
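
The same exclusion can also be set when the session is first created, so it is 
in place before any queries are planned. A rough PySpark sketch (the app name 
is just illustrative):
{code:python}
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("left-anti-workaround")  # illustrative
    .config(
        "spark.sql.optimizer.excludedRules",
        "org.apache.spark.sql.catalyst.optimizer.PushDownLeftSemiAntiJoin",
    )
    .getOrCreate()
)
{code}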

> Max iterations reached in Operator Optimization w/left_anti or left_semi join 
> and nested structures
> ---
>
> Key: SPARK-37222
> URL: https://issues.apache.org/jira/browse/SPARK-37222
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer
>Affects Versions: 3.1.2, 3.2.0, 3.2.1
> Environment: I've reproduced the error on Spark 3.1.2, 3.2.0, and 
> with the current branch-3.2 HEAD (git commit 966c90c0b5) as of November 5, 
> 2021.
> The problem does not occur with Spark 3.0.1.
>  
>Reporter: Shawn Smith
>Priority: Major
>
> The query optimizer never reaches a fixed point when optimizing the query 
> below. This manifests as a warning:
> > WARN: Max iterations (100) reached for batch Operator Optimization before 
> > Inferring Filters, please set 'spark.sql.optimizer.maxIterations' to a 
> > larger value.
> But the suggested fix won't help. The actual problem is that the optimizer 
> fails to make progress on each iteration and gets stuck in a loop.
> In practice, Spark logs a warning but continues on and appears to execute the 
> query successfully, albeit perhaps sub-optimally.
> To reproduce, paste the following into the Spark shell. With Spark 3.1.2 and 
> 3.2.0 but not 3.0.1 it will throw an exception:
> {noformat}
> case class Nested(b: Boolean, n: Long)
> case class Table(id: String, nested: Nested)
> case class Identifier(id: String)
> locally {
>   System.setProperty("spark.testing", "true") // Fail instead of logging a 
> warning
>   val df = List.empty[Table].toDS.cache()
>   val ids = List.empty[Identifier].toDS.cache()
>   df.join(ids, Seq("id"), "left_anti") // also fails with "left_semi"
> .select('id, 'nested("n"))
> .explain()
> }
> {noformat}
> Looking at the query plan as the optimizer iterates in 
> {{RuleExecutor.execute()}}, here's an example of the plan after 49 iterations:
> {noformat}
> Project [id#2, _gen_alias_108#108L AS nested.n#28L]
> +- Join LeftAnti, (id#2 = id#18)
>:- Project [id#2, nested#3.n AS _gen_alias_108#108L]
>:  +- InMemoryRelation [id#2, nested#3], StorageLevel(disk, memory, 
> deserialized, 1 replicas)
>:+- LocalTableScan , [id#2, nested#3]
>+- InMemoryRelation [id#18], StorageLevel(disk, memory, deserialized, 1 
> replicas)
>  +- LocalTableScan , [id#18]
> {noformat}
> And here's the plan after one more iteration. You can see that all that has 
> changed is new aliases for the column in the nested column: 
> "{{_gen_alias_108#108L}}" to "{{_gen_alias_109#109L}}".
> {noformat}
> Project [id#2, _gen_alias_109#109L AS nested.n#28L]
> +- Join LeftAnti, (id#2 = id#18)
>:- Project [id#2, nested#3.n AS _gen_alias_109#109L]
>:  +- InMemoryRelation [id#2, nested#3], StorageLevel(disk, memory, 
> deserialized, 1 replicas)
>:+- LocalTableScan , [id#2, nested#3]
>+- InMemoryRelation [id#18], StorageLevel(disk, memory, deserialized, 1 
> replicas)
>  +- LocalTableScan , [id#18]
> {noformat}
> The optimizer continues looping and tweaking the alias until it hits the max 
> iteration count and bails out.
> Here's an example that includes a stack trace:
> {noformat}
> $ bin/spark-shell
> Welcome to
>     __
>  / __/__  ___ _/ /__
> _\ \/ _ \/ _ `/ __/  '_/
>/___/ .__/\_,_/_/ /_/\_\   version 3.2.0
>   /_/
> Using Scala version 2.12.15 (OpenJDK 64-Bit Server VM, Java 11.0.12)
> Type in expressions to have them evaluated.
> Type :help for more information.
> scala> :paste
> // Entering paste mode (ctrl-D to finish)
> case class Nested(b: Boolean, n:

[jira] [Updated] (SPARK-37222) Max iterations reached in Operator Optimization w/left_anti or left_semi join and nested structures

2022-04-25 Thread Nicholas Chammas (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37222?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicholas Chammas updated SPARK-37222:
-
Affects Version/s: 3.2.1

> Max iterations reached in Operator Optimization w/left_anti or left_semi join 
> and nested structures
> ---
>
> Key: SPARK-37222
> URL: https://issues.apache.org/jira/browse/SPARK-37222
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer
>Affects Versions: 3.1.2, 3.2.0, 3.2.1
> Environment: I've reproduced the error on Spark 3.1.2, 3.2.0, and 
> with the current branch-3.2 HEAD (git commit 966c90c0b5) as of November 5, 
> 2021.
> The problem does not occur with Spark 3.0.1.
>  
>Reporter: Shawn Smith
>Priority: Major
>
> The query optimizer never reaches a fixed point when optimizing the query 
> below. This manifests as a warning:
> > WARN: Max iterations (100) reached for batch Operator Optimization before 
> > Inferring Filters, please set 'spark.sql.optimizer.maxIterations' to a 
> > larger value.
> But the suggested fix won't help. The actual problem is that the optimizer 
> fails to make progress on each iteration and gets stuck in a loop.
> In practice, Spark logs a warning but continues on and appears to execute the 
> query successfully, albeit perhaps sub-optimally.
> To reproduce, paste the following into the Spark shell. With Spark 3.1.2 and 
> 3.2.0 but not 3.0.1 it will throw an exception:
> {noformat}
> case class Nested(b: Boolean, n: Long)
> case class Table(id: String, nested: Nested)
> case class Identifier(id: String)
> locally {
>   System.setProperty("spark.testing", "true") // Fail instead of logging a 
> warning
>   val df = List.empty[Table].toDS.cache()
>   val ids = List.empty[Identifier].toDS.cache()
>   df.join(ids, Seq("id"), "left_anti") // also fails with "left_semi"
> .select('id, 'nested("n"))
> .explain()
> }
> {noformat}
> Looking at the query plan as the optimizer iterates in 
> {{RuleExecutor.execute()}}, here's an example of the plan after 49 iterations:
> {noformat}
> Project [id#2, _gen_alias_108#108L AS nested.n#28L]
> +- Join LeftAnti, (id#2 = id#18)
>:- Project [id#2, nested#3.n AS _gen_alias_108#108L]
>:  +- InMemoryRelation [id#2, nested#3], StorageLevel(disk, memory, 
> deserialized, 1 replicas)
>:+- LocalTableScan , [id#2, nested#3]
>+- InMemoryRelation [id#18], StorageLevel(disk, memory, deserialized, 1 
> replicas)
>  +- LocalTableScan , [id#18]
> {noformat}
> And here's the plan after one more iteration. You can see that all that has 
> changed is new aliases for the column in the nested column: 
> "{{_gen_alias_108#108L}}" to "{{_gen_alias_109#109L}}".
> {noformat}
> Project [id#2, _gen_alias_109#109L AS nested.n#28L]
> +- Join LeftAnti, (id#2 = id#18)
>:- Project [id#2, nested#3.n AS _gen_alias_109#109L]
>:  +- InMemoryRelation [id#2, nested#3], StorageLevel(disk, memory, 
> deserialized, 1 replicas)
>:+- LocalTableScan , [id#2, nested#3]
>+- InMemoryRelation [id#18], StorageLevel(disk, memory, deserialized, 1 
> replicas)
>  +- LocalTableScan , [id#18]
> {noformat}
> The optimizer continues looping and tweaking the alias until it hits the max 
> iteration count and bails out.
> Here's an example that includes a stack trace:
> {noformat}
> $ bin/spark-shell
> Welcome to
>     __
>  / __/__  ___ _/ /__
> _\ \/ _ \/ _ `/ __/  '_/
>/___/ .__/\_,_/_/ /_/\_\   version 3.2.0
>   /_/
> Using Scala version 2.12.15 (OpenJDK 64-Bit Server VM, Java 11.0.12)
> Type in expressions to have them evaluated.
> Type :help for more information.
> scala> :paste
> // Entering paste mode (ctrl-D to finish)
> case class Nested(b: Boolean, n: Long)
> case class Table(id: String, nested: Nested)
> case class Identifier(id: String)
> locally {
>   System.setProperty("spark.testing", "true") // Fail instead of logging a 
> warning
>   val df = List.empty[Table].toDS.cache()
>   val ids = List.empty[Identifier].toDS.cache()
>   df.join(ids, Seq("id"), "left_anti") // also fails with "left_semi"
> .select('id, 'nested("n"))
> .explain()
> }
> // Exiting paste mode, now interpreting.
> java.lang.RuntimeException: Max iterations (100) reached for batch Operator 
> Optimization befo

Re: Deluge of GitBox emails

2022-04-04 Thread Nicholas Chammas
I’m not familiar with GitBox, but it must be an independent thing. When you 
participate in a PR, GitHub emails you notifications directly.

The GitBox emails, on the other hand, are going to the dev list. They seem like 
something set up as a repo-wide setting, or perhaps as an Apache bot that 
monitors repo activity and converts it into emails. (I’ve seen other projects 
-- I think Hadoop -- where GitHub activity is converted into comments on Jira.)

Turning off these GitBox emails should not have an impact on the usual GitHub 
emails we are all already familiar with.


> On Apr 4, 2022, at 9:47 AM, Sean Owen  wrote:
> 
> I think this must be related to the Gitbox migration that just happened. It 
> does seem like I'm getting more emails - some are on PRs I'm attached to, but 
> some I don't recognize. The thing is, I'm not yet clear if they duplicate the 
> normal Github emails - that is if we turn them off do we have anything?
> 
> On Mon, Apr 4, 2022 at 8:44 AM Nicholas Chammas  <mailto:nicholas.cham...@gmail.com>> wrote:
> I assume I’m not the only one getting these new emails from GitBox. Is there 
> a story behind that that I missed?
> 
> I’d rather not get these emails on the dev list. I assume most of the list 
> would agree with me.
> 
> GitHub has a good set of options for following activity on the repo. People 
> who want to follow conversations can easily do that without involving the 
> whole dev list.
> 
> Do we know who is responsible for these GitBox emails? Perhaps we need to 
> file an Apache INFRA ticket?
> 
> Nick
> 
> 
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org 
> <mailto:dev-unsubscr...@spark.apache.org>
> 



Deluge of GitBox emails

2022-04-04 Thread Nicholas Chammas
I assume I’m not the only one getting these new emails from GitBox. Is there a 
story behind that that I missed?

I’d rather not get these emails on the dev list. I assume most of the list 
would agree with me.

GitHub has a good set of options for following activity on the repo. People who 
want to follow conversations can easily do that without involving the whole dev 
list.

Do we know who is responsible for these GitBox emails? Perhaps we need to file 
an Apache INFRA ticket?

Nick


-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: [DISCUSS] Rename 'SQL' to 'SQL / DataFrame', and 'Query' to 'Execution' in SQL UI page

2022-03-28 Thread Nicholas Chammas
+1

Understanding the close relationship between SQL and DataFrames in Spark was a 
key learning moment for me, but I agree that using the terms interchangeably 
can be confusing.


> On Mar 27, 2022, at 9:27 PM, Hyukjin Kwon  wrote:
> 
> *for some reason, the image looks broken (to me). I am attaching again to 
> make sure.
> 
> 
> 
> On Mon, 28 Mar 2022 at 10:22, Hyukjin Kwon  > wrote:
> Hi all,
> 
> I have been investigating the improvements for Pandas API on Spark 
> specifically in UI.
> I chatted with a couple of people, and decided to send an email here to 
> discuss more.
> 
> Currently, both SQL and DataFrame API are shown in “SQL” tab as below:
> 
> 
> 
> which makes sense to developers because DataFrame API shares the same SQL 
> core but
> I do believe this makes less sense to end users. Please consider two more 
> points:
> 
> Spark ML users will run DataFrame-based MLlib API, but they will have to 
> check the "SQL" tab.
> Pandas API on Spark arguably has no link to SQL itself conceptually. It makes 
> less sense to users of pandas API.
> 
> So I would like to propose to rename:
> "SQL" to "SQL/DataFrame"
> "Query" to "Execution"
> 
> There's a PR open at https://github.com/apache/spark/pull/35973 
> . Please let me know your 
> thoughts on this. 
> 
> Thanks.



[jira] [Commented] (SPARK-5997) Increase partition count without performing a shuffle

2021-12-20 Thread Nicholas Chammas (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-5997?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17462805#comment-17462805
 ] 

Nicholas Chammas commented on SPARK-5997:
-

[~tenstriker] - I believe in your case you should be able to set 
{{spark.sql.files.maxRecordsPerFile}} to some number. Spark will not shuffle 
the data but it will still split up your output across multiple files.
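
For example (a sketch only; the record limit and output path are made up):
{code:python}
# Cap each output file at roughly 1 million records. Spark splits the files
# at write time without repartitioning or shuffling the data.
spark.conf.set("spark.sql.files.maxRecordsPerFile", 1000000)

df.write.mode("overwrite").parquet("/tmp/smaller-files")
{code}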

> Increase partition count without performing a shuffle
> -
>
> Key: SPARK-5997
> URL: https://issues.apache.org/jira/browse/SPARK-5997
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Reporter: Andrew Ash
>Priority: Major
>
> When decreasing partition count with rdd.repartition() or rdd.coalesce(), the 
> user has the ability to choose whether or not to perform a shuffle.  However 
> when increasing partition count there is no option of whether to perform a 
> shuffle or not -- a shuffle always occurs.
> This Jira is to create a {{rdd.repartition(largeNum, shuffle=false)}} call 
> that performs a repartition to a higher partition count without a shuffle.
> The motivating use case is to decrease the size of an individual partition 
> enough that the .toLocalIterator has significantly reduced memory pressure on 
> the driver, as it loads a partition at a time into the driver.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] (SPARK-5997) Increase partition count without performing a shuffle

2021-12-20 Thread Nicholas Chammas (Jira)


[ https://issues.apache.org/jira/browse/SPARK-5997 ]


Nicholas Chammas deleted comment on SPARK-5997:
-

was (Author: nchammas):
[~tenstriker] - I believe in your case you should be able to set 
{{spark.sql.files.maxRecordsPerFile}} to some number. Spark will not shuffle 
the data but it will still split up your output across multiple files.

> Increase partition count without performing a shuffle
> -
>
> Key: SPARK-5997
> URL: https://issues.apache.org/jira/browse/SPARK-5997
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Reporter: Andrew Ash
>Priority: Major
>
> When decreasing partition count with rdd.repartition() or rdd.coalesce(), the 
> user has the ability to choose whether or not to perform a shuffle.  However 
> when increasing partition count there is no option of whether to perform a 
> shuffle or not -- a shuffle always occurs.
> This Jira is to create a {{rdd.repartition(largeNum, shuffle=false)}} call 
> that performs a repartition to a higher partition count without a shuffle.
> The motivating use case is to decrease the size of an individual partition 
> enough that the .toLocalIterator has significantly reduced memory pressure on 
> the driver, as it loads a partition at a time into the driver.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24853) Support Column type for withColumn and withColumnRenamed apis

2021-12-20 Thread Nicholas Chammas (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-24853?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17462718#comment-17462718
 ] 

Nicholas Chammas commented on SPARK-24853:
--

I would expect something like that to yield an {{{}AnalysisException{}}}. Would 
that address your concern, or are you suggesting that it might be difficult to 
catch that sort of problem cleanly?

> Support Column type for withColumn and withColumnRenamed apis
> -
>
> Key: SPARK-24853
> URL: https://issues.apache.org/jira/browse/SPARK-24853
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.2, 3.2.0
>Reporter: nirav patel
>Priority: Minor
>
> Can we add an overloaded version of withColumn or withColumnRenamed that 
> accepts a Column instead of a String? That way I can specify a fully 
> qualified name in cases where there are duplicate column names, e.g. if I 
> have two columns with the same name as a result of a join and I want to 
> rename one of the fields, I can do it with this new API.
>  
> This would be similar to the drop API, which supports both String and Column.
>  
> def
> withColumn(colName: Column, col: Column): DataFrame
> Returns a new Dataset by adding a column or replacing the existing column 
> that has the same name.
>  
> def
> withColumnRenamed(existingName: Column, newName: Column): DataFrame
> Returns a new Dataset with a column renamed.
>  
>  
>  
> I think there should also be this one:
>  
> def
> withColumnRenamed(existingName: *Column*, newName: *Column*): DataFrame
> Returns a new Dataset with a column renamed.
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



Re: Creating a memory-efficient AggregateFunction to calculate Median

2021-12-15 Thread Nicholas Chammas
Thanks for the suggestions. I suppose I should share a bit more about what
I tried/learned, so others who come later can understand why a
memory-efficient, exact median is not in Spark.

Spark's own ApproximatePercentile also uses QuantileSummaries internally
<https://github.com/apache/spark/blob/3f3201a7882b817a8a3ecbfeb369dde01e7689d8/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/ApproximatePercentile.scala#L225-L237>.
QuantileSummaries is a helper class for computing approximate quantiles
with a single pass over the data. I don't think I can use it to compute an
exact median.

Spark does already have code to compute an exact median: Percentile
<https://github.com/apache/spark/blob/08123a3795683238352e5bf55452de381349fdd9/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/Percentile.scala#L74-L80>.
Since it works like other Catalyst expressions, it computes the median with
a single pass over the data. It does that by loading all the data into a
buffer and sorting it in memory
<https://github.com/apache/spark/blob/08123a3795683238352e5bf55452de381349fdd9/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/Percentile.scala#L209>.
This is why the leading comment on Percentile warns that too much data will
cause GC pauses and OOMs
<https://github.com/apache/spark/blob/08123a3795683238352e5bf55452de381349fdd9/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/Percentile.scala#L37-L39>
.

So I think this is what Reynold was getting at: With the design of Catalyst
expressions as they are today, there is no way to save memory by making
multiple passes over the data. So an approximate median is your best bet if
you want to avoid high memory usage.

You can build an exact median by doing other things, like multiple passes
over the data, or by using window functions, but that can't be captured in
a Catalyst Expression.
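
For reference, the approximate route is a one-liner from the DataFrame API.
A rough sketch, with a toy DataFrame and the accuracy parameter spelled out:

from pyspark.sql import functions as F

df = spark.range(1, 1001).toDF("x")   # toy data: 1..1000

# Approximate median with bounded memory; the third argument is the accuracy
# knob (higher means more precise, at the cost of more memory).
df.select(F.percentile_approx("x", 0.5, 10000).alias("approx_median")).show()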

On Wed, Dec 15, 2021 at 11:00 AM Fitch, Simeon  wrote:

> Nicholas,
>
> This may or may not be much help, but in RasterFrames we have an
> approximate quantiles Expression computed against Tiles (2d geospatial
> arrays) which makes use of
> `org.apache.spark.sql.catalyst.util.QuantileSummaries` to do the hard work.
> So perhaps a directionally correct example of doing what you look to do?
>
>
> https://github.com/locationtech/rasterframes/blob/develop/core/src/main/scala/org/locationtech/rasterframes/expressions/aggregates/ApproxCellQuantilesAggregate.scala
>
> In that same package are a number of other Aggregates, including
> declarative ones, which are another way of computing aggregations through
> composition of other Expressions.
>
> Simeon
>
>
>
>
>
> On Thu, Dec 9, 2021 at 9:26 PM Nicholas Chammas <
> nicholas.cham...@gmail.com> wrote:
>
>> I'm trying to create a new aggregate function. It's my first time working
>> with Catalyst, so it's exciting---but I'm also in a bit over my head.
>>
>> My goal is to create a function to calculate the median
>> <https://issues.apache.org/jira/browse/SPARK-26589>.
>>
>> As a very simple solution, I could just define median to be an alias of 
>> `Percentile(col,
>> 0.5)`. However, the leading comment on the Percentile expression
>> <https://github.com/apache/spark/blob/08123a3795683238352e5bf55452de381349fdd9/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/Percentile.scala#L37-L39>
>> highlights that it's very memory-intensive and can easily lead to
>> OutOfMemory errors.
>>
>> So instead of using Percentile, I'm trying to create an Expression that
>> calculates the median without needing to hold everything in memory at once.
>> I'm considering two different approaches:
>>
>> 1. Define Median as a combination of existing expressions: The median
>> can perhaps be built out of the existing expressions for Count
>> <https://github.com/apache/spark/blob/9af338cd685bce26abbc2dd4d077bde5068157b1/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/Count.scala#L48>
>> and NthValue
>> <https://github.com/apache/spark/blob/568ad6aa4435ce76ca3b5d9966e64259ea1f9b38/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/windowExpressions.scala#L675>
>> .
>>
>> I don't see a template I can follow for building a new expression out of
>> existing expressions (i.e. without having to implement a bunch of methods
>> for DeclarativeAggregate or ImperativeAggregate). I also don't know how I
>> would wrap NthValue to make it usable as a regular aggregate function. The
>> wrapped NthValue would need an implicit window that provides the necessary
>> ordering.
>>
>>
>&g

[jira] [Commented] (SPARK-24853) Support Column type for withColumn and withColumnRenamed apis

2021-12-14 Thread Nicholas Chammas (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-24853?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17459601#comment-17459601
 ] 

Nicholas Chammas commented on SPARK-24853:
--

Assuming we are talking about the example I provided: Yes, {{col("count")}} 
would still be ambiguous.

I don't know if Spark would know to catch that problem. But note that the 
current behavior of {{.withColumnRenamed('count', ...)}} renames all columns 
named "count", which is just incorrect.

So allowing {{col("count")}} will either be just as incorrect as the current 
behavior, or it will be an improvement in that Spark may complain that the 
column reference is ambiguous. I'd have to try it to confirm the behavior.

Of course, the main improvement offered by {{Column}} references is that users 
can do something like {{.withColumnRenamed(left_counts['count'], ...)}} and get 
the correct behavior.

I didn't follow what you are getting at regarding {{{}from_json{}}}, but does 
that address your concern?
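
To make this concrete (the table and column names below are made up, and the 
Column-accepting overload shown at the end is the *proposed* API, not something 
that exists today):
{code:python}
left_counts = left.groupBy("id").count()
right_counts = right.groupBy("id").count()
joined = left_counts.join(right_counts, "id")  # columns: id, count, count

# Current behavior: this renames *both* "count" columns.
renamed = joined.withColumnRenamed("count", "left_count")

# Proposed: disambiguate with a Column reference from the originating frame.
# renamed = joined.withColumnRenamed(left_counts["count"], "left_count")
{code}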

> Support Column type for withColumn and withColumnRenamed apis
> -
>
> Key: SPARK-24853
> URL: https://issues.apache.org/jira/browse/SPARK-24853
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.2, 3.2.0
>Reporter: nirav patel
>Priority: Minor
>
> Can we add an overloaded version of withColumn or withColumnRenamed that 
> accepts a Column instead of a String? That way I can specify a fully 
> qualified name in cases where there are duplicate column names, e.g. if I 
> have two columns with the same name as a result of a join and I want to 
> rename one of the fields, I can do it with this new API.
>  
> This would be similar to the drop API, which supports both String and Column.
>  
> def
> withColumn(colName: Column, col: Column): DataFrame
> Returns a new Dataset by adding a column or replacing the existing column 
> that has the same name.
>  
> def
> withColumnRenamed(existingName: Column, newName: Column): DataFrame
> Returns a new Dataset with a column renamed.
>  
>  
>  
> I think there should also be this one:
>  
> def
> withColumnRenamed(existingName: *Column*, newName: *Column*): DataFrame
> Returns a new Dataset with a column renamed.
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-25150) Joining DataFrames derived from the same source yields confusing/incorrect results

2021-12-14 Thread Nicholas Chammas (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-25150?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicholas Chammas resolved SPARK-25150.
--
Fix Version/s: 3.2.0
   Resolution: Fixed

It looks like Spark 3.1.2 exhibits a different sort of broken behavior:
{code:java}
pyspark.sql.utils.AnalysisException: Column State#38 are ambiguous. It's 
probably because you joined several Datasets together, and some of these 
Datasets are the same. This column points to one of the Datasets but Spark is 
unable to figure out which one. Please alias the Datasets with different names 
via `Dataset.as` before joining them, and specify the column using qualified 
name, e.g. `df.as("a").join(df.as("b"), $"a.id" > $"b.id")`. You can also set 
spark.sql.analyzer.failAmbiguousSelfJoin to false to disable this check. {code}
I don't think the join in {{zombie-analysis.py}} is ambiguous, and since this 
now works fine in Spark 3.2.0, that's what I'm going to mark as the "Fix 
Version" for this issue.

The fix must have made it in somewhere between Spark 3.1.2 and 3.2.0.
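
For anyone still on 3.1.x, the alias-based workaround suggested by that error 
message looks roughly like this in PySpark (the DataFrame and column names here 
are illustrative):
{code:python}
from pyspark.sql.functions import col

a = derived_one.alias("a")
b = derived_two.alias("b")
a.join(b, col("a.State") == col("b.State")).select("a.State")
{code}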

> Joining DataFrames derived from the same source yields confusing/incorrect 
> results
> --
>
> Key: SPARK-25150
> URL: https://issues.apache.org/jira/browse/SPARK-25150
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.1, 2.4.3
>Reporter: Nicholas Chammas
>Priority: Major
>  Labels: correctness
> Fix For: 3.2.0
>
> Attachments: expected-output.txt, 
> output-with-implicit-cross-join.txt, output-without-implicit-cross-join.txt, 
> persons.csv, states.csv, zombie-analysis.py
>
>
> I have two DataFrames, A and B. From B, I have derived two additional 
> DataFrames, B1 and B2. When joining A to B1 and B2, I'm getting a very 
> confusing error:
> {code:java}
> Join condition is missing or trivial.
> Either: use the CROSS JOIN syntax to allow cartesian products between these
> relations, or: enable implicit cartesian products by setting the configuration
> variable spark.sql.crossJoin.enabled=true;
> {code}
> Then, when I configure "spark.sql.crossJoin.enabled=true" as instructed, 
> Spark appears to give me incorrect answers.
> I am not sure if I am missing something obvious, or if there is some kind of 
> bug here. The "join condition is missing" error is confusing and doesn't make 
> sense to me, and the seemingly incorrect output is concerning.
> I've attached a reproduction, along with the output I'm seeing with and 
> without the implicit cross join enabled.
> I realize the join I've written is not "correct" in the sense that it should 
> be left outer join instead of an inner join (since some of the aggregates are 
> not available for all states), but that doesn't explain Spark's behavior.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25150) Joining DataFrames derived from the same source yields confusing/incorrect results

2021-12-14 Thread Nicholas Chammas (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-25150?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17459494#comment-17459494
 ] 

Nicholas Chammas commented on SPARK-25150:
--

I re-ran my test (described in the issue description + summarized in my comment 
just above) on Spark 3.2.0, and this issue appears to be resolved! Whether with 
cross joins enabled or disabled, I now get the correct results.

Obviously, I have no clue what change since Spark 2.4.3 (the last time I reran 
this test) was responsible for the fix.

But to be clear, in case anyone wants to reproduce my test:
 # Download all 6 files attached to this issue into a folder.
 # Then, from within that folder, run {{spark-submit zombie-analysis.py}} and 
inspect the output.
 # Then, enable cross joins (commented out on line 9), rerun the script, and 
reinspect the output.
 # Compare the final bit of output from both runs against 
{{{}expected-output.txt{}}}.

> Joining DataFrames derived from the same source yields confusing/incorrect 
> results
> --
>
> Key: SPARK-25150
> URL: https://issues.apache.org/jira/browse/SPARK-25150
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.1, 2.4.3
>    Reporter: Nicholas Chammas
>Priority: Major
>  Labels: correctness
> Attachments: expected-output.txt, 
> output-with-implicit-cross-join.txt, output-without-implicit-cross-join.txt, 
> persons.csv, states.csv, zombie-analysis.py
>
>
> I have two DataFrames, A and B. From B, I have derived two additional 
> DataFrames, B1 and B2. When joining A to B1 and B2, I'm getting a very 
> confusing error:
> {code:java}
> Join condition is missing or trivial.
> Either: use the CROSS JOIN syntax to allow cartesian products between these
> relations, or: enable implicit cartesian products by setting the configuration
> variable spark.sql.crossJoin.enabled=true;
> {code}
> Then, when I configure "spark.sql.crossJoin.enabled=true" as instructed, 
> Spark appears to give me incorrect answers.
> I am not sure if I am missing something obvious, or if there is some kind of 
> bug here. The "join condition is missing" error is confusing and doesn't make 
> sense to me, and the seemingly incorrect output is concerning.
> I've attached a reproduction, along with the output I'm seeing with and 
> without the implicit cross join enabled.
> I realize the join I've written is not "correct" in the sense that it should 
> be left outer join instead of an inner join (since some of the aggregates are 
> not available for all states), but that doesn't explain Spark's behavior.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (HADOOP-18029) Update CompressionCodecFactory to handle uppercase file extensions

2021-12-14 Thread Nicholas Chammas (Jira)


[ 
https://issues.apache.org/jira/browse/HADOOP-18029?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17459478#comment-17459478
 ] 

Nicholas Chammas commented on HADOOP-18029:
---

I have not contributed code to Hadoop before. Would you consider HADOOP-17562 a 
loosely related issue? It proposes to enable users to explicitly specify the 
input file codec to use (like they already can for output files).

If users were able to do that, they would be able to work around issues where 
Hadoop does not auto-detect the correct codec to use (e.g. because the 
extension is upper case). Is that correct?

> Update CompressionCodecFactory to handle uppercase file extensions
> --
>
> Key: HADOOP-18029
> URL: https://issues.apache.org/jira/browse/HADOOP-18029
> Project: Hadoop Common
>  Issue Type: Improvement
>  Components: common, io, test
> Environment: Tested locally on macOS 11.6.1, IntelliJ IDEA 2021.2.3, 
> running maven commands through terminal. Forked from trunk branch on November 
> 29th, 2021.
>Reporter: Desmond Sisson
>Assignee: Desmond Sisson
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 3.4.0
>
>  Time Spent: 2h
>  Remaining Estimate: 0h
>
> I've updated the CompressionCodecFactory to be able to handle filenames with 
> capitalized compression extensions. Two of the three maps internal to the 
> class which are used to store codecs have existing lowercase casts, but it is 
> absent from the call inside getCodec() used for comparing path names.
> I updated the corresponding unit test in TestCodecFactory to include intended 
> use cases, and confirmed the test passes with the change. I also updated the 
> error message in the case of a null from an NPE to a rich error message. I've 
> resolved all checkstyle violations within the changed files.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-issues-h...@hadoop.apache.org



[jira] [Updated] (HADOOP-17562) Provide mechanism for explicitly specifying the compression codec for input files

2021-12-14 Thread Nicholas Chammas (Jira)


 [ 
https://issues.apache.org/jira/browse/HADOOP-17562?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicholas Chammas updated HADOOP-17562:
--
Component/s: io

> Provide mechanism for explicitly specifying the compression codec for input 
> files
> -
>
> Key: HADOOP-17562
> URL: https://issues.apache.org/jira/browse/HADOOP-17562
> Project: Hadoop Common
>  Issue Type: Improvement
>  Components: io
>    Reporter: Nicholas Chammas
>Priority: Minor
>
> I come to you via SPARK-29280.
> I am looking for the file _input_ equivalents of the following settings:
> {code:java}
> mapreduce.output.fileoutputformat.compress
> mapreduce.map.output.compress{code}
> Right now, I understand that Hadoop infers the codec to use when reading a 
> file from the file's extension.
> However, in some cases the files may have the incorrect extension or no 
> extension. There are links to some examples from SPARK-29280.
> Ideally, you should be able to explicitly specify the codec to use to read 
> those files. I don't believe that's possible today. Instead, the current 
> workaround appears to be to [create a custom codec 
> class|https://stackoverflow.com/a/17152167/877069] and override the 
> getDefaultExtension method to specify the extension to expect.
> Does it make sense to offer an explicit way to select the compression codec 
> for file input, mirroring how things work for file output?



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-issues-h...@hadoop.apache.org



[jira] [Commented] (SPARK-24853) Support Column type for withColumn and withColumnRenamed apis

2021-12-14 Thread Nicholas Chammas (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-24853?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17459467#comment-17459467
 ] 

Nicholas Chammas commented on SPARK-24853:
--

[~hyukjin.kwon] - Are you still opposed to this proposed improvement? If not, 
I'd like to work on it.

> Support Column type for withColumn and withColumnRenamed apis
> -
>
> Key: SPARK-24853
> URL: https://issues.apache.org/jira/browse/SPARK-24853
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.2, 3.2.0
>Reporter: nirav patel
>Priority: Minor
>
> Can we add an overloaded version of withColumn or withColumnRenamed that 
> accepts a Column instead of a String? That way I can specify a fully 
> qualified name in cases where there are duplicate column names, e.g. if I 
> have two columns with the same name as a result of a join and I want to 
> rename one of the fields, I can do it with this new API.
>  
> This would be similar to the drop API, which supports both String and Column.
>  
> def
> withColumn(colName: Column, col: Column): DataFrame
> Returns a new Dataset by adding a column or replacing the existing column 
> that has the same name.
>  
> def
> withColumnRenamed(existingName: Column, newName: Column): DataFrame
> Returns a new Dataset with a column renamed.
>  
>  
>  
> I think there should also be this one:
>  
> def
> withColumnRenamed(existingName: *Column*, newName: *Column*): DataFrame
> Returns a new Dataset with a column renamed.
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-26589) proper `median` method for spark dataframe

2021-12-14 Thread Nicholas Chammas (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-26589?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicholas Chammas resolved SPARK-26589.
--
Resolution: Won't Fix

Marking this as "Won't Fix", but I suppose if someone really wanted to, they 
could reopen this issue and propose adding a median function that is simply an 
alias for {{{}percentile(col, 0.5){}}}. Don't know how the committers would 
feel about that.

> proper `median` method for spark dataframe
> --
>
> Key: SPARK-26589
> URL: https://issues.apache.org/jira/browse/SPARK-26589
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Jan Gorecki
>Priority: Minor
>
> I found multiple tickets asking for a median function to be implemented in 
> Spark. Most of those tickets link to "SPARK-6761 Approximate quantile" as a 
> duplicate. The thing is that approximate quantile is a workaround for the 
> lack of a median function. Thus I am filing this feature request for a 
> proper, exact (not approximate) median function. I am aware of the 
> difficulties caused by a distributed environment when trying to compute a 
> median; nevertheless, I don't think those difficulties are a good enough 
> reason to drop the `median` function from the scope of Spark. I am not 
> asking for an efficient median, just an exact median.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26589) proper `median` method for spark dataframe

2021-12-14 Thread Nicholas Chammas (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-26589?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17459455#comment-17459455
 ] 

Nicholas Chammas commented on SPARK-26589:
--

It looks like making a distributed, memory-efficient implementation of median 
is not possible using the design of Catalyst as it stands today. For more 
details, please see [this thread on the dev 
list|http://mail-archives.apache.org/mod_mbox/spark-dev/202112.mbox/%3cCAOhmDzev8d4H20XT1hUP9us=cpjeysgcf+xev7lg7dka1gj...@mail.gmail.com%3e].

It's possible to get an exact median today by using {{{}percentile(col, 
0.5){}}}, which is [available via the SQL 
API|https://spark.apache.org/docs/3.2.0/sql-ref-functions-builtin.html#aggregate-functions].
 It's not memory-efficient, so it may not work well on large datasets.

The Python and Scala DataFrame APIs do not offer this exact percentile 
function, so I've filed SPARK-37647 to track exposing this function in those 
APIs.
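
In the meantime, it can be invoked from PySpark through {{expr()}} (the column 
name below is illustrative):
{code:python}
from pyspark.sql.functions import expr

# Exact median by calling the SQL-only aggregate through expr(). Note that it
# buffers all values being aggregated, so it can OOM on large datasets.
df.agg(expr("percentile(x, 0.5)").alias("median"))
{code}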

> proper `median` method for spark dataframe
> --
>
> Key: SPARK-26589
> URL: https://issues.apache.org/jira/browse/SPARK-26589
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Jan Gorecki
>Priority: Minor
>
> I found multiple tickets asking for a median function to be implemented in 
> Spark. Most of those tickets link to "SPARK-6761 Approximate quantile" as a 
> duplicate. The thing is that approximate quantile is a workaround for the 
> lack of a median function. Thus I am filing this feature request for a 
> proper, exact (not approximate) median function. I am aware of the 
> difficulties caused by a distributed environment when trying to compute a 
> median; nevertheless, I don't think those difficulties are a good enough 
> reason to drop the `median` function from the scope of Spark. I am not 
> asking for an efficient median, just an exact median.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-37647) Expose percentile function in Scala/Python APIs

2021-12-14 Thread Nicholas Chammas (Jira)
Nicholas Chammas created SPARK-37647:


 Summary: Expose percentile function in Scala/Python APIs
 Key: SPARK-37647
 URL: https://issues.apache.org/jira/browse/SPARK-37647
 Project: Spark
  Issue Type: Improvement
  Components: PySpark, SQL
Affects Versions: 3.2.0
Reporter: Nicholas Chammas


SQL offers a percentile function (exact, not approximate) that is not available 
directly in the Scala or Python DataFrame APIs.

While it is possible to invoke SQL functions from Scala or Python via 
{{{}expr(){}}}, I think most users expect function parity across Scala, Python, 
and SQL. 



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



Re: Creating a memory-efficient AggregateFunction to calculate Median

2021-12-13 Thread Nicholas Chammas
Yeah, I think approximate percentile is good enough most of the time.

I don't have a specific need for a precise median. I was interested in
implementing it more as a Catalyst learning exercise, but it turns out I
picked a bad learning exercise to solve. :)

On Mon, Dec 13, 2021 at 9:46 PM Reynold Xin  wrote:

> tl;dr: there's no easy way to implement aggregate expressions that'd
> require multiple pass over data. It is simply not something that's
> supported and doing so would be very high cost.
>
> Would you be OK using approximate percentile? That's relatively cheap.
>
>
>
> On Mon, Dec 13, 2021 at 6:43 PM, Nicholas Chammas <
> nicholas.cham...@gmail.com> wrote:
>
>> No takers here? :)
>>
>> I can see now why a median function is not available in most data
>> processing systems. It's pretty annoying to implement!
>>
>> On Thu, Dec 9, 2021 at 9:25 PM Nicholas Chammas <
>> nicholas.cham...@gmail.com> wrote:
>>
>>> I'm trying to create a new aggregate function. It's my first time
>>> working with Catalyst, so it's exciting---but I'm also in a bit over my
>>> head.
>>>
>>> My goal is to create a function to calculate the median
>>> <https://issues.apache.org/jira/browse/SPARK-26589>.
>>>
>>> As a very simple solution, I could just define median to be an alias of 
>>> `Percentile(col,
>>> 0.5)`. However, the leading comment on the Percentile expression
>>> <https://github.com/apache/spark/blob/08123a3795683238352e5bf55452de381349fdd9/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/Percentile.scala#L37-L39>
>>> highlights that it's very memory-intensive and can easily lead to
>>> OutOfMemory errors.
>>>
>>> So instead of using Percentile, I'm trying to create an Expression that
>>> calculates the median without needing to hold everything in memory at once.
>>> I'm considering two different approaches:
>>>
>>> 1. Define Median as a combination of existing expressions: The median
>>> can perhaps be built out of the existing expressions for Count
>>> <https://github.com/apache/spark/blob/9af338cd685bce26abbc2dd4d077bde5068157b1/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/Count.scala#L48>
>>> and NthValue
>>> <https://github.com/apache/spark/blob/568ad6aa4435ce76ca3b5d9966e64259ea1f9b38/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/windowExpressions.scala#L675>
>>> .
>>>
>>> I don't see a template I can follow for building a new expression out of
>>> existing expressions (i.e. without having to implement a bunch of methods
>>> for DeclarativeAggregate or ImperativeAggregate). I also don't know how I
>>> would wrap NthValue to make it usable as a regular aggregate function. The
>>> wrapped NthValue would need an implicit window that provides the necessary
>>> ordering.
>>>
>>>
>>> Is there any potential to this idea? Any pointers on how to implement it?
>>>
>>>
>>> 2. Another memory-light approach to calculating the median requires
>>> multiple passes over the data to converge on the answer. The approach is 
>>> described
>>> here
>>> <https://www.quora.com/Distributed-Algorithms/What-is-the-distributed-algorithm-to-determine-the-median-of-arrays-of-integers-located-on-different-computers>.
>>> (I posted a sketch implementation of this approach using Spark's user-level
>>> API here
>>> <https://issues.apache.org/jira/browse/SPARK-26589?focusedCommentId=17452081=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17452081>
>>> .)
>>>
>>> I am also struggling to understand how I would build an aggregate
>>> function like this, since it requires multiple passes over the data. From
>>> what I can see, Catalyst's aggregate functions are designed to work with a
>>> single pass over the data.
>>>
>>> We don't seem to have an interface for AggregateFunction that supports
>>> multiple passes over the data. Is there some way to do this?
>>>
>>>
>>> Again, this is my first serious foray into Catalyst. Any specific
>>> implementation guidance is appreciated!
>>>
>>> Nick
>>>
>>
>


Re: Creating a memory-efficient AggregateFunction to calculate Median

2021-12-13 Thread Nicholas Chammas
No takers here? :)

I can see now why a median function is not available in most data
processing systems. It's pretty annoying to implement!

On Thu, Dec 9, 2021 at 9:25 PM Nicholas Chammas 
wrote:

> I'm trying to create a new aggregate function. It's my first time working
> with Catalyst, so it's exciting---but I'm also in a bit over my head.
>
> My goal is to create a function to calculate the median
> <https://issues.apache.org/jira/browse/SPARK-26589>.
>
> As a very simple solution, I could just define median to be an alias of 
> `Percentile(col,
> 0.5)`. However, the leading comment on the Percentile expression
> <https://github.com/apache/spark/blob/08123a3795683238352e5bf55452de381349fdd9/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/Percentile.scala#L37-L39>
> highlights that it's very memory-intensive and can easily lead to
> OutOfMemory errors.
>
> So instead of using Percentile, I'm trying to create an Expression that
> calculates the median without needing to hold everything in memory at once.
> I'm considering two different approaches:
>
> 1. Define Median as a combination of existing expressions: The median can
> perhaps be built out of the existing expressions for Count
> <https://github.com/apache/spark/blob/9af338cd685bce26abbc2dd4d077bde5068157b1/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/Count.scala#L48>
> and NthValue
> <https://github.com/apache/spark/blob/568ad6aa4435ce76ca3b5d9966e64259ea1f9b38/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/windowExpressions.scala#L675>
> .
>
> I don't see a template I can follow for building a new expression out of
> existing expressions (i.e. without having to implement a bunch of methods
> for DeclarativeAggregate or ImperativeAggregate). I also don't know how I
> would wrap NthValue to make it usable as a regular aggregate function. The
> wrapped NthValue would need an implicit window that provides the necessary
> ordering.
>
>
> Is there any potential to this idea? Any pointers on how to implement it?
>
>
> 2. Another memory-light approach to calculating the median requires
> multiple passes over the data to converge on the answer. The approach is 
> described
> here
> <https://www.quora.com/Distributed-Algorithms/What-is-the-distributed-algorithm-to-determine-the-median-of-arrays-of-integers-located-on-different-computers>.
> (I posted a sketch implementation of this approach using Spark's user-level
> API here
> <https://issues.apache.org/jira/browse/SPARK-26589?focusedCommentId=17452081=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17452081>
> .)
>
> I am also struggling to understand how I would build an aggregate function
> like this, since it requires multiple passes over the data. From what I can
> see, Catalyst's aggregate functions are designed to work with a single pass
> over the data.
>
> We don't seem to have an interface for AggregateFunction that supports
> multiple passes over the data. Is there some way to do this?
>
>
> Again, this is my first serious foray into Catalyst. Any specific
> implementation guidance is appreciated!
>
> Nick
>
>


Creating a memory-efficient AggregateFunction to calculate Median

2021-12-09 Thread Nicholas Chammas
I'm trying to create a new aggregate function. It's my first time working
with Catalyst, so it's exciting---but I'm also in a bit over my head.

My goal is to create a function to calculate the median
.

As a very simple solution, I could just define median to be an alias
of `Percentile(col,
0.5)`. However, the leading comment on the Percentile expression

highlights that it's very memory-intensive and can easily lead to
OutOfMemory errors.

So instead of using Percentile, I'm trying to create an Expression that
calculates the median without needing to hold everything in memory at once.
I'm considering two different approaches:

1. Define Median as a combination of existing expressions: The median can
perhaps be built out of the existing expressions for Count

and NthValue

.

I don't see a template I can follow for building a new expression out of
existing expressions (i.e. without having to implement a bunch of methods
for DeclarativeAggregate or ImperativeAggregate). I also don't know how I
would wrap NthValue to make it usable as a regular aggregate function. The
wrapped NthValue would need an implicit window that provides the necessary
ordering.


Is there any potential to this idea? Any pointers on how to implement it?


2. Another memory-light approach to calculating the median requires
multiple passes over the data to converge on the answer. The approach
is described
here
.
(I posted a sketch implementation of this approach using Spark's user-level
API here

.)

I am also struggling to understand how I would build an aggregate function
like this, since it requires multiple passes over the data. From what I can
see, Catalyst's aggregate functions are designed to work with a single pass
over the data.

We don't seem to have an interface for AggregateFunction that supports
multiple passes over the data. Is there some way to do this?


Again, this is my first serious foray into Catalyst. Any specific
implementation guidance is appreciated!

Nick


[jira] [Commented] (SPARK-26589) proper `median` method for spark dataframe

2021-12-09 Thread Nicholas Chammas (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-26589?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17456860#comment-17456860
 ] 

Nicholas Chammas commented on SPARK-26589:
--

That makes sense to me. I've been struggling with how to approach the 
implementation, so I've posted to the dev list asking for a little more help.

> proper `median` method for spark dataframe
> --
>
> Key: SPARK-26589
> URL: https://issues.apache.org/jira/browse/SPARK-26589
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Jan Gorecki
>Priority: Minor
>
> I found multiple tickets asking for a median function to be implemented in 
> Spark. Most of those tickets link to "SPARK-6761 Approximate quantile" as a 
> duplicate. The thing is that approximate quantile is a workaround for the 
> lack of a median function. Thus I am filing this feature request for a 
> proper, exact (not approximate) median function. I am aware of the 
> difficulties caused by a distributed environment when trying to compute a 
> median; nevertheless, I don't think those difficulties are a good enough 
> reason to drop the `median` function from the scope of Spark. I am not 
> asking for an efficient median, just an exact median.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



Re: [Apache Spark Jenkins] build system shutting down Dec 23th, 2021

2021-12-06 Thread Nicholas Chammas
Farewell to Jenkins and its classic weather forecast build status icons:

[image: health-80plus.png][image: health-60to79.png][image:
health-40to59.png][image: health-20to39.png][image: health-00to19.png]

And thank you Shane for all the help over these years.

Will you be nuking all the Jenkins-related code in the repo after the 23rd?

On Mon, Dec 6, 2021 at 3:02 PM shane knapp ☠  wrote:

> hey everyone!
>
> after a marathon run of nearly a decade, we're finally going to be
> shutting down {amp|rise}lab jenkins at the end of this month...
>
> the earliest snapshot i could find is from 2013 with builds for spark 0.7:
>
> https://web.archive.org/web/20130426155726/https://amplab.cs.berkeley.edu/jenkins/
>
> it's been a hell of a run, and i'm gonna miss randomly tweaking the build
> system, but technology has moved on and running a dedicated set of servers
> for just one open source project is just too expensive for us here at uc
> berkeley.
>
> if there's interest, i'll fire up a zoom session and all y'alls can watch
> me type the final command:
>
> systemctl stop jenkins
>
> feeling bittersweet,
>
> shane
> --
> Shane Knapp
> Computer Guy / Voice of Reason
> UC Berkeley EECS Research / RISELab Staff Technical Lead
> https://rise.cs.berkeley.edu
>


[jira] [Commented] (SPARK-26589) proper `median` method for spark dataframe

2021-12-01 Thread Nicholas Chammas (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-26589?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17452081#comment-17452081
 ] 

Nicholas Chammas commented on SPARK-26589:
--

[~srowen] - I'll ask for help on the dev list if appropriate, but I'm wondering 
if you can give me some high level guidance here.

I have an outline of an approach to calculate the median that does not require 
sorting or shuffling the data. It's based on the approach I linked to in my 
previous comment (by Michael Harris). It does require, however, multiple passes 
over the data for the algorithm to converge on the median.

Here's a working sketch of the approach:
{code:python}
from pyspark.sql.functions import col


def spark_median(data):
    total_count = data.count()
    if total_count % 2 == 0:
        target_positions = [total_count // 2, total_count // 2 + 1]
    else:
        target_positions = [total_count // 2 + 1]
    target_values = [
        kth_position(data, k, data_count=total_count)
        for k in target_positions
    ]
    return sum(target_values) / len(target_values)


def kth_position(data, k, data_count=None):
    if data_count is None:
        total_count = data.count()
    else:
        total_count = data_count
    if k > total_count or k < 1:
        return None
    while True:
        # This value, along with the following two counts, are the only data that need
        # to be shared across nodes.
        some_value = data.first()["id"]
        # These two counts can be performed together via an aggregator.
        larger_count = data.where(col("id") > some_value).count()
        equal_count = data.where(col("id") == some_value).count()
        value_positions = range(
            total_count - larger_count - equal_count + 1,
            total_count - larger_count + 1,
        )
        # print(some_value, total_count, k, value_positions)
        if k in value_positions:
            return some_value
        elif k >= value_positions.stop:
            k -= (value_positions.stop - 1)
            data = data.where(col("id") > some_value)
            total_count = larger_count
        elif k < value_positions.start:
            data = data.where(col("id") < some_value)
            total_count -= (larger_count + equal_count)
{code}
Of course, this needs to be converted into a Catalyst Expression, but the basic 
idea is expressed there.
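
For a quick sanity check of the sketch (a minimal example I'm adding for illustration; it assumes a single numeric column named "id", as the sketch does, and a running {{spark}} session):
{code:python}
df_odd = spark.range(1, 8)     # values 1..7, single column "id"
df_even = spark.range(1, 11)   # values 1..10

print(spark_median(df_odd))    # 4.0
print(spark_median(df_even))   # 5.5
{code}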

I am looking at the definitions of 
[DeclarativeAggregate|https://github.com/apache/spark/blob/1dd0ca23f64acfc7a3dc697e19627a1b74012a2d/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/interfaces.scala#L381-L394]
 and 
[ImperativeAggregate|https://github.com/apache/spark/blob/1dd0ca23f64acfc7a3dc697e19627a1b74012a2d/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/interfaces.scala#L267-L285]
 and trying to find an existing expression to model after, but I don't think we 
have any existing aggregates that would work like this median—specifically, 
where multiple passes over the data are required (in this case, to count 
elements matching different filters).

Do you have any advice on how to approach converting this into a Catalyst 
expression?

There is an 
[NthValue|https://github.com/apache/spark/blob/568ad6aa4435ce76ca3b5d9966e64259ea1f9b38/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/windowExpressions.scala#L648-L675]
 window expression, but I don't think I can build on it to make my median 
expression since a) median shouldn't be limited to window expressions, and b) 
NthValue requires a complete sort of the data, which I want to avoid.

> proper `median` method for spark dataframe
> --
>
> Key: SPARK-26589
> URL: https://issues.apache.org/jira/browse/SPARK-26589
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Jan Gorecki
>Priority: Minor
>
> I found multiple tickets asking for median function to be implemented in 
> Spark. Most of those tickets links to "SPARK-6761 Approximate quantile" as 
> duplicate of it. The thing is that approximate quantile is a workaround for 
> lack of median function. Thus I am filling this Feature Request for proper, 
> exact, not approximation of, median function. I am aware about difficulties 
> that are caused by distributed environment when trying to compute median, 
> nevertheless I don't think those difficulties is reason good enough to drop 
> out `median` function from scope of Spark. I am not asking about efficient 
> median but exact median.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26589) proper `median` method for spark dataframe

2021-12-01 Thread Nicholas Chammas (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-26589?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17451936#comment-17451936
 ] 

Nicholas Chammas commented on SPARK-26589:
--

Just for reference, Stack Overflow provides evidence that a proper median 
function has been in high demand for some time:
 * [How can I calculate exact median with Apache 
Spark?|https://stackoverflow.com/q/28158729/877069] (14K views)
 * [How to find median and quantiles using 
Spark|https://stackoverflow.com/q/31432843/877069] (117K views)
 * [Median / quantiles within PySpark 
groupBy|https://stackoverflow.com/q/46845672/877069] (67K views)

> proper `median` method for spark dataframe
> --
>
> Key: SPARK-26589
> URL: https://issues.apache.org/jira/browse/SPARK-26589
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Jan Gorecki
>Priority: Minor
>
> I found multiple tickets asking for median function to be implemented in 
> Spark. Most of those tickets links to "SPARK-6761 Approximate quantile" as 
> duplicate of it. The thing is that approximate quantile is a workaround for 
> lack of median function. Thus I am filling this Feature Request for proper, 
> exact, not approximation of, median function. I am aware about difficulties 
> that are caused by distributed environment when trying to compute median, 
> nevertheless I don't think those difficulties is reason good enough to drop 
> out `median` function from scope of Spark. I am not asking about efficient 
> median but exact median.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-26589) proper `median` method for spark dataframe

2021-11-30 Thread Nicholas Chammas (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-26589?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17451283#comment-17451283
 ] 

Nicholas Chammas edited comment on SPARK-26589 at 11/30/21, 6:17 PM:
-

I think there is a potential solution using the algorithm [described here by 
Michael 
Harris|https://www.quora.com/Distributed-Algorithms/What-is-the-distributed-algorithm-to-determine-the-median-of-arrays-of-integers-located-on-different-computers].


was (Author: nchammas):
I'm going to try to implement this using the algorithm [described here by 
Michael 
Harris|https://www.quora.com/Distributed-Algorithms/What-is-the-distributed-algorithm-to-determine-the-median-of-arrays-of-integers-located-on-different-computers].

> proper `median` method for spark dataframe
> --
>
> Key: SPARK-26589
> URL: https://issues.apache.org/jira/browse/SPARK-26589
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Jan Gorecki
>Priority: Minor
>
> I found multiple tickets asking for median function to be implemented in 
> Spark. Most of those tickets links to "SPARK-6761 Approximate quantile" as 
> duplicate of it. The thing is that approximate quantile is a workaround for 
> lack of median function. Thus I am filling this Feature Request for proper, 
> exact, not approximation of, median function. I am aware about difficulties 
> that are caused by distributed environment when trying to compute median, 
> nevertheless I don't think those difficulties is reason good enough to drop 
> out `median` function from scope of Spark. I am not asking about efficient 
> median but exact median.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26589) proper `median` method for spark dataframe

2021-11-30 Thread Nicholas Chammas (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-26589?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17451283#comment-17451283
 ] 

Nicholas Chammas commented on SPARK-26589:
--

I'm going to try to implement this using the algorithm [described here by 
Michael 
Harris|https://www.quora.com/Distributed-Algorithms/What-is-the-distributed-algorithm-to-determine-the-median-of-arrays-of-integers-located-on-different-computers].

> proper `median` method for spark dataframe
> --
>
> Key: SPARK-26589
> URL: https://issues.apache.org/jira/browse/SPARK-26589
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Jan Gorecki
>Priority: Minor
>
> I found multiple tickets asking for median function to be implemented in 
> Spark. Most of those tickets links to "SPARK-6761 Approximate quantile" as 
> duplicate of it. The thing is that approximate quantile is a workaround for 
> lack of median function. Thus I am filling this Feature Request for proper, 
> exact, not approximation of, median function. I am aware about difficulties 
> that are caused by distributed environment when trying to compute median, 
> nevertheless I don't think those difficulties is reason good enough to drop 
> out `median` function from scope of Spark. I am not asking about efficient 
> median but exact median.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12185) Add Histogram support to Spark SQL/DataFrames

2021-11-30 Thread Nicholas Chammas (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-12185?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicholas Chammas updated SPARK-12185:
-
Labels:   (was: bulk-closed)

> Add Histogram support to Spark SQL/DataFrames
> -
>
> Key: SPARK-12185
> URL: https://issues.apache.org/jira/browse/SPARK-12185
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Holden Karau
>Priority: Minor
>
> While we have the ability to compute histograms on RDDs of Doubles it would 
> be good to also directly support histograms in Spark SQL (see 
> https://cwiki.apache.org/confluence/display/Hive/StatisticsAndDataMining#StatisticsAndDataMining-histogram_numeric():Estimatingfrequencydistributions
>  ).



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-12185) Add Histogram support to Spark SQL/DataFrames

2021-11-30 Thread Nicholas Chammas (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-12185?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicholas Chammas reopened SPARK-12185:
--

Reopening this because I think it's a valid improvement that mirrors the 
existing {{RDD.histogram}} method.
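
For reference, a quick sketch of what exists today at the RDD level versus the kind of DataFrame-level call this issue is asking for (the DataFrame form is purely hypothetical):
{code:python}
# Today: evenly spaced buckets over an RDD of doubles.
rdd = spark.sparkContext.parallelize([1.0, 2.2, 3.5, 7.0, 9.9])
buckets, counts = rdd.histogram(3)  # returns (bucket boundaries, counts per bucket)

# Hypothetical DataFrame-level equivalent (does not exist today):
# df.select(F.histogram("value", num_buckets=3))
{code}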

> Add Histogram support to Spark SQL/DataFrames
> -
>
> Key: SPARK-12185
> URL: https://issues.apache.org/jira/browse/SPARK-12185
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Holden Karau
>Priority: Minor
>  Labels: bulk-closed
>
> While we have the ability to compute histograms on RDDs of Doubles it would 
> be good to also directly support histograms in Spark SQL (see 
> https://cwiki.apache.org/confluence/display/Hive/StatisticsAndDataMining#StatisticsAndDataMining-histogram_numeric():Estimatingfrequencydistributions
>  ).



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-37393) Inline annotations for {ml, mllib}/common.py

2021-11-22 Thread Nicholas Chammas (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37393?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicholas Chammas resolved SPARK-37393.
--
Resolution: Duplicate

> Inline annotations for {ml, mllib}/common.py
> 
>
> Key: SPARK-37393
> URL: https://issues.apache.org/jira/browse/SPARK-37393
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, MLlib, PySpark
>Affects Versions: 3.2.0
>    Reporter: Nicholas Chammas
>Priority: Minor
>
> This will allow us to run type checks against those files themselves.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-37393) Inline annotations for {ml, mllib}/common.py

2021-11-19 Thread Nicholas Chammas (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37393?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicholas Chammas updated SPARK-37393:
-
Description: This will allow us to run type checks against those files 
themselves.

> Inline annotations for {ml, mllib}/common.py
> 
>
> Key: SPARK-37393
> URL: https://issues.apache.org/jira/browse/SPARK-37393
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, MLlib, PySpark
>Affects Versions: 3.2.0
>    Reporter: Nicholas Chammas
>Priority: Minor
>
> This will allow us to run type checks against those files themselves.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-37393) Inline annotations for {ml, mllib}/common.py

2021-11-19 Thread Nicholas Chammas (Jira)
Nicholas Chammas created SPARK-37393:


 Summary: Inline annotations for {ml, mllib}/common.py
 Key: SPARK-37393
 URL: https://issues.apache.org/jira/browse/SPARK-37393
 Project: Spark
  Issue Type: Improvement
  Components: ML, MLlib, PySpark
Affects Versions: 3.2.0
Reporter: Nicholas Chammas






--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-37380) Miscellaneous Python lint infra cleanup

2021-11-18 Thread Nicholas Chammas (Jira)
Nicholas Chammas created SPARK-37380:


 Summary: Miscellaneous Python lint infra cleanup
 Key: SPARK-37380
 URL: https://issues.apache.org/jira/browse/SPARK-37380
 Project: Spark
  Issue Type: Improvement
  Components: Project Infra, PySpark
Affects Versions: 3.2.0
Reporter: Nicholas Chammas


* pylintrc is obsolete
 * tox.ini should be reformatted for easy reading/updating
 * requirements used on CI should be in requirements.txt



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



Re: Supports Dynamic Table Options for Spark SQL

2021-11-15 Thread Nicholas Chammas
Side note about time travel: There is a PR to add VERSION/TIMESTAMP AS OF
syntax to Spark SQL.
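
For anyone skimming the thread below: as I understand the proposal (a sketch
only; the hint syntax comes from the PR under discussion and may change), it
lets you pass per-query read options in SQL the way you already can with the
DataFrame API, e.g.:

    # DataFrame API today: options flow to the table scan
    spark.read.option("snapshot-id", "10963874102873").table("t")

    # Proposed SQL form from the PR (not merged)
    spark.sql("SELECT * FROM t /*+ OPTIONS('snapshot-id'='10963874102873L') */")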

On Mon, Nov 15, 2021 at 2:23 PM Ryan Blue  wrote:

> I want to note that I wouldn't recommend time traveling this way by using
> the hint for `snapshot-id`. Instead, we want to add the standard SQL syntax
> for that in a separate PR. This is useful for other options that help a
> table scan perform better, like specifying the target split size.
>
> You're right that this isn't a typical optimizer hint, but I'm not sure
> what other syntax is possible for this use case. How else would we send
> custom properties through to the scan?
>
> On Mon, Nov 15, 2021 at 9:25 AM Mich Talebzadeh 
> wrote:
>
>> I am looking at the hint and it appears to me (I stand corrected), it is
>> a single table hint as below:
>>
>> -- time travel
>> SELECT * FROM t /*+ OPTIONS('snapshot-id'='10963874102873L') */
>>
>> My assumption is that any view on this table will also benefit from this
>> hint. This is not a hint to optimizer in a classical sense. Only a snapshot
>> hint. Normally, a hint is an instruction to the optimizer. When writing
>> SQL, one may know information about the data unknown to the optimizer.
>> Hints enable one to make decisions normally made by the optimizer,
>> sometimes causing the optimizer to select a plan that it sees as higher
>> cost.
>>
>>
>> So far as this case is concerned, it looks OK and I concur it should be
>> extended to write as well.
>>
>>
>> HTH
>>
>>
>>view my Linkedin profile
>> 
>>
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>>
>> On Mon, 15 Nov 2021 at 17:02, Russell Spitzer 
>> wrote:
>>
>>> I think since we probably will end up using this same syntax on write,
>>> this makes a lot of sense. Unless there is another good way to express a
>>> similar concept during a write operation I think going forward with this
>>> would be ok.
>>>
>>> On Mon, Nov 15, 2021 at 10:44 AM Ryan Blue  wrote:
>>>
 The proposed feature is to be able to pass options through SQL like you
 would when using the DataFrameReader API, so it would work for all
 sources that support read options. Read options are part of the DSv2 API,
 there just isn’t a way to pass options when using SQL. The PR also has a
 non-Iceberg example, which is being able to customize some JDBC source
 behaviors per query (e.g., fetchSize), rather than globally in the table’s
 options.

 The proposed syntax is odd, but I think that's an artifact of Spark
 introducing read options that aren't a normal part of SQL. Seems reasonable
 to me to pass them through a hint.

 On Mon, Nov 15, 2021 at 2:18 AM Mich Talebzadeh <
 mich.talebza...@gmail.com> wrote:

> Interesting.
>
> What is this going to add on top of support for Apache Iceberg. Will it be in
> line with support for Hive ACID tables or Delta Lake?
>
> HTH
>
>
>
>view my Linkedin profile
> 
>
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for
> any loss, damage or destruction of data or any other property which may
> arise from relying on this email's technical content is explicitly
> disclaimed. The author will in no case be liable for any monetary damages
> arising from such loss, damage or destruction.
>
>
>
>
> On Mon, 15 Nov 2021 at 01:56, Zhun Wang 
> wrote:
>
>> Hi dev,
>>
>> We are discussing Support Dynamic Table Options for Spark SQL (
>> https://github.com/apache/spark/pull/34072). It is currently not
>> sure if the syntax makes sense, and would like to know if there is other
>> feedback or opinion on this.
>>
>> I would appreciate any feedback on this.
>>
>> Thanks.
>>
>

 --
 Ryan Blue
 Tabular

>>>
>
> --
> Ryan Blue
> Tabular
>


[jira] [Updated] (SPARK-37336) Migrate _java2py to SparkSession

2021-11-15 Thread Nicholas Chammas (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37336?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicholas Chammas updated SPARK-37336:
-
Summary: Migrate _java2py to SparkSession  (was: Migrate common ML utils to 
SparkSession)

> Migrate _java2py to SparkSession
> 
>
> Key: SPARK-37336
> URL: https://issues.apache.org/jira/browse/SPARK-37336
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Affects Versions: 3.2.0
>    Reporter: Nicholas Chammas
>Priority: Minor
>
> {{_java2py()}} uses a deprecated method to create a SparkSession.
>  
> https://github.com/apache/spark/blob/2fe9af8b2b91d0a46782dd6fff57eca8609be105/python/pyspark/ml/common.py#L99



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-37336) Migrate common ML utils to SparkSession

2021-11-15 Thread Nicholas Chammas (Jira)
Nicholas Chammas created SPARK-37336:


 Summary: Migrate common ML utils to SparkSession
 Key: SPARK-37336
 URL: https://issues.apache.org/jira/browse/SPARK-37336
 Project: Spark
  Issue Type: Improvement
  Components: ML, PySpark
Affects Versions: 3.2.0
Reporter: Nicholas Chammas


{{_java2py()}} uses a deprecated method to create a SparkSession.
 
https://github.com/apache/spark/blob/2fe9af8b2b91d0a46782dd6fff57eca8609be105/python/pyspark/ml/common.py#L99
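
For reference, the direction I have in mind is roughly the following (a sketch only; the exact call sites still need to be worked out):
{code:python}
from pyspark.sql import SparkSession

# Sketch: prefer the active session / builder path over deprecated ways of
# obtaining a session.
spark = SparkSession.getActiveSession() or SparkSession.builder.getOrCreate()
{code}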



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-37335) Clarify output of FPGrowth

2021-11-15 Thread Nicholas Chammas (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37335?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicholas Chammas updated SPARK-37335:
-
Description: 
The association rules returned by FPGrowth include more columns than are 
documented, like {{{}lift{}}}:

[https://spark.apache.org/docs/latest/ml-frequent-pattern-mining.html]

We should offer a basic description of these columns. An _itemset_ should also 
be briefly defined.

  was:
The association rules returned by FPGrowth include more columns than are 
documented:

[https://spark.apache.org/docs/latest/ml-frequent-pattern-mining.html]

We should offer a basic description of these columns.


> Clarify output of FPGrowth
> --
>
> Key: SPARK-37335
> URL: https://issues.apache.org/jira/browse/SPARK-37335
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, ML
>Affects Versions: 3.2.0
>    Reporter: Nicholas Chammas
>Priority: Minor
>
> The association rules returned by FPGrowth include more columns than are 
> documented, like {{{}lift{}}}:
> [https://spark.apache.org/docs/latest/ml-frequent-pattern-mining.html]
> We should offer a basic description of these columns. An _itemset_ should 
> also be briefly defined.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-37335) Clarify output of FPGrowth

2021-11-15 Thread Nicholas Chammas (Jira)
Nicholas Chammas created SPARK-37335:


 Summary: Clarify output of FPGrowth
 Key: SPARK-37335
 URL: https://issues.apache.org/jira/browse/SPARK-37335
 Project: Spark
  Issue Type: Improvement
  Components: Documentation, ML
Affects Versions: 3.2.0
Reporter: Nicholas Chammas


The association rules returned by FPGrowth include more columns than are 
documented:

[https://spark.apache.org/docs/latest/ml-frequent-pattern-mining.html]

We should offer a basic description of these columns.
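
For example, a toy run like this (a sketch) surfaces the columns in question:
{code:python}
from pyspark.ml.fpm import FPGrowth

df = spark.createDataFrame(
    [(0, ["a", "b"]), (1, ["a", "b", "c"]), (2, ["a", "c"])],
    ["id", "items"],
)
model = FPGrowth(itemsCol="items", minSupport=0.5, minConfidence=0.6).fit(df)
# The output includes antecedent, consequent, confidence, and lift
# (plus support on recent versions), only some of which are documented.
model.associationRules.show()
{code}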



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



Jira components cleanup

2021-11-15 Thread Nicholas Chammas
https://issues.apache.org/jira/projects/SPARK?selectedItem=com.atlassian.jira.jira-projects-plugin:components-page

I think the "docs" component should be merged into "Documentation".

Likewise, the "k8" component should be merged into "Kubernetes".

I think anyone can technically update tags, but I think mass retagging
should be limited to admins (or at least, to someone who got prior approval
from an admin).

Nick


[jira] [Comment Edited] (SPARK-24853) Support Column type for withColumn and withColumnRenamed apis

2021-11-02 Thread Nicholas Chammas (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-24853?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17437394#comment-17437394
 ] 

Nicholas Chammas edited comment on SPARK-24853 at 11/2/21, 2:41 PM:


[~hyukjin.kwon] - It's not just for consistency. As noted in the description, 
this is useful when you are trying to rename a column with an ambiguous name.

For example, imagine two tables {{left}} and {{right}}, each with a column 
called {{count}}:
{code:python}
(
  left_counts.alias('left')
  .join(right_counts.alias('right'), on='join_key')
  .withColumn(
    'total_count',
    left_counts['count'] + right_counts['count']
  )
  .withColumnRenamed('left.count', 'left_count')  # no-op; alias doesn't work
  .withColumnRenamed('count', 'left_count')  # incorrect; it renames both count columns
  .withColumnRenamed(left_counts['count'], 'left_count')  # what, ideally, users want to do here
  .show()
){code}
If you don't mind, I'm going to reopen this issue.


was (Author: nchammas):
[~hyukjin.kwon] - It's not just for consistency. As noted in the description, 
this is useful when you are trying to rename a column with an ambiguous name.

For example, imagine two tables {{left}} and {{right}}, each with a column 
called {{count}}:
{code:java}
(
  left_counts.alias('left')
  .join(right_counts.alias('right'), on='join_key')
  .withColumn(
    'total_count',
    left_counts['count'] + right_counts['count']
  )
  .withColumnRenamed('left.count', 'left_count')  # no-op; alias doesn't work
  .withColumnRenamed('count', 'left_count')  # incorrect; it renames both count columns
  .withColumnRenamed(left_counts['count'], 'left_count')  # what, ideally, users want to do here
  .show()
){code}
If you don't mind, I'm going to reopen this issue.

> Support Column type for withColumn and withColumnRenamed apis
> -
>
> Key: SPARK-24853
> URL: https://issues.apache.org/jira/browse/SPARK-24853
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.2, 3.2.0
>Reporter: nirav patel
>Priority: Minor
>
> Can we add overloaded version of withColumn or withColumnRenamed that accept 
> Column type instead of String? That way I can specify FQN in case when there 
> is duplicate column names. e.g. if I have 2 columns with same name as a 
> result of join and I want to rename one of the field I can do it with this 
> new API.
>  
> This would be similar to Drop api which supports both String and Column type.
>  
> def
> withColumn(colName: Column, col: Column): DataFrame
> Returns a new Dataset by adding a column or replacing the existing column 
> that has the same name.
>  
> def
> withColumnRenamed(existingName: Column, newName: Column): DataFrame
> Returns a new Dataset with a column renamed.
>  
>  
>  
> I think there should also be this one:
>  
> def
> withColumnRenamed(existingName: *Column*, newName: *Column*): DataFrame
> Returns a new Dataset with a column renamed.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24853) Support Column type for withColumn and withColumnRenamed apis

2021-11-02 Thread Nicholas Chammas (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-24853?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17437396#comment-17437396
 ] 

Nicholas Chammas commented on SPARK-24853:
--

The [contributing guide|https://spark.apache.org/contributing.html] isn't clear 
on how to populate "Affects Version" for improvements, so I've just tagged the 
latest release.

> Support Column type for withColumn and withColumnRenamed apis
> -
>
> Key: SPARK-24853
> URL: https://issues.apache.org/jira/browse/SPARK-24853
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.2, 3.2.0
>Reporter: nirav patel
>Priority: Minor
>
> Can we add overloaded version of withColumn or withColumnRenamed that accept 
> Column type instead of String? That way I can specify FQN in case when there 
> is duplicate column names. e.g. if I have 2 columns with same name as a 
> result of join and I want to rename one of the field I can do it with this 
> new API.
>  
> This would be similar to Drop api which supports both String and Column type.
>  
> def
> withColumn(colName: Column, col: Column): DataFrame
> Returns a new Dataset by adding a column or replacing the existing column 
> that has the same name.
>  
> def
> withColumnRenamed(existingName: Column, newName: Column): DataFrame
> Returns a new Dataset with a column renamed.
>  
>  
>  
> I think there should also be this one:
>  
> def
> withColumnRenamed(existingName: *Column*, newName: *Column*): DataFrame
> Returns a new Dataset with a column renamed.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24853) Support Column type for withColumn and withColumnRenamed apis

2021-11-02 Thread Nicholas Chammas (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-24853?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicholas Chammas updated SPARK-24853:
-
Priority: Minor  (was: Major)

> Support Column type for withColumn and withColumnRenamed apis
> -
>
> Key: SPARK-24853
> URL: https://issues.apache.org/jira/browse/SPARK-24853
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.2
>Reporter: nirav patel
>Priority: Minor
>
> Can we add overloaded version of withColumn or withColumnRenamed that accept 
> Column type instead of String? That way I can specify FQN in case when there 
> is duplicate column names. e.g. if I have 2 columns with same name as a 
> result of join and I want to rename one of the field I can do it with this 
> new API.
>  
> This would be similar to Drop api which supports both String and Column type.
>  
> def
> withColumn(colName: Column, col: Column): DataFrame
> Returns a new Dataset by adding a column or replacing the existing column 
> that has the same name.
>  
> def
> withColumnRenamed(existingName: Column, newName: Column): DataFrame
> Returns a new Dataset with a column renamed.
>  
>  
>  
> I think there should also be this one:
>  
> def
> withColumnRenamed(existingName: *Column*, newName: *Column*): DataFrame
> Returns a new Dataset with a column renamed.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-24853) Support Column type for withColumn and withColumnRenamed apis

2021-11-02 Thread Nicholas Chammas (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-24853?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicholas Chammas reopened SPARK-24853:
--

> Support Column type for withColumn and withColumnRenamed apis
> -
>
> Key: SPARK-24853
> URL: https://issues.apache.org/jira/browse/SPARK-24853
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.2, 3.2.0
>Reporter: nirav patel
>Priority: Minor
>
> Can we add overloaded version of withColumn or withColumnRenamed that accept 
> Column type instead of String? That way I can specify FQN in case when there 
> is duplicate column names. e.g. if I have 2 columns with same name as a 
> result of join and I want to rename one of the field I can do it with this 
> new API.
>  
> This would be similar to Drop api which supports both String and Column type.
>  
> def
> withColumn(colName: Column, col: Column): DataFrame
> Returns a new Dataset by adding a column or replacing the existing column 
> that has the same name.
>  
> def
> withColumnRenamed(existingName: Column, newName: Column): DataFrame
> Returns a new Dataset with a column renamed.
>  
>  
>  
> I think there should also be this one:
>  
> def
> withColumnRenamed(existingName: *Column*, newName: *Column*): DataFrame
> Returns a new Dataset with a column renamed.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24853) Support Column type for withColumn and withColumnRenamed apis

2021-11-02 Thread Nicholas Chammas (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-24853?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17437394#comment-17437394
 ] 

Nicholas Chammas commented on SPARK-24853:
--

[~hyukjin.kwon] - It's not just for consistency. As noted in the description, 
this is useful when you are trying to rename a column with an ambiguous name.

For example, imagine two tables {{left}} and {{right}}, each with a column 
called {{count}}:
{code:java}
(
  left_counts.alias('left')
  .join(right_counts.alias('right'), on='join_key')
  .withColumn(
    'total_count',
    left_counts['count'] + right_counts['count']
  )
  .withColumnRenamed('left.count', 'left_count')  # no-op; alias doesn't work
  .withColumnRenamed('count', 'left_count')  # incorrect; it renames both count columns
  .withColumnRenamed(left_counts['count'], 'left_count')  # what, ideally, users want to do here
  .show()
){code}
If you don't mind, I'm going to reopen this issue.
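
(For the record, the workaround available today is to re-select with explicit aliases, sketched below with the same left_counts/right_counts frames. It works, but it forces you to re-list every column you want to keep.)
{code:python}
from pyspark.sql.functions import col

renamed = (
  left_counts.alias('left')
  .join(right_counts.alias('right'), on='join_key')
  .select(
    'join_key',
    col('left.count').alias('left_count'),
    col('right.count').alias('right_count'),
  )
)
{code}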

> Support Column type for withColumn and withColumnRenamed apis
> -
>
> Key: SPARK-24853
> URL: https://issues.apache.org/jira/browse/SPARK-24853
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.2
>Reporter: nirav patel
>Priority: Major
>
> Can we add overloaded version of withColumn or withColumnRenamed that accept 
> Column type instead of String? That way I can specify FQN in case when there 
> is duplicate column names. e.g. if I have 2 columns with same name as a 
> result of join and I want to rename one of the field I can do it with this 
> new API.
>  
> This would be similar to Drop api which supports both String and Column type.
>  
> def
> withColumn(colName: Column, col: Column): DataFrame
> Returns a new Dataset by adding a column or replacing the existing column 
> that has the same name.
>  
> def
> withColumnRenamed(existingName: Column, newName: Column): DataFrame
> Returns a new Dataset with a column renamed.
>  
>  
>  
> I think there should also be this one:
>  
> def
> withColumnRenamed(existingName: *Column*, newName: *Column*): DataFrame
> Returns a new Dataset with a column renamed.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24853) Support Column type for withColumn and withColumnRenamed apis

2021-11-02 Thread Nicholas Chammas (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-24853?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicholas Chammas updated SPARK-24853:
-
Affects Version/s: 3.2.0

> Support Column type for withColumn and withColumnRenamed apis
> -
>
> Key: SPARK-24853
> URL: https://issues.apache.org/jira/browse/SPARK-24853
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.2, 3.2.0
>Reporter: nirav patel
>Priority: Minor
>
> Can we add overloaded version of withColumn or withColumnRenamed that accept 
> Column type instead of String? That way I can specify FQN in case when there 
> is duplicate column names. e.g. if I have 2 columns with same name as a 
> result of join and I want to rename one of the field I can do it with this 
> new API.
>  
> This would be similar to Drop api which supports both String and Column type.
>  
> def
> withColumn(colName: Column, col: Column): DataFrame
> Returns a new Dataset by adding a column or replacing the existing column 
> that has the same name.
>  
> def
> withColumnRenamed(existingName: Column, newName: Column): DataFrame
> Returns a new Dataset with a column renamed.
>  
>  
>  
> I think there should also be this one:
>  
> def
> withColumnRenamed(existingName: *Column*, newName: *Column*): DataFrame
> Returns a new Dataset with a column renamed.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33000) cleanCheckpoints config does not clean all checkpointed RDDs on shutdown

2021-04-15 Thread Nicholas Chammas (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33000?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17322333#comment-17322333
 ] 

Nicholas Chammas commented on SPARK-33000:
--

Per the discussion [on the dev 
list|http://apache-spark-developers-list.1001551.n3.nabble.com/Shutdown-cleanup-of-disk-based-resources-that-Spark-creates-td30928.html]
 and [PR|https://github.com/apache/spark/pull/31742], it seems we just want to 
update the documentation to clarify that {{cleanCheckpoints}} does not impact 
shutdown behavior. i.e. Checkpoints are not meant to be cleaned up on shutdown 
(whether planned or unplanned), and the config is currently working as intended.
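
For anyone landing here looking for shutdown cleanup in the meantime, a simple best-effort workaround (a sketch; local filesystem paths only, adjust the directory to your own) is to remove the checkpoint directory yourself when the application exits:
{code:python}
import atexit
import shutil

checkpoint_dir = '/tmp/spark/checkpoint/'
spark.sparkContext.setCheckpointDir(checkpoint_dir)

# Best-effort removal on normal interpreter exit.
atexit.register(shutil.rmtree, checkpoint_dir, ignore_errors=True)
{code}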

> cleanCheckpoints config does not clean all checkpointed RDDs on shutdown
> 
>
> Key: SPARK-33000
> URL: https://issues.apache.org/jira/browse/SPARK-33000
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.6
>    Reporter: Nicholas Chammas
>Priority: Minor
>
> Maybe it's just that the documentation needs to be updated, but I found this 
> surprising:
> {code:python}
> $ pyspark
> ...
> >>> spark.conf.set('spark.cleaner.referenceTracking.cleanCheckpoints', 'true')
> >>> spark.sparkContext.setCheckpointDir('/tmp/spark/checkpoint/')
> >>> a = spark.range(10)
> >>> a.checkpoint()
> DataFrame[id: bigint] 
>   
> >>> exit(){code}
> The checkpoint data is left behind in {{/tmp/spark/checkpoint/}}. I expected 
> Spark to clean it up on shutdown.
> The documentation for {{spark.cleaner.referenceTracking.cleanCheckpoints}} 
> says:
> {quote}Controls whether to clean checkpoint files if the reference is out of 
> scope.
> {quote}
> When Spark shuts down, everything goes out of scope, so I'd expect all 
> checkpointed RDDs to be cleaned up.
> For the record, I see the same behavior in both the Scala and Python REPLs.
> Evidence the current behavior is confusing:
>  * [https://stackoverflow.com/q/52630858/877069]
>  * [https://stackoverflow.com/q/60009856/877069]
>  * [https://stackoverflow.com/q/61454740/877069]
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



Re: [DISCUSS] Support pandas API layer on PySpark

2021-03-17 Thread Nicholas Chammas
On Tue, Mar 16, 2021 at 9:15 PM Hyukjin Kwon  wrote:

>   I am currently thinking we will have to convert the Koalas tests to use
> unittests to match with PySpark for now.
>
Keep in mind that pytest supports unittest-based tests out of the box, so you should be able to
run pytest against the PySpark codebase without changing much about the
tests.


Re: [DISCUSS] Support pandas API layer on PySpark

2021-03-15 Thread Nicholas Chammas
On Mon, Mar 15, 2021 at 2:12 AM Reynold Xin  wrote:

> I don't think we should deprecate existing APIs.
>

+1

I strongly prefer Spark's immutable DataFrame API to the Pandas API. I
could be wrong, but I wager most people who have worked with both Spark and
Pandas feel the same way.

For the large community of current PySpark users, or users switching to
PySpark from another Spark language API, it doesn't make sense to deprecate
the current API, even by convention.


Re: Shutdown cleanup of disk-based resources that Spark creates

2021-03-11 Thread Nicholas Chammas
OK, perhaps the best course of action is to leave the current behavior
as-is but clarify the documentation for `.checkpoint()` and/or
`cleanCheckpoints`.

I personally find it confusing that `cleanCheckpoints` doesn't address
shutdown behavior, and the Stack Overflow links I shared
<https://issues.apache.org/jira/browse/SPARK-33000> show that many people
are in the same situation. There is clearly some demand for Spark to
automatically clean up checkpoints on shutdown. But perhaps that should
be... a new config? a rejected feature? something else? I dunno.

Does anyone else have thoughts on how to approach this?

On Wed, Mar 10, 2021 at 4:39 PM Attila Zsolt Piros <
piros.attila.zs...@gmail.com> wrote:

> > Checkpoint data is left behind after a normal shutdown, not just an
> unexpected shutdown. The PR description includes a simple demonstration of
> this.
>
> I think I might overemphasized a bit the "unexpected" adjective to show
> you the value in the current behavior.
>
> The feature configured with
> "spark.cleaner.referenceTracking.cleanCheckpoints" is about out of scoped
> references without ANY shutdown.
>
> It would be hard to distinguish that level (ShutdownHookManager) the
> unexpected from the intentional exits.
> As the user code (run by driver) could contain a System.exit() which was
> added by the developer for numerous reasons (this way distinguishing
> unexpected and not unexpected is not really an option).
> Even a third party library can contain s System.exit(). Would that be an
> unexpected exit or intentional? You can see it is hard to tell.
>
> To test the real feature
> behind "spark.cleaner.referenceTracking.cleanCheckpoints" you can create a
> reference within a scope which is closed. For example within the body of a
> function (without return value) and store it only in a local
> variable. After the scope is closed in case of our function when the caller
> gets the control back you have chance to see the context cleaner working
> (you might even need to trigger a GC too).
>
> On Wed, Mar 10, 2021 at 10:09 PM Nicholas Chammas <
> nicholas.cham...@gmail.com> wrote:
>
>> Checkpoint data is left behind after a normal shutdown, not just an
>> unexpected shutdown. The PR description includes a simple demonstration of
>> this.
>>
>> If the current behavior is truly intended -- which I find difficult to
>> believe given how confusing <https://stackoverflow.com/q/52630858/877069>
>> it <https://stackoverflow.com/q/60009856/877069> is
>> <https://stackoverflow.com/q/61454740/877069> -- then at the very least
>> we need to update the documentation for both `.checkpoint()` and
>> `cleanCheckpoints` to make that clear.
>>
>> > This way even after an unexpected exit the next run of the same app
>> should be able to pick up the checkpointed data.
>>
>> The use case you are describing potentially makes sense. But preserving
>> checkpoint data after an unexpected shutdown -- even when
>> `cleanCheckpoints` is set to true -- is a new guarantee that is not
>> currently expressed in the API or documentation. At least as far as I can
>> tell.
>>
>> On Wed, Mar 10, 2021 at 3:10 PM Attila Zsolt Piros <
>> piros.attila.zs...@gmail.com> wrote:
>>
>>> Hi Nick!
>>>
>>> I am not sure you are fixing a problem here. I think what you see is as
>>> problem is actually an intended behaviour.
>>>
>>> Checkpoint data should outlive the unexpected shutdowns. So there is a
>>> very important difference between the reference goes out of scope during a
>>> normal execution (in this case cleanup is expected depending on the config
>>> you mentioned) and when a references goes out of scope because of an
>>> unexpected error (in this case you should keep the checkpoint data).
>>>
>>> This way even after an unexpected exit the next run of the same app
>>> should be able to pick up the checkpointed data.
>>>
>>> Best Regards,
>>> Attila
>>>
>>>
>>>
>>>
>>> On Wed, Mar 10, 2021 at 8:10 PM Nicholas Chammas <
>>> nicholas.cham...@gmail.com> wrote:
>>>
>>>> Hello people,
>>>>
>>>> I'm working on a fix for SPARK-33000
>>>> <https://issues.apache.org/jira/browse/SPARK-33000>. Spark does not
>>>> cleanup checkpointed RDDs/DataFrames on shutdown, even if the appropriate
>>>> configs are set.
>>>>
>>>> In the course of developing a fix, another contributor pointed out
>>>> <https://github.com/apache/spa

[jira] [Commented] (SPARK-33436) PySpark equivalent of SparkContext.hadoopConfiguration

2021-03-10 Thread Nicholas Chammas (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17299283#comment-17299283
 ] 

Nicholas Chammas commented on SPARK-33436:
--

[~hyukjin.kwon] - Can you clarify please why this ticket is "Won't Fix"? Just 
so it's clear for others who come across this ticket.

Is `._jsc` the intended way for PySpark users to set S3A configs?
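
For context, this is the pattern I'm referring to (a sketch; the config key and value are placeholders):
{code:python}
# Workaround in use today: reach through the underlying JavaSparkContext.
spark.sparkContext._jsc.hadoopConfiguration().set(
    "fs.s3a.access.key", "PLACEHOLDER_ACCESS_KEY"
)
{code}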

> PySpark equivalent of SparkContext.hadoopConfiguration
> --
>
> Key: SPARK-33436
> URL: https://issues.apache.org/jira/browse/SPARK-33436
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.1.0
>Reporter: Nicholas Chammas
>Priority: Minor
>
> PySpark should offer an API to {{hadoopConfiguration}} to [match 
> Scala's|http://spark.apache.org/docs/latest/api/scala/org/apache/spark/SparkContext.html#hadoopConfiguration:org.apache.hadoop.conf.Configuration].
> Setting Hadoop configs within a job is handy for any configurations that are 
> not appropriate as cluster defaults, or that will not be known until run 
> time. The various {{fs.s3a.*}} configs are a good example of this.
> Currently, what people are doing is setting things like this [via 
> SparkContext._jsc.hadoopConfiguration()|https://stackoverflow.com/a/32661336/877069].



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



Re: Shutdown cleanup of disk-based resources that Spark creates

2021-03-10 Thread Nicholas Chammas
Checkpoint data is left behind after a normal shutdown, not just an
unexpected shutdown. The PR description includes a simple demonstration of
this.

If the current behavior is truly intended -- which I find difficult to
believe given how confusing <https://stackoverflow.com/q/52630858/877069> it
<https://stackoverflow.com/q/60009856/877069> is
<https://stackoverflow.com/q/61454740/877069> -- then at the very least we
need to update the documentation for both `.checkpoint()` and
`cleanCheckpoints` to make that clear.

> This way even after an unexpected exit the next run of the same app
should be able to pick up the checkpointed data.

The use case you are describing potentially makes sense. But preserving
checkpoint data after an unexpected shutdown -- even when
`cleanCheckpoints` is set to true -- is a new guarantee that is not
currently expressed in the API or documentation. At least as far as I can
tell.

On Wed, Mar 10, 2021 at 3:10 PM Attila Zsolt Piros <
piros.attila.zs...@gmail.com> wrote:

> Hi Nick!
>
> I am not sure you are fixing a problem here. I think what you see is as
> problem is actually an intended behaviour.
>
> Checkpoint data should outlive the unexpected shutdowns. So there is a
> very important difference between the reference goes out of scope during a
> normal execution (in this case cleanup is expected depending on the config
> you mentioned) and when a references goes out of scope because of an
> unexpected error (in this case you should keep the checkpoint data).
>
> This way even after an unexpected exit the next run of the same app should
> be able to pick up the checkpointed data.
>
> Best Regards,
> Attila
>
>
>
>
> On Wed, Mar 10, 2021 at 8:10 PM Nicholas Chammas <
> nicholas.cham...@gmail.com> wrote:
>
>> Hello people,
>>
>> I'm working on a fix for SPARK-33000
>> <https://issues.apache.org/jira/browse/SPARK-33000>. Spark does not
>> cleanup checkpointed RDDs/DataFrames on shutdown, even if the appropriate
>> configs are set.
>>
>> In the course of developing a fix, another contributor pointed out
>> <https://github.com/apache/spark/pull/31742#issuecomment-790987483> that
>> checkpointed data may not be the only type of resource that needs a fix for
>> shutdown cleanup.
>>
>> I'm looking for a committer who might have an opinion on how Spark should
>> clean up disk-based resources on shutdown. The last people who contributed
>> significantly to the ContextCleaner, where this cleanup happens, were
>> @witgo <https://github.com/witgo> and @andrewor14
>> <https://github.com/andrewor14>. But that was ~6 years ago, and I don't
>> think they are active on the project anymore.
>>
>> Any takers to take a look and give their thoughts? The PR is small
>> <https://github.com/apache/spark/pull/31742>. +39 / -2.
>>
>> Nick
>>
>>


[jira] [Updated] (SPARK-33000) cleanCheckpoints config does not clean all checkpointed RDDs on shutdown

2021-03-10 Thread Nicholas Chammas (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33000?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicholas Chammas updated SPARK-33000:
-
Description: 
Maybe it's just that the documentation needs to be updated, but I found this 
surprising:
{code:python}
$ pyspark
...
>>> spark.conf.set('spark.cleaner.referenceTracking.cleanCheckpoints', 'true')
>>> spark.sparkContext.setCheckpointDir('/tmp/spark/checkpoint/')
>>> a = spark.range(10)
>>> a.checkpoint()
DataFrame[id: bigint]   
>>> exit(){code}
The checkpoint data is left behind in {{/tmp/spark/checkpoint/}}. I expected 
Spark to clean it up on shutdown.

The documentation for {{spark.cleaner.referenceTracking.cleanCheckpoints}} says:
{quote}Controls whether to clean checkpoint files if the reference is out of 
scope.
{quote}
When Spark shuts down, everything goes out of scope, so I'd expect all 
checkpointed RDDs to be cleaned up.

For the record, I see the same behavior in both the Scala and Python REPLs.

Evidence the current behavior is confusing:
 * [https://stackoverflow.com/q/52630858/877069]
 * [https://stackoverflow.com/q/60009856/877069]
 * [https://stackoverflow.com/q/61454740/877069]

 

  was:
Maybe it's just that the documentation needs to be updated, but I found this 
surprising:
{code:python}
$ pyspark
...
>>> spark.conf.set('spark.cleaner.referenceTracking.cleanCheckpoints', 'true')
>>> spark.sparkContext.setCheckpointDir('/tmp/spark/checkpoint/')
>>> a = spark.range(10)
>>> a.checkpoint()
DataFrame[id: bigint]   
>>> exit(){code}
The checkpoint data is left behind in {{/tmp/spark/checkpoint/}}. I expected 
Spark to clean it up on shutdown.

The documentation for {{spark.cleaner.referenceTracking.cleanCheckpoints}} says:
{quote}Controls whether to clean checkpoint files if the reference is out of 
scope.
{quote}
When Spark shuts down, everything goes out of scope, so I'd expect all 
checkpointed RDDs to be cleaned up.

For the record, I see the same behavior in both the Scala and Python REPLs.


> cleanCheckpoints config does not clean all checkpointed RDDs on shutdown
> 
>
> Key: SPARK-33000
> URL: https://issues.apache.org/jira/browse/SPARK-33000
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.6
>Reporter: Nicholas Chammas
>Priority: Minor
>
> Maybe it's just that the documentation needs to be updated, but I found this 
> surprising:
> {code:python}
> $ pyspark
> ...
> >>> spark.conf.set('spark.cleaner.referenceTracking.cleanCheckpoints', 'true')
> >>> spark.sparkContext.setCheckpointDir('/tmp/spark/checkpoint/')
> >>> a = spark.range(10)
> >>> a.checkpoint()
> DataFrame[id: bigint] 
>   
> >>> exit(){code}
> The checkpoint data is left behind in {{/tmp/spark/checkpoint/}}. I expected 
> Spark to clean it up on shutdown.
> The documentation for {{spark.cleaner.referenceTracking.cleanCheckpoints}} 
> says:
> {quote}Controls whether to clean checkpoint files if the reference is out of 
> scope.
> {quote}
> When Spark shuts down, everything goes out of scope, so I'd expect all 
> checkpointed RDDs to be cleaned up.
> For the record, I see the same behavior in both the Scala and Python REPLs.
> Evidence the current behavior is confusing:
>  * [https://stackoverflow.com/q/52630858/877069]
>  * [https://stackoverflow.com/q/60009856/877069]
>  * [https://stackoverflow.com/q/61454740/877069]
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



Shutdown cleanup of disk-based resources that Spark creates

2021-03-10 Thread Nicholas Chammas
Hello people,

I'm working on a fix for SPARK-33000
. Spark does not cleanup
checkpointed RDDs/DataFrames on shutdown, even if the appropriate configs
are set.

In the course of developing a fix, another contributor pointed out
 that
checkpointed data may not be the only type of resource that needs a fix for
shutdown cleanup.

I'm looking for a committer who might have an opinion on how Spark should
clean up disk-based resources on shutdown. The last people who contributed
significantly to the ContextCleaner, where this cleanup happens, were @witgo
 and @andrewor14 .
But that was ~6 years ago, and I don't think they are active on the project
anymore.

Any takers to take a look and give their thoughts? The PR is small. +39 / -2.

Nick


[jira] [Commented] (SPARK-33000) cleanCheckpoints config does not clean all checkpointed RDDs on shutdown

2021-03-04 Thread Nicholas Chammas (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33000?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17295531#comment-17295531
 ] 

Nicholas Chammas commented on SPARK-33000:
--

[~caowang888] - If you're still interested in this issue, take a look at my PR 
and let me know what you think. Hopefully, I've understood the issue correctly 
and proposed an appropriate fix.

> cleanCheckpoints config does not clean all checkpointed RDDs on shutdown
> 
>
> Key: SPARK-33000
> URL: https://issues.apache.org/jira/browse/SPARK-33000
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.6
>    Reporter: Nicholas Chammas
>Priority: Minor
>
> Maybe it's just that the documentation needs to be updated, but I found this 
> surprising:
> {code:python}
> $ pyspark
> ...
> >>> spark.conf.set('spark.cleaner.referenceTracking.cleanCheckpoints', 'true')
> >>> spark.sparkContext.setCheckpointDir('/tmp/spark/checkpoint/')
> >>> a = spark.range(10)
> >>> a.checkpoint()
> DataFrame[id: bigint] 
>   
> >>> exit(){code}
> The checkpoint data is left behind in {{/tmp/spark/checkpoint/}}. I expected 
> Spark to clean it up on shutdown.
> The documentation for {{spark.cleaner.referenceTracking.cleanCheckpoints}} 
> says:
> {quote}Controls whether to clean checkpoint files if the reference is out of 
> scope.
> {quote}
> When Spark shuts down, everything goes out of scope, so I'd expect all 
> checkpointed RDDs to be cleaned up.
> For the record, I see the same behavior in both the Scala and Python REPLs.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (HADOOP-17562) Provide mechanism for explicitly specifying the compression codec for input files

2021-03-03 Thread Nicholas Chammas (Jira)
Nicholas Chammas created HADOOP-17562:
-

 Summary: Provide mechanism for explicitly specifying the 
compression codec for input files
 Key: HADOOP-17562
 URL: https://issues.apache.org/jira/browse/HADOOP-17562
 Project: Hadoop Common
  Issue Type: Improvement
Reporter: Nicholas Chammas


I come to you via SPARK-29280.

I am looking for the file _input_ equivalents of the following settings:
{code:java}
mapreduce.output.fileoutputformat.compress
mapreduce.map.output.compress{code}
Right now, I understand that Hadoop infers the codec to use when reading a file 
from the file's extension.

However, in some cases the files may have the incorrect extension or no 
extension. There are links to some examples from SPARK-29280.

Ideally, you should be able to explicitly specify the codec to use to read 
those files. I don't believe that's possible today. Instead, the current 
workaround appears to be to [create a custom codec 
class|https://stackoverflow.com/a/17152167/877069] and override the 
getDefaultExtension method to specify the extension to expect.

Does it make sense to offer an explicit way to select the compression codec for 
file input, mirroring how things work for file output?
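
For completeness, roughly what the registration side of that workaround looks like from Spark (a sketch; the class name is hypothetical):
{code:python}
# Sketch of the workaround described above. "com.example.NoExtensionGzipCodec"
# is a hypothetical codec class (written in Java/Scala, overriding
# getDefaultExtension) that Hadoop is told about via io.compression.codecs.
spark.sparkContext._jsc.hadoopConfiguration().set(
    "io.compression.codecs",
    "com.example.NoExtensionGzipCodec",
)
{code}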



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: common-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-dev-h...@hadoop.apache.org



Re: Auto-closing PRs or How to get reviewers' attention

2021-02-18 Thread Nicholas Chammas
On Thu, Feb 18, 2021 at 10:34 AM Sean Owen  wrote:

> There is no way to force people to review or commit something of course.
> And keep in mind we get a lot of, shall we say, unuseful pull requests.
> There is occasionally some blowback to closing someone's PR, so the path of
> least resistance is often the timeout / 'soft close'. That is, it takes a
> lot more time to satisfactorily debate down the majority of PRs that
> probably shouldn't get merged, and there just isn't that much bandwidth.
> That said of course it's bad if lots of good PRs are getting lost in the
> shuffle and I am sure there are some.
>
> One other aspect is that a committer is taking some degree of
> responsibility for merging a change, so the ask is more than just a few
> minutes of eyeballing. If it breaks something the merger pretty much owns
> resolving it, and, the whole project owns any consequence of the change for
> the future.
>

+1


Re: Auto-closing PRs or How to get reviewers' attention

2021-02-18 Thread Nicholas Chammas
On Thu, Feb 18, 2021 at 9:58 AM Enrico Minack 
wrote:

> *What is the approved way to ...*
>
> *... prevent it from being auto-closed?* Committing and commenting to the
> PR does not prevent it from being closed the next day.
>
Committing and commenting should prevent the PR from being closed. It may
be that commenting after the stale message has been posted does not work
(which would likely be a bug in the action or in our config), but there are
PRs that have been open for months with consistent activity that do not get
closed.

So at the very least, proactively committing or commenting every month will
keep the PR open. However, that's not the real problem, right? The real
problem is getting committer attention.

> *... re-open it?* The comment says "If you'd like to revive this PR,
> please reopen it ...", but there is no re-open button anywhere on the PR!
>
I don't know if there is a repo setting here that allows non-committers to
reopen their own closed PRs. At the very worst, you can always open a new
PR from the same branch, though we should update the stale message text if
contributors cannot in fact reopen their own PRs.

> What is the expected contributor's response to a PR that does not get
> feedback? Giving up?
>
I've baby-sat several PRs that took months to get in. Here's an example off
the top of my head (5-6 months to be merged in). I'm sure everyone on here,
including most
committers themselves, have had this experience. It's common. The expected
response is to be persistent, to try to find a committer or shepherd for
your PR, and to proactively make your PR easier to review.

> Are there processes in place to increase the probability PRs do not get
> forgotten, auto-closed and lost?
>
There are things you can do as a contributor to increase the likelihood
your PR will get reviewed, but I wouldn't call them "processes". This is an
open source project built on corporate sponsorship for some stuff and
volunteer energy for everything else. There is no guarantee or formal
obligation for anyone to do the work of reviewing PRs. That's just the
nature of open source work.

The things that you can do are:

   - Make your PR as small and focused as possible.
   - Make sure the build is passing and that you've followed the
   contributing guidelines.
   - Find the people who most recently worked on the area you're touching
   and ask them for help.
   - Address reviewers' requests and concerns.
   - Try to get some committer buy-in for your idea before spending time
   developing it.
   - Ask for input on the dev list for your PR.

Basically, most of the advice boils down to "make it easy for reviewers".
Even then, though, sometimes things won't work out (5-6 months and closed
without merging). It's just the nature of contributing to a large project like
Spark where there is a lot going on.


[jira] [Resolved] (SPARK-34194) Queries that only touch partition columns shouldn't scan through all files

2021-02-08 Thread Nicholas Chammas (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34194?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicholas Chammas resolved SPARK-34194.
--
Resolution: Won't Fix

> Queries that only touch partition columns shouldn't scan through all files
> --
>
> Key: SPARK-34194
> URL: https://issues.apache.org/jira/browse/SPARK-34194
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>    Reporter: Nicholas Chammas
>Priority: Minor
>
> When querying only the partition columns of a partitioned table, it seems 
> that Spark nonetheless scans through all files in the table, even though it 
> doesn't need to.
> Here's an example:
> {code:python}
> >>> data = spark.read.option('mergeSchema', 
> >>> 'false').parquet('s3a://some/dataset')
> [Stage 0:==>  (407 + 12) / 
> 1158]
> {code}
> Note the 1158 tasks. This matches the number of partitions in the table, 
> which is partitioned on a single field named {{file_date}}:
> {code:sh}
> $ aws s3 ls s3://some/dataset | head -n 3
>PRE file_date=2017-05-01/
>PRE file_date=2017-05-02/
>PRE file_date=2017-05-03/
> $ aws s3 ls s3://some/dataset | wc -l
> 1158
> {code}
> The table itself has over 138K files, though:
> {code:sh}
> $ aws s3 ls --recursive --human --summarize s3://some/dataset
> ...
> Total Objects: 138708
>Total Size: 3.7 TiB
> {code}
> Now let's try to query just the {{file_date}} field and see what Spark does.
> {code:python}
> >>> data.select('file_date').orderBy('file_date', 
> >>> ascending=False).limit(1).explain()
> == Physical Plan ==
> TakeOrderedAndProject(limit=1, orderBy=[file_date#11 DESC NULLS LAST], 
> output=[file_date#11])
> +- *(1) ColumnarToRow
>+- FileScan parquet [file_date#11] Batched: true, DataFilters: [], Format: 
> Parquet, Location: InMemoryFileIndex[s3a://some/dataset], PartitionFilters: 
> [], PushedFilters: [], ReadSchema: struct<>
> >>> data.select('file_date').orderBy('file_date', 
> >>> ascending=False).limit(1).show()
> [Stage 2:>   (179 + 12) / 
> 41011]
> {code}
> Notice that Spark has spun up 41,011 tasks. Maybe more will be needed as the 
> job progresses? I'm not sure.
> What I do know is that this operation takes a long time (~20 min) running 
> from my laptop, whereas to list the top-level {{file_date}} partitions via 
> the AWS CLI take a second or two.
> Spark appears to be going through all the files in the table, when it just 
> needs to list the partitions captured in the S3 "directory" structure. The 
> query is only touching {{file_date}}, after all.
> The current workaround for this performance problem / optimizer wastefulness, 
> is to [query the catalog 
> directly|https://stackoverflow.com/a/65724151/877069]. It works, but is a lot 
> of extra work compared to the elegant query against {{file_date}} that users 
> actually intend.
> Spark should somehow know when it is only querying partition fields and skip 
> iterating through all the individual files in a table.
> Tested on Spark 3.0.1.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34194) Queries that only touch partition columns shouldn't scan through all files

2021-02-08 Thread Nicholas Chammas (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34194?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17281269#comment-17281269
 ] 

Nicholas Chammas commented on SPARK-34194:
--

It's not clear to me whether SPARK-26709 is describing an inherent design issue 
that has no fix, or whether SPARK-26709 simply captures a bug in the past 
implementation of {{OptimizeMetadataOnlyQuery}} which could conceivably be 
fixed in the future.

If it's something that could be fixed and reintroduced, this issue should stay 
open. If we know for design reasons that metadata-only queries cannot be made 
reliably correct, then this issue should be closed with a clear explanation to 
that effect.

> Queries that only touch partition columns shouldn't scan through all files
> --
>
> Key: SPARK-34194
> URL: https://issues.apache.org/jira/browse/SPARK-34194
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>    Reporter: Nicholas Chammas
>Priority: Minor
>
> When querying only the partition columns of a partitioned table, it seems 
> that Spark nonetheless scans through all files in the table, even though it 
> doesn't need to.
> Here's an example:
> {code:python}
> >>> data = spark.read.option('mergeSchema', 
> >>> 'false').parquet('s3a://some/dataset')
> [Stage 0:==>  (407 + 12) / 
> 1158]
> {code}
> Note the 1158 tasks. This matches the number of partitions in the table, 
> which is partitioned on a single field named {{file_date}}:
> {code:sh}
> $ aws s3 ls s3://some/dataset | head -n 3
>PRE file_date=2017-05-01/
>PRE file_date=2017-05-02/
>PRE file_date=2017-05-03/
> $ aws s3 ls s3://some/dataset | wc -l
> 1158
> {code}
> The table itself has over 138K files, though:
> {code:sh}
> $ aws s3 ls --recursive --human --summarize s3://some/dataset
> ...
> Total Objects: 138708
>Total Size: 3.7 TiB
> {code}
> Now let's try to query just the {{file_date}} field and see what Spark does.
> {code:python}
> >>> data.select('file_date').orderBy('file_date', 
> >>> ascending=False).limit(1).explain()
> == Physical Plan ==
> TakeOrderedAndProject(limit=1, orderBy=[file_date#11 DESC NULLS LAST], 
> output=[file_date#11])
> +- *(1) ColumnarToRow
>+- FileScan parquet [file_date#11] Batched: true, DataFilters: [], Format: 
> Parquet, Location: InMemoryFileIndex[s3a://some/dataset], PartitionFilters: 
> [], PushedFilters: [], ReadSchema: struct<>
> >>> data.select('file_date').orderBy('file_date', 
> >>> ascending=False).limit(1).show()
> [Stage 2:>   (179 + 12) / 
> 41011]
> {code}
> Notice that Spark has spun up 41,011 tasks. Maybe more will be needed as the 
> job progresses? I'm not sure.
> What I do know is that this operation takes a long time (~20 min) running 
> from my laptop, whereas to list the top-level {{file_date}} partitions via 
> the AWS CLI take a second or two.
> Spark appears to be going through all the files in the table, when it just 
> needs to list the partitions captured in the S3 "directory" structure. The 
> query is only touching {{file_date}}, after all.
> The current workaround for this performance problem / optimizer wastefulness, 
> is to [query the catalog 
> directly|https://stackoverflow.com/a/65724151/877069]. It works, but is a lot 
> of extra work compared to the elegant query against {{file_date}} that users 
> actually intend.
> Spark should somehow know when it is only querying partition fields and skip 
> iterating through all the individual files in a table.
> Tested on Spark 3.0.1.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-34194) Queries that only touch partition columns shouldn't scan through all files

2021-02-01 Thread Nicholas Chammas (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34194?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17276869#comment-17276869
 ] 

Nicholas Chammas edited comment on SPARK-34194 at 2/2/21, 5:56 AM:
---

Interesting reference, [~attilapiros]. It looks like that config is internal to 
Spark and was [deprecated in Spark 
3.0|https://github.com/apache/spark/blob/bec80d7eec91ee83fcbb0e022b33bd526c80f423/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala#L918-L929]
 due to the correctness issue mentioned in that warning and documented in 
SPARK-26709.


was (Author: nchammas):
Interesting reference, [~attilapiros]. It looks like that config was 
[deprecated in Spark 
3.0|https://github.com/apache/spark/blob/bec80d7eec91ee83fcbb0e022b33bd526c80f423/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala#L918-L929]
 due to the correctness issue mentioned in that warning and documented in 
SPARK-26709.

> Queries that only touch partition columns shouldn't scan through all files
> --
>
> Key: SPARK-34194
> URL: https://issues.apache.org/jira/browse/SPARK-34194
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>    Reporter: Nicholas Chammas
>Priority: Minor
>
> When querying only the partition columns of a partitioned table, it seems 
> that Spark nonetheless scans through all files in the table, even though it 
> doesn't need to.
> Here's an example:
> {code:python}
> >>> data = spark.read.option('mergeSchema', 
> >>> 'false').parquet('s3a://some/dataset')
> [Stage 0:==>  (407 + 12) / 
> 1158]
> {code}
> Note the 1158 tasks. This matches the number of partitions in the table, 
> which is partitioned on a single field named {{file_date}}:
> {code:sh}
> $ aws s3 ls s3://some/dataset | head -n 3
>PRE file_date=2017-05-01/
>PRE file_date=2017-05-02/
>PRE file_date=2017-05-03/
> $ aws s3 ls s3://some/dataset | wc -l
> 1158
> {code}
> The table itself has over 138K files, though:
> {code:sh}
> $ aws s3 ls --recursive --human --summarize s3://some/dataset
> ...
> Total Objects: 138708
>Total Size: 3.7 TiB
> {code}
> Now let's try to query just the {{file_date}} field and see what Spark does.
> {code:python}
> >>> data.select('file_date').orderBy('file_date', 
> >>> ascending=False).limit(1).explain()
> == Physical Plan ==
> TakeOrderedAndProject(limit=1, orderBy=[file_date#11 DESC NULLS LAST], 
> output=[file_date#11])
> +- *(1) ColumnarToRow
>+- FileScan parquet [file_date#11] Batched: true, DataFilters: [], Format: 
> Parquet, Location: InMemoryFileIndex[s3a://some/dataset], PartitionFilters: 
> [], PushedFilters: [], ReadSchema: struct<>
> >>> data.select('file_date').orderBy('file_date', 
> >>> ascending=False).limit(1).show()
> [Stage 2:>   (179 + 12) / 
> 41011]
> {code}
> Notice that Spark has spun up 41,011 tasks. Maybe more will be needed as the 
> job progresses? I'm not sure.
> What I do know is that this operation takes a long time (~20 min) running 
> from my laptop, whereas to list the top-level {{file_date}} partitions via 
> the AWS CLI take a second or two.
> Spark appears to be going through all the files in the table, when it just 
> needs to list the partitions captured in the S3 "directory" structure. The 
> query is only touching {{file_date}}, after all.
> The current workaround for this performance problem / optimizer wastefulness, 
> is to [query the catalog 
> directly|https://stackoverflow.com/a/65724151/877069]. It works, but is a lot 
> of extra work compared to the elegant query against {{file_date}} that users 
> actually intend.
> Spark should somehow know when it is only querying partition fields and skip 
> iterating through all the individual files in a table.
> Tested on Spark 3.0.1.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34194) Queries that only touch partition columns shouldn't scan through all files

2021-02-01 Thread Nicholas Chammas (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34194?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17276869#comment-17276869
 ] 

Nicholas Chammas commented on SPARK-34194:
--

Interesting reference, [~attilapiros]. It looks like that config was 
[deprecated in Spark 
3.0|https://github.com/apache/spark/blob/bec80d7eec91ee83fcbb0e022b33bd526c80f423/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala#L918-L929]
 due to the correctness issue mentioned in that warning and documented in 
SPARK-26709.

> Queries that only touch partition columns shouldn't scan through all files
> --
>
> Key: SPARK-34194
> URL: https://issues.apache.org/jira/browse/SPARK-34194
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>    Reporter: Nicholas Chammas
>Priority: Minor
>
> When querying only the partition columns of a partitioned table, it seems 
> that Spark nonetheless scans through all files in the table, even though it 
> doesn't need to.
> Here's an example:
> {code:python}
> >>> data = spark.read.option('mergeSchema', 
> >>> 'false').parquet('s3a://some/dataset')
> [Stage 0:==>  (407 + 12) / 
> 1158]
> {code}
> Note the 1158 tasks. This matches the number of partitions in the table, 
> which is partitioned on a single field named {{file_date}}:
> {code:sh}
> $ aws s3 ls s3://some/dataset | head -n 3
>PRE file_date=2017-05-01/
>PRE file_date=2017-05-02/
>PRE file_date=2017-05-03/
> $ aws s3 ls s3://some/dataset | wc -l
> 1158
> {code}
> The table itself has over 138K files, though:
> {code:sh}
> $ aws s3 ls --recursive --human --summarize s3://some/dataset
> ...
> Total Objects: 138708
>Total Size: 3.7 TiB
> {code}
> Now let's try to query just the {{file_date}} field and see what Spark does.
> {code:python}
> >>> data.select('file_date').orderBy('file_date', 
> >>> ascending=False).limit(1).explain()
> == Physical Plan ==
> TakeOrderedAndProject(limit=1, orderBy=[file_date#11 DESC NULLS LAST], 
> output=[file_date#11])
> +- *(1) ColumnarToRow
>+- FileScan parquet [file_date#11] Batched: true, DataFilters: [], Format: 
> Parquet, Location: InMemoryFileIndex[s3a://some/dataset], PartitionFilters: 
> [], PushedFilters: [], ReadSchema: struct<>
> >>> data.select('file_date').orderBy('file_date', 
> >>> ascending=False).limit(1).show()
> [Stage 2:>   (179 + 12) / 
> 41011]
> {code}
> Notice that Spark has spun up 41,011 tasks. Maybe more will be needed as the 
> job progresses? I'm not sure.
> What I do know is that this operation takes a long time (~20 min) running 
> from my laptop, whereas to list the top-level {{file_date}} partitions via 
> the AWS CLI take a second or two.
> Spark appears to be going through all the files in the table, when it just 
> needs to list the partitions captured in the S3 "directory" structure. The 
> query is only touching {{file_date}}, after all.
> The current workaround for this performance problem / optimizer wastefulness, 
> is to [query the catalog 
> directly|https://stackoverflow.com/a/65724151/877069]. It works, but is a lot 
> of extra work compared to the elegant query against {{file_date}} that users 
> actually intend.
> Spark should somehow know when it is only querying partition fields and skip 
> iterating through all the individual files in a table.
> Tested on Spark 3.0.1.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (PARQUET-41) Add bloom filters to parquet statistics

2021-02-01 Thread Nicholas Chammas (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-41?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17276862#comment-17276862
 ] 

Nicholas Chammas commented on PARQUET-41:
-

Thanks for the link [~yumwang]. That 
[README|https://github.com/apache/parquet-mr/tree/master/parquet-hadoop#readme] 
is what I was looking for.

Are these docs published on the [documentation 
site|http://parquet.apache.org/documentation/latest/] anywhere, or is the 
README file on GitHub the canonical reference?

> Add bloom filters to parquet statistics
> ---
>
> Key: PARQUET-41
> URL: https://issues.apache.org/jira/browse/PARQUET-41
> Project: Parquet
>  Issue Type: New Feature
>  Components: parquet-format, parquet-mr
>Reporter: Alex Levenson
>Assignee: Junjie Chen
>Priority: Major
>  Labels: filter2, pull-request-available
> Fix For: format-2.7.0, 1.12.0
>
>
> For row groups with no dictionary, we could still produce a bloom filter. 
> This could be very useful in filtering entire row groups.
> Pull request:
> https://github.com/apache/parquet-mr/pull/215



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (PARQUET-41) Add bloom filters to parquet statistics

2021-02-01 Thread Nicholas Chammas (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-41?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17276842#comment-17276842
 ] 

Nicholas Chammas commented on PARQUET-41:
-

Where is the user documentation for all the bloom filter-related functionality 
that will be released as part of parquet-mr 1.12? I'm thinking of user settings 
like {{parquet.filter.bloom.enabled}} and {{parquet.bloom.filter.*}}, along 
with anything else a user might care about.

For example, if a Spark user wants to use or configure bloom filters on their 
Parquet data, what documentation should they reference?
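
To make the question concrete, here is my guess at what the write-side usage 
might look like from Spark, based on the parquet-hadoop config names; whether 
DataFrameWriter options are forwarded to parquet-mr this way, and the column 
name, are assumptions on my part:
{code:python}
# Hedged sketch -- assumes a Spark build on parquet-mr 1.12+ and that writer
# options reach the Parquet Hadoop configuration.
df = spark.range(1000).withColumnRenamed("id", "user_id")
(df.write
    .option("parquet.bloom.filter.enabled#user_id", "true")
    .option("parquet.bloom.filter.expected.ndv#user_id", "1000")
    .parquet("/tmp/users_with_bloom_filters"))
{code}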

> Add bloom filters to parquet statistics
> ---
>
> Key: PARQUET-41
> URL: https://issues.apache.org/jira/browse/PARQUET-41
> Project: Parquet
>  Issue Type: New Feature
>  Components: parquet-format, parquet-mr
>Reporter: Alex Levenson
>Assignee: Junjie Chen
>Priority: Major
>  Labels: filter2, pull-request-available
> Fix For: format-2.7.0, 1.12.0
>
>
> For row groups with no dictionary, we could still produce a bloom filter. 
> This could be very useful in filtering entire row groups.
> Pull request:
> https://github.com/apache/parquet-mr/pull/215



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[issue43094] sqlite3.create_function takes parameter named narg, not num_params

2021-02-01 Thread Nicholas Chammas


New submission from Nicholas Chammas :

The doc for sqlite3.create_function shows the signature as follows:

https://docs.python.org/3.9/library/sqlite3.html#sqlite3.Connection.create_function

```
create_function(name, num_params, func, *, deterministic=False)
```

But it appears that the parameter name is `narg`, not `num_params`. Trying 
`num_params` yields:

```
TypeError: function missing required argument 'narg' (pos 2)
```
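
A small script that illustrates the mismatch on CPython 3.9:

```
import sqlite3

con = sqlite3.connect(":memory:")

# The positional form matches the documented signature and works:
con.create_function("plus_one", 1, lambda x: x + 1)

# But the keyword for the second parameter is `narg`, not `num_params`:
con.create_function("plus_two", narg=1, func=lambda x: x + 2)           # works
# con.create_function("plus_two", num_params=1, func=lambda x: x + 2)   # TypeError

print(con.execute("SELECT plus_one(1), plus_two(1)").fetchone())  # (2, 3)
```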

--
assignee: docs@python
components: Documentation
messages: 386100
nosy: docs@python, nchammas
priority: normal
severity: normal
status: open
title: sqlite3.create_function takes parameter named narg, not num_params
versions: Python 3.9

___
Python tracker 
<https://bugs.python.org/issue43094>
___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



Re: [Spark SQL]: SQL, Python, Scala and R API Consistency

2021-01-28 Thread Nicholas Chammas
On Thu, Jan 28, 2021 at 3:40 PM Sean Owen  wrote:

> It isn't that regexp_extract_all (for example) is useless outside SQL,
> just, where do you draw the line? Supporting 10s of random SQL functions
> across 3 other languages has a cost, which has to be weighed against
> benefit, which we can never measure well except anecdotally: one or two
> people say "I want this" in a sea of hundreds of thousands of users.
>

+1 to this, but I will add that Jira and Stack Overflow activity can
sometimes give good signals about API gaps that are frustrating users. If
there is an SO question with 30K views about how to do something that
should have been easier, then that's an important signal about the API.

For this specific case, I think there is a fine argument
> that regexp_extract_all should be added simply for consistency
> with regexp_extract. I can also see the argument that regexp_extract was a
> step too far, but, what's public is now a public API.
>

I think in this case a few references to where/how people are having to
work around missing a direct function for regexp_extract_all could help
guide the decision. But that itself means we are making these decisions on
a case-by-case basis.
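
For what it's worth, the workaround I'd expect people to reach for is expr(),
since the SQL function exists even where the Python/Scala wrappers don't. A
rough sketch, assuming a build where regexp_extract_all is available in SQL:

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("100-200, 300-400",)], ["s"])
df.select(
    F.expr(r"regexp_extract_all(s, '(\\d+)-(\\d+)', 1)").alias("matches")
).show(truncate=False)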

From a user perspective, it's definitely conceptually simpler to have SQL
functions be consistent and available across all APIs.

Perhaps if we had a way to lower the maintenance burden of keeping
functions in sync across SQL/Scala/Python/R, it would be easier for
everyone to agree to just have all the functions be included across the
board all the time.

Would, for example, some sort of automatic testing mechanism for SQL
functions help here? Something that uses a common function testing
specification to automatically test SQL, Scala, Python, and R functions,
without requiring maintainers to write tests for each language's version of
the functions. Would that address the maintenance burden?
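
To make that concrete, a toy sketch of what such a shared spec might look like
(purely illustrative; nothing like this exists today):

# Language-agnostic test cases; each binding (SQL, Scala, Python, R) would
# run the same list through its own wrapper.
SPEC = [
    {
        "function": "regexp_extract_all",
        "sql_args": ["'100-200, 300-400'", r"'(\\d+)-(\\d+)'", "1"],
        "expected": ["100", "300"],
    },
]

def check_sql_side(spark, spec):
    for case in spec:
        query = "SELECT {}({})".format(case["function"], ", ".join(case["sql_args"]))
        result = spark.sql(query).collect()[0][0]
        assert list(result) == case["expected"], (query, result)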


[jira] [Commented] (SPARK-12890) Spark SQL query related to only partition fields should not scan the whole data.

2021-01-21 Thread Nicholas Chammas (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-12890?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17269571#comment-17269571
 ] 

Nicholas Chammas commented on SPARK-12890:
--

I've created SPARK-34194 and fleshed out the description of the problem a bit.

> Spark SQL query related to only partition fields should not scan the whole 
> data.
> 
>
> Key: SPARK-12890
> URL: https://issues.apache.org/jira/browse/SPARK-12890
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Prakash Chockalingam
>Priority: Minor
>
> I have a SQL query which has only partition fields. The query ends up 
> scanning all the data which is unnecessary.
> Example: select max(date) from table, where the table is partitioned by date.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-34194) Queries that only touch partition columns shouldn't scan through all files

2021-01-21 Thread Nicholas Chammas (Jira)
Nicholas Chammas created SPARK-34194:


 Summary: Queries that only touch partition columns shouldn't scan 
through all files
 Key: SPARK-34194
 URL: https://issues.apache.org/jira/browse/SPARK-34194
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.2.0
Reporter: Nicholas Chammas


When querying only the partition columns of a partitioned table, it seems that 
Spark nonetheless scans through all files in the table, even though it doesn't 
need to.

Here's an example:
{code:python}
>>> data = spark.read.option('mergeSchema', 
>>> 'false').parquet('s3a://some/dataset')
[Stage 0:==>  (407 + 12) / 1158]
{code}
Note the 1158 tasks. This matches the number of partitions in the table, which 
is partitioned on a single field named {{file_date}}:
{code:sh}
$ aws s3 ls s3://some/dataset | head -n 3
   PRE file_date=2017-05-01/
   PRE file_date=2017-05-02/
   PRE file_date=2017-05-03/

$ aws s3 ls s3://some/dataset | wc -l
1158
{code}
The table itself has over 138K files, though:
{code:sh}
$ aws s3 ls --recursive --human --summarize s3://some/dataset
...
Total Objects: 138708
   Total Size: 3.7 TiB
{code}
Now let's try to query just the {{file_date}} field and see what Spark does.
{code:python}
>>> data.select('file_date').orderBy('file_date', 
>>> ascending=False).limit(1).explain()
== Physical Plan ==
TakeOrderedAndProject(limit=1, orderBy=[file_date#11 DESC NULLS LAST], 
output=[file_date#11])
+- *(1) ColumnarToRow
   +- FileScan parquet [file_date#11] Batched: true, DataFilters: [], Format: 
Parquet, Location: InMemoryFileIndex[s3a://some/dataset], PartitionFilters: [], 
PushedFilters: [], ReadSchema: struct<>

>>> data.select('file_date').orderBy('file_date', 
>>> ascending=False).limit(1).show()
[Stage 2:>   (179 + 12) / 41011]
{code}
Notice that Spark has spun up 41,011 tasks. Maybe more will be needed as the 
job progresses? I'm not sure.

What I do know is that this operation takes a long time (~20 min) running from 
my laptop, whereas to list the top-level {{file_date}} partitions via the AWS 
CLI take a second or two.

Spark appears to be going through all the files in the table, when it just 
needs to list the partitions captured in the S3 "directory" structure. The 
query is only touching {{file_date}}, after all.

The current workaround for this performance problem / optimizer wastefulness, 
is to [query the catalog directly|https://stackoverflow.com/a/65724151/877069]. 
It works, but is a lot of extra work compared to the elegant query against 
{{file_date}} that users actually intend.

Spark should somehow know when it is only querying partition fields and skip 
iterating through all the individual files in a table.

Tested on Spark 3.0.1.
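
For reference, a sketch of that catalog-based workaround, assuming the data is 
registered in the metastore as a partitioned table (here called {{some_table}}):
{code:python}
from pyspark.sql import functions as F

# SHOW PARTITIONS only reads catalog metadata, not the 138K data files.
partitions = spark.sql("SHOW PARTITIONS some_table")
latest = (
    partitions
    .select(F.regexp_extract("partition", r"file_date=(.+)", 1).alias("file_date"))
    .orderBy(F.col("file_date").desc())
    .limit(1)
)
latest.show()
{code}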



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


