[ https://issues.apache.org/jira/browse/SPARK-37055?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17432159#comment-17432159 ]
Hyukjin Kwon commented on SPARK-37055:
--------------------------------------

Actually, it's more to prevent running Spark jobs solely for the sake of input validation. For example, assume a pandas API requires its input to contain the same values:
{code:python}
def abc(self, df):
    if self.sort_values() != df.sort_values():
        raise Exception("all values have to be the same")
{code}
and assume that the input {{df}} carries a very complicated computation chain, for example:
{code:python}
df = spark.read.csv(...).sort(...).repartition(...).sort(...).agg(...)
{code}
Then:
{code:python}
another_df.abc(df)  # would compute `df` twice (plus a sort of each df)
{code}
So this JIRA aims to keep the eager check enabled by default (to match pandas' behaviour) but provide an option to avoid such expensive computation.

> Apply 'compute.eager_check' across all the codebase
> ---------------------------------------------------
>
>                 Key: SPARK-37055
>                 URL: https://issues.apache.org/jira/browse/SPARK-37055
>             Project: Spark
>          Issue Type: Umbrella
>          Components: PySpark
>    Affects Versions: 3.3.0
>            Reporter: dch nguyen
>            Priority: Major
>
> As [~hyukjin.kwon] guides:
> 1. Make every input validation like this covered by the new configuration. For example:
> {code:python}
> - a == b
> + def eager_check(f):  # Utility function
> +     return not config.compute.eager_check and f()
> +
> + eager_check(lambda: a == b)
> {code}
> 2. We should check whether the output still makes sense even when the behaviour no longer matches pandas'. If the output does not make sense, we shouldn't cover that validation with this configuration.
> 3. Make this configuration enabled by default so that we match pandas' behaviour by default.
>
> We have to make sure to list which APIs are affected in the description of 'compute.eager_check'.

--
This message was sent by Atlassian Jira (v8.3.4#803005)
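The pattern proposed above can be sketched as a small standalone script. Note a nuance: the quoted diff negates the flag ({{not config.compute.eager_check}}) because of how it is spliced into existing checks; the sketch below assumes the plain reading instead, where the validation runs when eager checking is enabled, matching the stated default behaviour. `Config` and `validate_same_values` are hypothetical stand-ins, not the real pandas-on-Spark option machinery.

```python
# Sketch of the `eager_check` utility, under the assumptions stated above.
from dataclasses import dataclass


@dataclass
class Config:
    eager_check: bool = True  # eager validation on by default, like pandas


config = Config()


def eager_check(f):
    # Run the (potentially expensive) validation only when the option is
    # enabled; otherwise short-circuit and skip the check entirely. In the
    # real setting, skipping avoids triggering extra Spark jobs.
    return config.eager_check and f()


def validate_same_values(a, b):
    # Hypothetical validation: both inputs must hold the same values.
    if eager_check(lambda: sorted(a) != sorted(b)):
        raise ValueError("all values have to be the same")


validate_same_values([3, 1, 2], [1, 2, 3])  # same values: passes
config.eager_check = False
validate_same_values([1, 2], [3, 4])        # check skipped: no exception
```

With the option disabled, the lambda is never evaluated, so the expensive recomputation of the input never happens; the trade-off is that mismatched inputs go undetected.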