Re: Avoiding unnecessary sort in FileFormatWriter/DynamicPartitionDataWriter

2020-09-08 Thread Cheng Su
Thanks, Ximo. On our side, we see similar cases in production as well, and we 
added this feature internally a couple of years ago. Let me submit a new PR 
(mostly rebasing https://github.com/apache/spark/pull/23163 onto the latest 
master, with a better code structure), if there’s no objection.

Thanks,
Cheng Su

From: XIMO GUANTER GONZALBEZ
Date: Sunday, September 6, 2020 at 10:55 PM
To: Cheng Su, Reynold Xin
Cc: Spark Dev List
Subject: RE: Avoiding unnecessary sort in 
FileFormatWriter/DynamicPartitionDataWriter

> 1. If the number of writers exceeds a pre-defined threshold (controlled by 
> a config), we sort the rest of the input rows and fall back to the current 
> mode for writing. The config can be disabled by default to be consistent 
> with current behavior, and users can choose to opt in to the non-sort mode 
> if they benefit from not sorting a large amount of data.

With both of those points in place, I think the plan is very reasonable: it 
wouldn’t affect anyone who isn’t actively tuning Spark, and it gives those of 
us who are hitting this sort the tools to improve performance in our 
scenario.

Cheers,
Ximo.

From: Cheng Su
Sent: Friday, September 4, 2020, 8:38 PM
To: Reynold Xin; XIMO GUANTER GONZALBEZ
Cc: Spark Dev List
Subject: Re: Avoiding unnecessary sort in 
FileFormatWriter/DynamicPartitionDataWriter

Hi,

Just for context: I created the JIRA for this around 2 years ago 
(https://issues.apache.org/jira/browse/SPARK-26164), along with a PR that 
went stale and was not merged (https://github.com/apache/spark/pull/23163). I 
recently discussed this with Wenchen again, and it looks like it might be 
reasonable to:


  1.  Open multiple writers in parallel to write partitions/buckets.
  2.  If the number of writers exceeds a pre-defined threshold (controlled by 
a config), sort the rest of the input rows and fall back to the current mode 
for writing.

The approach uses the number of writers as a proxy for memory usage; I agree 
this is quite rudimentary. But given that memory usage from writers is not 
visible to Spark right now, it seems to me that there’s no better way to model 
the memory usage of a write. Internally we did this the same way, but our 
internal ORC is customized to work better with our internal Spark on memory 
usage, so we don’t see many OOM issues (non-vectorized code path).

The config can be disabled by default to be consistent with current behavior, 
and users can choose to opt in to the non-sort mode if they benefit from not 
sorting a large amount of data.
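A minimal sketch of the two-step plan above, in plain Python rather than 
Spark’s actual writer API (the names write_partitioned, open_writer, and 
max_writers are illustrative, not real Spark identifiers):

```python
def write_partitioned(rows, partition_key, max_writers, open_writer):
    """Toy model: keep one open writer per partition value up to a limit;
    rows for any further partitions are buffered, sorted by partition, and
    written one partition at a time (the current sort-based mode)."""
    writers = {}   # partition value -> open writer (concurrent mode)
    overflow = []  # rows for partitions beyond the writer limit
    for row in rows:
        key = partition_key(row)
        if key in writers:
            writers[key].write(row)
        elif len(writers) < max_writers:
            writers[key] = open_writer(key)
            writers[key].write(row)
        else:
            overflow.append(row)
    for w in writers.values():
        w.close()
    # Fallback path: sorting makes each remaining partition contiguous,
    # so a single open writer at a time suffices (current Spark behavior).
    overflow.sort(key=partition_key)
    writer, current = None, None
    for row in overflow:
        key = partition_key(row)
        if key != current:
            if writer is not None:
                writer.close()
            writer, current = open_writer(key), key
        writer.write(row)
    if writer is not None:
        writer.close()
```

This also shows why the writer count is only a rough proxy for memory: each 
open writer carries its own format-specific buffers, which this toy model 
does not account for.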

Does this sound like a good plan? I would like to get more opinions on this. Thanks.

Cheng Su

From: Reynold Xin <r...@databricks.com>
Date: Friday, September 4, 2020 at 10:33 AM
To: XIMO GUANTER GONZALBEZ <joaquin.guantergonzal...@telefonica.com>
Cc: Spark Dev List <dev@spark.apache.org>
Subject: Re: Avoiding unnecessary sort in 
FileFormatWriter/DynamicPartitionDataWriter


The issue is memory overhead. Writing files creates a lot of buffering 
(especially in columnar formats like Parquet/ORC). Even a few file handles 
and buffers per task can easily OOM the entire process.


On Fri, Sep 04, 2020 at 5:51 AM, XIMO GUANTER GONZALBEZ 
<joaquin.guantergonzal...@telefonica.com> wrote:
Hello,

I have observed that if a DataFrame is saved with partitioning columns in 
Parquet, then a sort is performed in FileFormatWriter (see 
https://github.com/apache/spark/blob/v3.0.0/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileFormatWriter.scala#L152), 
because DynamicPartitionDataWriter only supports having a single file open at 
a time (see 
https://github.com/apache/spark/blob/v3.0.0/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileFormatDataWriter.scala#L170-L171). 
I think it would be possible to avoid this sort (a major bottleneck in some 
of my scenarios) if DynamicPartitionDataWriter could have multiple files open 
at the same time, writing each piece of data to its corresponding file.

Would that change be a welcome PR for the project or is there any major problem 
that I am not considering that would prevent removing this sort?

Thanks,
Ximo.




Some more detail about the problem, in case I didn’t explain myself clearly: 
suppose we have a DataFrame that we want to partition by column A:

Column A | Column B
4        | A
1        | B
2        | C

The current behavior will first sort the dataframe:

Column A | Column B
1        | B
2        | C
4        | A

So that DynamicPartitionDataWriter can have a single file open, since all the 
data for a single partition will be adjacent and can be iterated over 
sequentially. In order to process the first row, DynamicPartitionDataWriter 
will open a file in /columnA=1/part-r-0-.parquet and write the
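The mechanism described above can be sketched with a toy model (plain Python, 
not Spark code): once rows are sorted by the partition column, each 
partition’s rows are contiguous, so the writer can finish one file before 
opening the next.

```python
from itertools import groupby

rows = [(4, "A"), (1, "B"), (2, "C")]  # (Column A, Column B)
rows.sort(key=lambda r: r[0])          # the sort this thread wants to avoid

written = {}                           # partition value -> rows "written"
for part, group in groupby(rows, key=lambda r: r[0]):
    # In Spark this is where a file like /columnA=1/part-... would be
    # opened; only one "file" is ever open at a time.
    written[part] = [b for _, b in group]
```

Note that groupby only yields whole partitions because the input is sorted; 
without the sort, rows for one partition could be interleaved with others, 
which is exactly why multiple open writers would be needed.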

Re: [VOTE] Release Spark 2.4.7 (RC3)

2020-09-08 Thread Mridul Muralidharan
+1

Signatures, digests, etc check out fine.
Checked out tag and built/tested with -Pyarn -Phadoop-2.7 -Phive
-Phive-thriftserver -Pmesos -Pkubernetes

Thanks,
Mridul


On Tue, Sep 8, 2020 at 8:55 AM Prashant Sharma  wrote:

> Please vote on releasing the following candidate as Apache Spark
> version 2.4.7.
>
> [...]


Re: [VOTE] Release Spark 2.4.7 (RC3)

2020-09-08 Thread Sean Owen
+1 from me, as with the last RC.

(This is no big deal, but
https://repository.apache.org/content/repositories/orgapachespark-1361/
says it is not exposed.)

I got a few failures in tests, but I think they are likely due to the
system I'm running on. If nobody else sees these (and Jenkins seems
OK), then I presume they're transient.

org.apache.spark.sql.hive.HiveExternalCatalogVersionsSuite *** ABORTED ***
  Exception encountered when invoking run on a nested suite -
spark-submit returned with exit code 1.
  Command line: './bin/spark-submit' '--name' 'prepare testing tables'
'--master' 'local[2]' '--conf' 'spark.ui.enabled=false' '--conf'
'spark.master.rest.enabled=false' '--conf'
'spark.sql.warehouse.dir=/mnt/data/spark-2.4.7/sql/hive/target/tmp/warehouse-b5a650df-7344-4d27-85f5-2ac35421e3f8'
'--conf' 'spark.sql.test.version.index=0' '--driver-java-options'
'-Dderby.system.home=/mnt/data/spark-2.4.7/sql/hive/target/tmp/warehouse-b5a650df-7344-4d27-85f5-2ac35421e3f8'
'/mnt/data/spark-2.4.7/sql/hive/target/tmp/test6621665241016055697.py'

- run Python application in yarn-client mode *** FAILED ***
 ...
 File "/mnt/data/spark-2.4.7/python/lib/pyspark.zip/pyspark/cloudpickle.py",
line 145, in 
File "/mnt/data/spark-2.4.7/python/lib/pyspark.zip/pyspark/cloudpickle.py",
line 126, in _make_cell_set_template_code
  TypeError: an integer is required (got type bytes)
  20/09/08 13:28:27 INFO ShutdownHookManager: Shutdown hook called

On Tue, Sep 8, 2020 at 8:54 AM Prashant Sharma  wrote:
>
> Please vote on releasing the following candidate as Apache Spark version
> 2.4.7.
>
> [...]

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



[VOTE] Release Spark 2.4.7 (RC3)

2020-09-08 Thread Prashant Sharma
Please vote on releasing the following candidate as Apache Spark
version 2.4.7.

The vote is open until Sep 11th at 9AM PST and passes if a majority +1 PMC
votes are cast, with a minimum of 3 +1 votes.

[ ] +1 Release this package as Apache Spark 2.4.7
[ ] -1 Do not release this package because ...

To learn more about Apache Spark, please see http://spark.apache.org/

There are currently no issues targeting 2.4.7 (try project = SPARK AND
"Target Version/s" = "2.4.7" AND status in (Open, Reopened, "In Progress"))

The tag to be voted on is v2.4.7-rc3 (commit
14211a19f53bd0f413396582c8970e3e0a74281d):
https://github.com/apache/spark/tree/v2.4.7-rc3

The release files, including signatures, digests, etc. can be found at:
https://dist.apache.org/repos/dist/dev/spark/v2.4.7-rc3-bin/

Signatures used for Spark RCs can be found in this file:
https://dist.apache.org/repos/dist/dev/spark/KEYS

The staging repository for this release can be found at:
https://repository.apache.org/content/repositories/orgapachespark-1361/

The documentation corresponding to this release can be found at:
https://dist.apache.org/repos/dist/dev/spark/v2.4.7-rc3-docs/

The list of bug fixes going into 2.4.7 can be found at the following URL:
https://s.apache.org/spark-v2.4.7-rc3

This release is using the release script of the tag v2.4.7-rc3.

FAQ


=
How can I help test this release?
=

If you are a Spark user, you can help us test this release by taking
an existing Spark workload and running on this release candidate, then
reporting any regressions.

If you're working in PySpark, you can set up a virtual env, install
the current RC, and see if anything important breaks. In Java/Scala,
you can add the staging repository to your project's resolvers and test
with the RC (make sure to clean up the artifact cache before/after so
you don't end up building with an out-of-date RC going forward).
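For the Java/Scala path, adding the staging repository to a project's 
resolvers might look like the following sbt fragment (a sketch; the resolver 
name and the choice of the spark-sql module are illustrative, while the 
staging URL is the one from this email):

```
// build.sbt -- illustrative fragment for testing the RC
resolvers += "Apache Spark 2.4.7 RC3 staging" at
  "https://repository.apache.org/content/repositories/orgapachespark-1361/"

libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.4.7"
```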

===
What should happen to JIRA tickets still targeting 2.4.7?
===

The current list of open tickets targeted at 2.4.7 can be found at:
https://issues.apache.org/jira/projects/SPARK and search for "Target
Version/s" = 2.4.7

Committers should look at those and triage. Extremely important bug
fixes, documentation, and API tweaks that impact compatibility should
be worked on immediately. Everything else please retarget to an
appropriate release.

==
But my bug isn't fixed?
==

In order to make timely releases, we will typically not hold the
release unless the bug in question is a regression from the previous
release. That being said, if there is something which is a regression
that has not been correctly targeted please ping me or a committer to
help target the issue.