[ https://issues.apache.org/jira/browse/SPARK-33583?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Miron updated SPARK-33583:
--------------------------
    Summary: Query on large dataset with foreachPartitionAsync performance needs to improve  (was: Query on large dataset with forEachPartitionAsync performance needs to improve)

> Query on large dataset with foreachPartitionAsync performance needs to improve
> ------------------------------------------------------------------------------
>
>                 Key: SPARK-33583
>                 URL: https://issues.apache.org/jira/browse/SPARK-33583
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 2.4.4
>         Environment: Spark 2.4.4
>                      Scala 2.11.10
>            Reporter: Miron
>            Priority: Major
>
> Repro steps:
> 1. Load 300 GB of data from a JSON file into a table. Note that the table
>    has a field ID that groups rows into reasonably sized sets, some 50,000
>    rows per set.
> 2. Issue a query against this table that returns a DataFrame instance.
> 3. Harvest the rows via df.rdd.foreachPartitionAsync.
> 4. As the first statement of the outer lambda (the one iterating over
>    partitions), place a logging line, say
>    "Line #1 ( some timestamp with milliseconds )".
> 5. In the nested lambda that reads rows, place a logging line that runs only
>    when the first row is accessed, say
>    "Line #2 ( some timestamp with milliseconds )".
> 6. Once the query has completed, take the difference in milliseconds between
>    the timestamps logged by line #1 and line #2.
>
> It would be fairly reasonable to assume that this difference should be as
> close to 0 as possible. In reality it is more than 1 second, usually more
> than 2. This really hurts query performance.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
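The timing measurement in the repro steps can be sketched without a Spark cluster using a plain Scala iterator: time the gap between entering a partition-style lambda ("Line #1") and materializing its first row ("Line #2"). The object name `FirstRowLatency` and the simulated 50 ms delay are illustrative assumptions, not part of the original report; in the real repro the iterator would be the one handed to `df.rdd.foreachPartitionAsync { rows => ... }`.

```scala
// Spark-free sketch of the measurement described in the repro steps.
// Names and delays here are illustrative, not from the original report.
object FirstRowLatency {

  // Consume the iterator the way the repro's lambda does and return the
  // millisecond gap between lambda entry and the first row being read.
  def measureGapMillis[A](rows: Iterator[A]): Long = {
    val t1 = System.currentTimeMillis() // "Line #1": lambda entered
    var t2 = t1
    var first = true
    rows.foreach { _ =>
      if (first) {
        t2 = System.currentTimeMillis() // "Line #2": first row accessed
        first = false
      }
      // ... per-row harvesting would go here ...
    }
    t2 - t1
  }

  def main(args: Array[String]): Unit = {
    // Simulate a source whose first row takes ~50 ms to produce, standing in
    // for the 1-2 s delay the report observes on the 300 GB dataset.
    // Iterator.tabulate is lazy, so the sleep runs only when the first
    // element is actually pulled inside measureGapMillis.
    val slowRows = Iterator.tabulate(3) { i =>
      if (i == 0) Thread.sleep(50)
      i
    }
    println(s"gap between Line #1 and Line #2: ${measureGapMillis(slowRows)} ms")
  }
}
```

On the reported dataset, the claim is that this gap is over a second even though the outer lambda has already started, i.e. the delay sits between partition scheduling and the first row becoming available.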