[ https://issues.apache.org/jira/browse/SPARK-33583?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Miron updated SPARK-33583:
--------------------------
    Summary: Query on large dataset with foreachPartitionAsync performance needs to improve  (was: Query on large dataset with forEachPartitionAsync performance needs to improve)

> Query on large dataset with foreachPartitionAsync performance needs to improve
> ------------------------------------------------------------------------------
>
>                 Key: SPARK-33583
>                 URL: https://issues.apache.org/jira/browse/SPARK-33583
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 2.4.4
>         Environment: Spark 2.4.4
>                      Scala 2.11.10
>            Reporter: Miron
>            Priority: Major
>
> Repro steps:
> 1. Load 300 GB of data from a JSON file into a table. Note that the table
>    has a field ID that groups rows into reasonably sized sets, some 50,000
>    rows per set.
> 2. Issue a query against this table that returns a DataFrame instance.
> 3. Harvest the rows via df.rdd.foreachPartitionAsync.
> 4. As the first statement of the outer lambda (the one iterating over
>    partitions), place a logging line, say
>    "Line #1 ( some timestamp with milliseconds )".
> 5. In the nested lambda that reads rows, place a logging line that runs only
>    when the first row is accessed, say
>    "Line #2 ( some timestamp with milliseconds )".
> 6. Once the query has completed, take the difference in milliseconds between
>    the timestamps logged by line #1 and line #2.
>
> It would be fairly reasonable to assume that this difference should be as
> close to 0 as possible. In reality it is more than 1 second, usually more
> than 2. This really hurts query performance.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
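The timing measurement in the repro steps can be sketched without a Spark cluster using a plain Scala iterator: time the gap between entering a partition-style lambda ("Line #1") and materializing its first row ("Line #2"). The object name `FirstRowLatency` and the simulated 50 ms delay are illustrative assumptions, not part of the original report; in the real repro the iterator would be the one handed to `df.rdd.foreachPartitionAsync { rows => ... }`.

```scala
// Spark-free sketch of the measurement described in the repro steps.
// Names and delays here are illustrative, not from the original report.
object FirstRowLatency {

  // Consume the iterator the way the repro's lambda does and return the
  // millisecond gap between lambda entry and the first row being read.
  def measureGapMillis[A](rows: Iterator[A]): Long = {
    val t1 = System.currentTimeMillis() // "Line #1": lambda entered
    var t2 = t1
    var first = true
    rows.foreach { _ =>
      if (first) {
        t2 = System.currentTimeMillis() // "Line #2": first row accessed
        first = false
      }
      // ... per-row harvesting would go here ...
    }
    t2 - t1
  }

  def main(args: Array[String]): Unit = {
    // Simulate a source whose first row takes ~50 ms to produce, standing in
    // for the 1-2 s delay the report observes on the 300 GB dataset.
    // Iterator.tabulate is lazy, so the sleep runs only when the first
    // element is actually pulled inside measureGapMillis.
    val slowRows = Iterator.tabulate(3) { i =>
      if (i == 0) Thread.sleep(50)
      i
    }
    println(s"gap between Line #1 and Line #2: ${measureGapMillis(slowRows)} ms")
  }
}
```

On the reported dataset, the claim is that this gap is over a second even though the outer lambda has already started, i.e. the delay sits between partition scheduling and the first row becoming available.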