Ruslan Dautkhanov created SPARK-21657:
-----------------------------------------
Summary: Spark has exponential time complexity to explode(array of
structs)
Key: SPARK-21657
URL: https://issues.apache.org/jira/browse/SPARK-21657
Project: Spark
Issue Type: Bug
Components: Spark Core, SQL
Affects Versions: 2.2.0, 2.1.1, 2.1.0, 2.0.0
Reporter: Ruslan Dautkhanov
Priority: Critical
It can take up to half a day to explode a modest-sized nested collection (0.5m)
on recent Xeon processors.
See the attached PySpark script, which reproduces this problem.
{code}
cached_df = sqlc.sql('select individ, hholdid, explode(amft) from ' +
                     table_name).cache()
# count() is an action on the DataFrame, so it forces the explode to run
print(cached_df.count())
{code}
This script generates a number of tables, each with the same total number of
records across all nested collections (see the `scaling` variable in the loops).
The `scaling` variable scales up the number of nested elements in each record and
scales down the number of records in the table by the same factor, so the total
number of records across all nested collections stays the same.
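
The scaling scheme described above can be sketched in plain Python. This is only an illustration, not the attached script: the `table_shapes` helper, the list of `scaling` values, and the reading of "0.5m" as 500,000 nested elements in total are all assumptions.

```python
# Hypothetical illustration of the scaling scheme; not the attached script.
TOTAL_ELEMENTS = 500000  # assumption: "0.5m" read as the total nested-element count


def table_shapes(total=TOTAL_ELEMENTS, scalings=(1, 10, 100, 1000)):
    """Return one (scaling, num_records, elements_per_record) tuple per table."""
    shapes = []
    for scaling in scalings:
        elements_per_record = scaling    # nested array grows with `scaling`
        num_records = total // scaling   # row count shrinks by the same factor
        shapes.append((scaling, num_records, elements_per_record))
    return shapes


for scaling, num_records, per_record in table_shapes():
    # The product stays constant, so every table explodes to the same row count.
    print("scaling=%d records=%d per_record=%d total=%d"
          % (scaling, num_records, per_record, num_records * per_record))
```

Because every table explodes to the same number of output rows, any difference in run time between tables comes from the size of the nested array per record rather than from the amount of output.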
Time grows exponentially (note the log-10 scale on the vertical axis).