[
https://issues.apache.org/jira/browse/SPARK-21657?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16245240#comment-16245240
]
Ruslan Dautkhanov commented on SPARK-21657:
-------------------------------------------
Thank you [~uzadude] - great investigative work.
It would be great if this patch could make it into the 2.3 release.
> Spark has exponential time complexity to explode(array of structs)
> ------------------------------------------------------------------
>
> Key: SPARK-21657
> URL: https://issues.apache.org/jira/browse/SPARK-21657
> Project: Spark
> Issue Type: Bug
> Components: Spark Core, SQL
> Affects Versions: 2.0.0, 2.1.0, 2.1.1, 2.2.0, 2.3.0
> Reporter: Ruslan Dautkhanov
> Labels: cache, caching, collections, nested_types, performance,
> pyspark, sparksql, sql
> Attachments: ExponentialTimeGrowth.PNG,
> nested-data-generator-and-test.py
>
>
> It can take up to half a day to explode a modest-sized nested collection
> (0.5m) on recent Xeon processors.
> See the attached pyspark script that reproduces this problem.
> {code}
> cached_df = sqlc.sql('select individ, hholdid, explode(amft) from ' +
> table_name).cache()
> # count() is an action, so it forces the explode to actually run
> print(cached_df.count())
> {code}
> This script generates a number of tables, with the same total number of
> records across all nested collections (see the `scaling` variable in the loops).
> The `scaling` variable scales up the number of nested elements in each record,
> but scales down the number of records in the table by the same factor, so the
> total number of nested records stays the same.
> Time grows exponentially (notice log-10 vertical axis scale):
> !ExponentialTimeGrowth.PNG!
> At a scaling of 50,000 (see the attached pyspark script), it took 7 hours to
> explode the nested collections (!) of 8k records.
> Beyond 1000 elements per nested collection, time grows exponentially.
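
The scaling scheme described in the quoted issue can be illustrated with a small pure-Python sketch. It is not the attached nested-data-generator-and-test.py: the total of 100000 nested elements, the helper names, and the `seq` field are assumptions chosen for illustration; only the column names `individ`, `hholdid`, and `amft` come from the query above. The point is that each table holds fewer records with larger nested arrays, while the exploded row count stays constant.

```python
# Hypothetical sketch of the benchmark's table shapes; no Spark required.
# TOTAL_ELEMENTS and all helper names are illustrative assumptions.
TOTAL_ELEMENTS = 100000


def table_shape(scaling, total_elements=TOTAL_ELEMENTS):
    """Return (num_records, elements_per_record) for one generated table."""
    # More nested elements per record means proportionally fewer records.
    return total_elements // scaling, scaling


def make_records(scaling, total_elements=TOTAL_ELEMENTS):
    """Yield records shaped like the table queried in the issue."""
    num_records, per_record = table_shape(scaling, total_elements)
    for rec_id in range(num_records):
        yield {
            "individ": rec_id,
            "hholdid": rec_id,
            # 'amft' is the array of structs that explode() would flatten.
            "amft": [{"seq": i} for i in range(per_record)],
        }


def exploded_row_count(records):
    """Number of rows explode(amft) would produce: one per nested element."""
    return sum(len(rec["amft"]) for rec in records)


if __name__ == "__main__":
    for scaling in (1, 10, 100, 1000):
        num_records, per_record = table_shape(scaling)
        rows = exploded_row_count(make_records(scaling))
        print(scaling, num_records, per_record, rows)
```

Because the exploded row count is identical for every table, any difference in run time between tables would come from the size of the nested arrays rather than from the volume of output rows.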