[
https://issues.apache.org/jira/browse/SPARK-28547?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16948907#comment-16948907
]
Sean R. Owen commented on SPARK-28547:
--------------------------------------
I agree, this is too open-ended. It's not clear whether it's a general problem
or specific to a usage pattern, a SQL query, a data type or distribution. Often
I find that use cases for "10000 columns" are use cases for "a big array-valued
column".
I bet there is room for improvement, but ten thousand columns is just
inherently slow given how metadata, query plans, etc. are handled.
You'd at least need to help narrow down where the slowdown is and why, and
even better if you can propose a class of fix. As it is, I'd close this.
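To make the "big array-valued column" idea concrete, here is a minimal sketch in plain Python (not Spark code, and not taken from the issue). It reshapes a wide tab-separated table into narrow records of id fields plus one list of values, keeping the column names once on the side. The function name `pack_wide_rows`, the `n_id_cols` parameter, and the sample data are all hypothetical and only for illustration. In Spark, the analogous layout would presumably be an id column plus a single array-typed column instead of thousands of scalar columns.

```python
import csv
import io


def pack_wide_rows(tsv_text, n_id_cols=1):
    """Reshape a wide TSV into (ids, values) records.

    The first `n_id_cols` fields of each row are kept as a tuple of ids;
    all remaining fields are parsed as floats and packed into one list.
    The names of the packed columns are returned once, separately.
    """
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t")
    header = next(reader)
    value_names = header[n_id_cols:]
    records = []
    for row in reader:
        ids = tuple(row[:n_id_cols])
        values = [float(x) for x in row[n_id_cols:]]
        records.append((ids, values))
    return value_names, records


# Made-up sample: two rows, three value columns.
sample = (
    "sample\tGENE_A\tGENE_B\tGENE_C\n"
    "s1\t0.5\t1.0\t2.0\n"
    "s2\t3.0\t0.0\t4.5\n"
)
names, records = pack_wide_rows(sample)
print(names)
print(records)
```

With this layout the schema stays at two fields no matter how many values each row carries, which is the property the comment is pointing at. Whether it actually helps depends on the queries being run.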
> Make it work for wide (> 10K columns data)
> ------------------------------------------
>
> Key: SPARK-28547
> URL: https://issues.apache.org/jira/browse/SPARK-28547
> Project: Spark
> Issue Type: Improvement
> Components: Spark Core
> Affects Versions: 3.0.0
> Environment: Ubuntu server, Spark 2.4.3 Scala with >64GB RAM per
> node, 32 cores (tried different configurations of executors)
> Reporter: antonkulaga
> Priority: Critical
>
> Spark is very slow on all wide data (more than 15K columns and more than 15K
> rows). Most genomic/transcriptomic data is wide because the number of genes
> is usually >20K, and so is the number of samples. The very popular GTEX
> dataset is a good example (see for instance the RNA-Seq data at
> https://storage.googleapis.com/gtex_analysis_v7/rna_seq_data where a gct file
> is just a .tsv file with two comment lines at the beginning). Everything done
> on wide tables (even a simple "describe" applied to all the gene columns)
> either takes hours or freezes (because of lost executors) irrespective of
> memory and number of cores. The same operations run fast (in minutes) and
> reliably with pure pandas (without any Spark involved).