[ https://issues.apache.org/jira/browse/SPARK-28547?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16894839#comment-16894839 ]
Takeshi Yamamuro commented on SPARK-28547:
------------------------------------------
You need to ask on the dev mailing list first to narrow down the issue. We can
do nothing based on the current description.
> Make it work for wide (> 10K columns data)
> ------------------------------------------
>
> Key: SPARK-28547
> URL: https://issues.apache.org/jira/browse/SPARK-28547
> Project: Spark
> Issue Type: Improvement
> Components: Spark Core
> Affects Versions: 2.4.4, 2.4.3
> Environment: Ubuntu server, Spark 2.4.3 Scala with >64GB RAM per
> node, 32 cores (tried different configurations of executors)
> Reporter: antonkulaga
> Priority: Critical
>
> Spark is very slow on all wide data (more than 15K columns and more than 15K
> rows). Most genomics/transcriptomics data is wide because the number of
> genes is usually >20K, and the number of samples as well. The very popular GTEx
> dataset is a good example (see, for instance, the RNA-Seq data at
> https://storage.googleapis.com/gtex_analysis_v7/rna_seq_data where gct is
> just a .tsv file with two comments at the beginning). Everything done on wide
> tables (even a simple "describe" function applied to all the gene columns)
> either takes hours or freezes (because of lost executors), irrespective of
> memory and number of cores, while the same operations run fast (in minutes)
> and well with pure pandas (without any Spark involved).
--
This message was sent by Atlassian JIRA
(v7.6.14#76016)