[ 
https://issues.apache.org/jira/browse/SPARK-28547?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16897194#comment-16897194
 ] 

antonkulaga commented on SPARK-28547:
-------------------------------------

[~maropu] I think I was quite clear: even describe is extremely slow. So the 
easiest way to reproduce is just to run describe on all numeric columns of 
the GTEX dataset. 
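
For anyone without the GTEX files at hand, a rough sketch of a stand-in is below. It is only an illustration: the helper names (write_wide_gct, describe_columns), the column names, and the contents of the two leading lines are assumptions, not taken from the real GTEX files. It writes a synthetic wide .tsv with two leading lines and computes describe-style statistics per column in plain Python as a baseline.

```python
import csv
import random
import statistics


def write_wide_gct(path, n_rows, n_cols, seed=0):
    """Write a synthetic GCT-like file: two leading lines, then a wide TSV.

    The contents of the two leading lines are assumed for illustration.
    """
    rng = random.Random(seed)
    with open(path, "w", newline="") as f:
        f.write("#1.2\n")                  # first leading line (assumed)
        f.write(f"{n_rows}\t{n_cols}\n")   # second leading line (assumed)
        writer = csv.writer(f, delimiter="\t", lineterminator="\n")
        # Header: one label column followed by n_cols numeric columns.
        writer.writerow(["Name"] + [f"col_{i}" for i in range(n_cols)])
        for r in range(n_rows):
            values = [round(rng.random() * 100, 3) for _ in range(n_cols)]
            writer.writerow([f"row_{r}"] + values)


def describe_columns(path, skip_lines=2):
    """Return count/mean/stddev/min/max for every numeric column."""
    with open(path, newline="") as f:
        for _ in range(skip_lines):        # skip the two leading lines
            next(f)
        reader = csv.reader(f, delimiter="\t")
        header = next(reader)
        columns = [[] for _ in header[1:]]  # first column is a row label
        for row in reader:
            for values, cell in zip(columns, row[1:]):
                values.append(float(cell))
    return {
        name: {
            "count": len(values),
            "mean": statistics.mean(values),
            # Sample standard deviation needs at least two values.
            "stddev": statistics.stdev(values) if len(values) > 1 else float("nan"),
            "min": min(values),
            "max": max(values),
        }
        for name, values in zip(header[1:], columns)
    }
```

Raising n_rows and n_cols toward the sizes mentioned in the issue gives a file on which describe in Spark can be timed against this baseline.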

> Make it work for wide (> 10K columns data)
> ------------------------------------------
>
>                 Key: SPARK-28547
>                 URL: https://issues.apache.org/jira/browse/SPARK-28547
>             Project: Spark
>          Issue Type: Improvement
>          Components: Spark Core
>    Affects Versions: 2.4.4, 2.4.3
>         Environment: Ubuntu server, Spark 2.4.3 Scala with >64GB RAM per 
> node, 32 cores (tried different configurations of executors)
>            Reporter: antonkulaga
>            Priority: Critical
>
> Spark is extremely slow for all wide data (when there are >15K columns and 
> >15K rows). Most genomics/transcriptomics data is wide because the number of 
> genes is usually >20K, and the number of samples as well. The very popular 
> GTEX dataset is a good example (see for instance the RNA-Seq data at 
> https://storage.googleapis.com/gtex_analysis_v7/rna_seq_data where gct is 
> just a .tsv file with two comment lines at the beginning). Everything done on 
> wide tables (even a simple "describe" applied to all the gene columns) 
> either takes hours or freezes (because of lost executors), irrespective of 
> memory and number of cores. The same operations run fast (in minutes) and 
> work well with pure pandas (without any Spark involved).



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

