antonkulaga commented on SPARK-28547:

>I bet there is room for improvement, but, ten thousand columns is just 
>inherently slow given how metadata, query plans, etc are handled.
>You'd at least need to help narrow down where the slow down is and why, and 
>even better if you can propose a class of fix. As it is I'd close this.

[~srowen] I am not a Spark developer, I am a Spark user, so I cannot say where
the bottleneck is. When I see that even super-simple tasks like describe, or a
simple transformation (like taking the log of each gene expression value),
fail, I report it as a performance problem. I am a bioinformatician, and most
of my work is about dealing with gene expression data (thousands of samples *
tens of thousands of genes), so this makes Spark unusable for me for most
use cases. If operations that take seconds in a pandas dataframe (without any
Spark involved) take many hours or freeze in a Spark dataframe, there is
something inherently wrong with how you handle the data in the Spark
dataframe, and it is something you should investigate for Spark 3.0.
If you want to narrow it down, can it be "make dataframe.describe work for a
15K * 15K dataframe and take less than 20 minutes to complete"?
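
A minimal sketch of the pandas baseline described above, assuming a synthetic
matrix in place of the real expression data. The helper names
(make_wide_frame, time_wide_ops), the gamma-distributed values, the use of
log1p as the log transform, and the small default sizes are all illustrative
assumptions, not part of the original report:

```python
import time

import numpy as np
import pandas as pd


def make_wide_frame(n_rows: int, n_cols: int, seed: int = 0) -> pd.DataFrame:
    """Build a synthetic samples-by-genes matrix of non-negative values.

    Hypothetical stand-in for a real expression table.
    """
    rng = np.random.default_rng(seed)
    data = rng.gamma(shape=2.0, scale=10.0, size=(n_rows, n_cols))
    columns = [f"gene_{i}" for i in range(n_cols)]
    return pd.DataFrame(data, columns=columns)


def time_wide_ops(n_rows: int, n_cols: int) -> dict:
    """Time describe() and an element-wise log transform on a wide frame."""
    df = make_wide_frame(n_rows, n_cols)

    start = time.perf_counter()
    summary = df.describe()  # count, mean, std, min, quartiles, max per column
    describe_seconds = time.perf_counter() - start

    start = time.perf_counter()
    logged = np.log1p(df)  # log(1 + x), defined for zero expression values
    log_seconds = time.perf_counter() - start

    return {
        "summary": summary,
        "logged": logged,
        "describe_seconds": describe_seconds,
        "log_seconds": log_seconds,
    }


if __name__ == "__main__":
    # Small sizes so the sketch runs quickly; raise them toward 15K * 15K
    # (memory permitting) to approach the scale discussed in this issue.
    result = time_wide_ops(500, 1000)
    print(f"describe: {result['describe_seconds']:.2f}s")
    print(f"log1p:    {result['log_seconds']:.2f}s")
```

Running the same two operations on a Spark dataframe of the same shape would
give a like-for-like comparison for the benchmark proposed above.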

> Make it work for wide (> 10K columns data)
> ------------------------------------------
>                 Key: SPARK-28547
>                 URL: https://issues.apache.org/jira/browse/SPARK-28547
>             Project: Spark
>          Issue Type: Improvement
>          Components: Spark Core
>    Affects Versions: 3.0.0
>         Environment: Ubuntu server, Spark 2.4.3 Scala with >64GB RAM per 
> node, 32 cores (tried different configurations of executors)
>            Reporter: antonkulaga
>            Priority: Critical
> Spark is super-slow for all wide data (when there are >15K columns and >15K
> rows). Most genomics/transcriptomics data is wide because the number of
> genes is usually >20K, and the number of samples as well. The very popular
> GTEX dataset is a good example (see for instance the RNA-Seq data at
> https://storage.googleapis.com/gtex_analysis_v7/rna_seq_data where gct is
> just a .tsv file with two comments in the beginning). Everything done on wide
> tables (even simple "describe" functions applied to all the gene columns)
> either takes hours or freezes (because of lost executors) irrespective of
> memory and number of cores, while the same operations work fast (minutes)
> and well with pure pandas (without any Spark involved).

