Re: Spark dataframe hdfs vs s3

Jörn Franke Wed, 27 May 2020 10:34:25 -0700

Have you checked in the Spark UI why this is the case?
Reading from S3 can take more time. It also depends on which S3 URL scheme you
are using: s3a vs. s3n vs. s3.


It could help to persist the data in memory or on HDFS after some calculation.
You can also initially load from S3, store the data on HDFS, and work from there.
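
A minimal PySpark sketch of that second option, assuming an existing
SparkSession and a working S3 connector. The helper name stage_on_hdfs and the
paths in the example call are made up for illustration:

```python
def stage_on_hdfs(spark, s3_path, hdfs_path):
    """Copy a Parquet dataset from S3 to HDFS once, then read the HDFS copy.

    `spark` is an existing SparkSession; both paths are placeholders.
    """
    # One-time transfer: read from S3 and write the same data to HDFS.
    spark.read.parquet(s3_path).write.mode("overwrite").parquet(hdfs_path)

    # Later stages read the HDFS copy, so tasks can be scheduled near the data.
    df = spark.read.parquet(hdfs_path)

    # Mark the DataFrame for caching, since an iterative ML algorithm
    # scans its input repeatedly.
    return df.persist()


# Example call (hypothetical bucket and directory):
# df = stage_on_hdfs(spark, "s3a://my-bucket/input/", "hdfs:///tmp/input_copy/")
```

Whether the extra copy pays off depends on how often the algorithm rereads the
data; for a single pass it may not.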

HDFS offers data locality for the tasks, i.e. the tasks start on the nodes where
the data is. Depending on which S3 "protocol" you are using, you might also
take an additional performance hit.

Try s3a as a protocol (replace all s3n with s3a).
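
If the paths are built in code, the replacement could be sketched like this.
The helper name to_s3a and the example bucket are hypothetical, and it assumes
the s3a connector (the hadoop-aws module) is available on the cluster:

```python
def to_s3a(path):
    """Rewrite an s3n:// path to s3a://; leave any other path untouched."""
    prefix = "s3n://"
    if path.startswith(prefix):
        # Swap only the scheme; bucket and key stay the same.
        return "s3a://" + path[len(prefix):]
    return path


# Example with a hypothetical bucket:
# df = spark.read.parquet(to_s3a("s3n://my-bucket/data/"))
```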

You can also use the s3 URL scheme, but this requires a special bucket
configuration and a dedicated empty bucket, and it lacks some interoperability
with other AWS services.

Nevertheless, it could also be something else in the code. Can you post an
example reproducing the issue?

> On 27.05.2020 at 18:18, Dark Crusader <relinquisheddra...@gmail.com> wrote:
> 
> 
> Hi all,
> 
> I am reading data from hdfs in the form of parquet files (around 3 GB) and 
> running an algorithm from the spark ml library.
> 
> If I create the same spark dataframe by reading data from S3, the same 
> algorithm takes considerably more time.
> 
> I don't understand why this is happening. Is this a chance occurrence or are
> the spark dataframes created differently?
> 
> I don't understand how the data store would affect the algorithm performance.
> 
> Any help would be appreciated. Thanks a lot.
