Re: Spark dataframe hdfs vs s3

Jörn Franke Fri, 29 May 2020 22:19:04 -0700

Maybe some network-optimized AWS instances with higher bandwidth would improve 
the situation.
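
The earlier suggestions quoted below (switch s3n URLs to s3a, persist after loading, or copy the data to HDFS once and work from there) could be sketched roughly as follows in PySpark. This is only a minimal sketch under assumptions: the helper names `to_s3a`, `load_and_persist` and `stage_to_hdfs` are hypothetical, the paths are placeholders, and it assumes the cluster already has the s3a connector and credentials configured.

```python
def to_s3a(path: str) -> str:
    """Rewrite an s3n:// URL to s3a://; leave every other path unchanged."""
    old_prefix = "s3n://"
    if path.startswith(old_prefix):
        return "s3a://" + path[len(old_prefix):]
    return path


def load_and_persist(spark, path: str):
    """Read Parquet once and keep it cached so repeated passes do not re-read S3."""
    # Imported lazily so the URL helper above works without PySpark installed.
    from pyspark import StorageLevel

    df = spark.read.parquet(to_s3a(path))
    df = df.persist(StorageLevel.MEMORY_AND_DISK)
    df.count()  # run an action so the cache is actually populated
    return df


def stage_to_hdfs(spark, s3_path: str, hdfs_path: str):
    """Copy the data set from S3 to HDFS once, then read it back from HDFS."""
    # Hypothetical one-off staging step; overwrites hdfs_path if it exists.
    spark.read.parquet(to_s3a(s3_path)).write.mode("overwrite").parquet(hdfs_path)
    return spark.read.parquet(hdfs_path)
```

With an existing SparkSession `spark`, something like `df = load_and_persist(spark, "s3n://some-bucket/some/prefix/")` (placeholder path) would read via s3a and cache the result before the ML algorithm runs. Whether this closes the gap to HDFS depends on the job, so the Spark UI is still the place to confirm where the time goes.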


> On 27.05.2020 at 19:51, Dark Crusader <relinquisheddra...@gmail.com> wrote:
> 
> 
> Hi Jörn,
> 
> Thanks for the reply. I will try to create an easier example to reproduce the 
> issue.
> 
> I will also try your suggestion to look into the UI. Can you guide me on what I 
> should be looking for?
> 
> I was already using the s3a protocol to compare the times.
> 
> My hunch is that multiple reads from S3 are required because of improper 
> caching of intermediate data, and maybe HDFS is doing a better job at this. 
> Does this make sense?
> 
> I would also like to add that we built an extra layer on top of S3, which might 
> be slowing things down even further.
> 
> Thanks for your help.
> 
>> On Wed, 27 May, 2020, 11:03 pm Jörn Franke, <jornfra...@gmail.com> wrote:
>> Have you looked in the Spark UI to see why this is the case?
>> Reading from S3 can take more time; it also depends on which S3 URL scheme 
>> you are using: s3a vs s3n vs s3.
>> 
>> It could help to persist the data in memory or on HDFS after some 
>> calculation. You can also initially load from S3, store on HDFS and work 
>> from there.
>> 
>> HDFS offers data locality for the tasks, i.e. the tasks start on the nodes 
>> where the data is. Depending on which S3 "protocol" you are using, you might 
>> also pay a bigger performance penalty.
>> 
>> Try s3a as a protocol (replace all s3n with s3a).
>> 
>> You can also use the s3 URL, but this requires a special bucket 
>> configuration and a dedicated empty bucket, and it lacks some 
>> interoperability with other AWS services.
>> 
>> Nevertheless, it could also be something else in the code. Can you post an 
>> example reproducing the issue?
>> 
>> > On 27.05.2020 at 18:18, Dark Crusader 
>> > <relinquisheddra...@gmail.com> wrote:
>> > 
>> > 
>> > Hi all,
>> > 
>> > I am reading data from hdfs in the form of parquet files (around 3 GB) and 
>> > running an algorithm from the spark ml library.
>> > 
>> > If I create the same spark dataframe by reading data from S3, the same 
>> > algorithm takes considerably more time.
>> > 
>> > I don't understand why this is happening. Is this a chance occurrence, or 
>> > are the Spark dataframes created differently?
>> > 
>> > I don't understand how the data store would affect the algorithm's 
>> > performance.
>> > 
>> > Any help would be appreciated. Thanks a lot.
