Your performance problem sounds like it is in the driver, which is trying to 
broadcast 10k files all by itself, and that becomes the bottleneck.
What you want is just to transform the data, per file, from AVRO format to another 
format. In MR, most likely each mapper processes one file, so you utilize the 
whole cluster, instead of just the driver.
Not sure exactly how to help you, but to do that in Spark:
1) Avoid the broadcast from the driver and let each Spark task process one file. 
Maybe use something like Hadoop's NLineInputFormat over a text file that lists 
all the filenames of your data, so each Spark task receives the HDFS location of 
one file and then runs the transform logic. In this case, you concurrently 
transform all your small files using all the available cores of your executors.
2) If the above sounds too complex, you need to find a way to stop the Spark 
driver from broadcasting the small files. That would be the more natural way to 
handle small files, but I cannot find a configuration that forces Spark to 
disable it.
Yong

From: daniel.ha...@veracity-group.com
Subject: Re: spark-avro takes a lot time to load thousands of files
Date: Tue, 22 Sep 2015 16:54:26 +0300
CC: user@spark.apache.org
To: jcove...@gmail.com

I agree, but it's a constraint I have to deal with. The idea is to load these files 
and merge them into ORC. When using Hive on Tez it takes less than a minute. 
Daniel
On 22 Sep 2015, at 16:00, Jonathan Coveney <jcove...@gmail.com> wrote:

having a file per record is pretty inefficient on almost any file system

El martes, 22 de septiembre de 2015, Daniel Haviv 
<daniel.ha...@veracity-group.com> escribió:
Hi, We are trying to load around 10k Avro files (each file holds only one 
record) using spark-avro, but it takes over 15 minutes to load. It seems that 
most of the work is being done at the driver, where it creates a broadcast 
variable for each file.
Any idea why it is behaving that way? Thank you. Daniel

                                          
