Daniel
Can you elaborate on why you are using a broadcast variable to concatenate
many Avro files into a single ORC file? Look at wholeTextFiles on
SparkContext.
SparkContext.wholeTextFiles lets you read a directory containing multiple
small text files, and returns each of them as (filename,
I agree, but it's a constraint I have to deal with.
The idea is to load these files and merge them into ORC.
When using Hive on Tez, it takes less than a minute.
Daniel
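
For illustration, the load-and-merge job described above might be sketched in PySpark roughly as follows. This is a minimal sketch under stated assumptions, not the code used in this thread: the function name, its parameters, and the paths are hypothetical, and the "avro" format string assumes a recent Spark with the spark-avro module on the classpath (the 2015-era package was registered as "com.databricks.spark.avro"). The point is that all files are read in one distributed load through a glob, so no broadcast variable or driver-side loop is needed.

```python
def merge_avro_to_orc(spark, src_glob, dest_dir, num_output_files=1):
    """Read many small Avro files in one distributed load and write ORC.

    `spark` is expected to be a SparkSession (assumption: the spark-avro
    data source is available under the short name "avro").
    """
    # A glob lets the executors list and read the files in parallel,
    # instead of the driver loading each file and broadcasting it.
    df = spark.read.format("avro").load(src_glob)

    # coalesce() reduces the number of partitions, and therefore the
    # number of ORC part files written under dest_dir.
    merged = df.coalesce(num_output_files)

    # Write the merged data as ORC; the default save mode fails if
    # dest_dir already exists.
    merged.write.orc(dest_dir)
    return merged
```

It could be called as merge_avro_to_orc(spark, "/data/in/*.avro", "/data/out/orc"), where spark is an existing SparkSession and both paths are placeholders.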
> On 22 Sep 2015, at 16:00, Jonathan Coveney wrote:
>
> having a file per record is pretty inefficient on
Your performance problem sounds like it is in the driver, which is trying to
broadcast 10k files all by itself and so becomes the bottleneck.
What you want is just to convert the data, file by file, from Avro format to
another format. In MR, most likely each mapper processes one file, and you utilized the
Having a file per record is pretty inefficient on almost any file system.
On Tuesday, 22 September 2015, Daniel Haviv <
daniel.ha...@veracity-group.com> wrote:
> Hi,
> We are trying to load around 10k avro files (each file holds only one
> record) using spark-avro but it takes over 15