On Thu, Sep 28, 2017 at 9:02 PM, Jörn Franke <jornfra...@gmail.com> wrote:
> It looks a little bit strange to me. First, json.gz files are single-threaded,
> i.e. each file can only be processed by one thread (so it is good to have many
> files of around 128 MB to 512 MB each).

Indeed. Unfortunately, the files I have to work with are quite a bit larger.

> Then what you do in the code is already done by the data source. There is no
> need to read the file directory and parallelize. Just provide the directory
> containing the files to the data source and Spark automatically takes care of
> reading them from different executors.

Very true. The motivation behind my contrived approach is that I need to
replicate the same file tree structure after filtering -- that does not
seem easy if I build one huge RDD from all the input files.
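Something like the sketch below is what I am after (untested; the paths,
the "src_file" column name, and the filter predicate are placeholders I
made up). input_file_name() tags each row with its origin file, so a single
big DataFrame could be written back partitioned by source -- though the
partition directories come out as encoded paths, so some renaming would
still be needed afterwards:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.{col, input_file_name}

    val spark = SparkSession.builder().appName("filter-json-tree").getOrCreate()

    // Read the whole directory in one call; Spark splits the files
    // across executors on its own. Paths here are hypothetical.
    val df = spark.read
      .json("s3a://bucket/in/")
      .withColumn("src_file", input_file_name()) // origin file of each row

    // Filter, then write one subdirectory per original file so the
    // output roughly mirrors the input tree.
    df.filter(col("status") === "ok") // placeholder predicate
      .write
      .partitionBy("src_file")
      .json("s3a://bucket/out/")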

> In order to improve write performance, check if you can store them in Avro
> (or Parquet or ORC) using their internal compression feature. Then you can
> even have many threads per file.

Indeed, 50% of my processing time is spent uploading the results to S3.
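For reference, the write I am now testing looks roughly like this (a sketch
along the lines you suggested; the codec and output path are my own choices,
not anything from this thread):

    // Parquet with snappy compression instead of json.gz: smaller uploads,
    // and the format is splittable, so later reads get many tasks per file.
    df.write
      .option("compression", "snappy")
      .parquet("s3a://bucket/out-parquet/")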

Thank you for your input.

Jeroen
