On Thu, Sep 28, 2017 at 9:02 PM, Jörn Franke <jornfra...@gmail.com> wrote:

> It looks to me a little bit strange. First, json.gz files are single-threaded,
> i.e. each file can only be processed by one thread (so it is good to have many
> files of around 128 MB to 512 MB each).

Indeed. Unfortunately, the files I have to work with are quite a bit larger.

> Then, what you do in the code is already done by the data source. There is no
> need to read the file directory and parallelize. Just provide the directory
> containing the files to the data source, and Spark automatically takes care of
> reading them from different executors.

Very true. The motivation behind my contrived idea is that I need to replicate
the same file tree structure after filtering -- that does not seem easy if I
build one huge RDD from all the input files.

> In order to improve write performance, check whether you can store them in
> Avro (or Parquet, or ORC) using their internal compression feature. Then you
> can even have many threads per file.

Indeed, 50% of my processing time is spent uploading the results to S3.

Thank you for your input.

Jeroen

---------------------------------------------------------------------
To unsubscribe e-mail: user-unsubscr...@spark.apache.org
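For what it's worth, the "replicate the same file tree structure after filtering" part can be sketched without Spark at all: walk the input tree, filter each json.gz file record by record, and write survivors to the same relative path under the output root. The sketch below is a minimal single-machine illustration only (it assumes JSON-Lines files, one record per line, and a hypothetical `keep` predicate) -- on a cluster you would do the same per-file mapping inside the tasks instead:

```python
import gzip
import json
import tempfile
from pathlib import Path

def filter_tree(src_root: Path, dst_root: Path, keep) -> None:
    """Mirror src_root under dst_root, keeping only records where keep() is true."""
    for src in src_root.rglob("*.json.gz"):
        dst = dst_root / src.relative_to(src_root)   # same relative path
        dst.parent.mkdir(parents=True, exist_ok=True)
        with gzip.open(src, "rt", encoding="utf-8") as fin, \
             gzip.open(dst, "wt", encoding="utf-8") as fout:
            for line in fin:                          # one JSON record per line
                record = json.loads(line)
                if keep(record):
                    fout.write(json.dumps(record) + "\n")

# Tiny demo on a temporary tree.
root = Path(tempfile.mkdtemp())
src, dst = root / "in", root / "out"
(src / "2017" / "09").mkdir(parents=True)
with gzip.open(src / "2017" / "09" / "events.json.gz", "wt", encoding="utf-8") as f:
    f.write('{"id": 1, "ok": true}\n{"id": 2, "ok": false}\n')

filter_tree(src, dst, keep=lambda r: r["ok"])

out_file = dst / "2017" / "09" / "events.json.gz"
with gzip.open(out_file, "rt", encoding="utf-8") as f:
    kept = [json.loads(line) for line in f]
print(len(kept))   # only the record with "ok": true survives
```

The key point is that the output path is derived from `src.relative_to(src_root)`, so the directory layout survives the filtering -- exactly what is lost when everything is merged into one RDD first.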