Re: HDFS or NFS as a cache?

2017-10-02 Thread Miguel Morales
ran [mailto:ste...@hortonworks.com] Sent: Saturday, September 30, 2017 6:10 AM To: JG Perrin Cc: Alexander Czech; user@spark.apache.org Subject: Re: HDFS or NFS as a cache? On 29 Sep 2017, at 20:03

Re: HDFS or NFS as a cache?

2017-10-02 Thread Marcelo Vanzin
Steve Loughran [mailto:ste...@hortonworks.com] Sent: Saturday, September 30, 2017 6:10 AM To: JG Perrin Cc: Alexander Czech; user@spark.apache.org Subject: Re: HDFS or NFS as a cache? On 29 Sep 2017, at 20:03, JG Perrin wrote: Y

RE: HDFS or NFS as a cache?

2017-10-02 Thread JG Perrin
ghran [mailto:ste...@hortonworks.com] Sent: Saturday, September 30, 2017 6:10 AM To: JG Perrin Cc: Alexander Czech; user@spark.apache.org Subject: Re: HDFS or NFS as a cache? On 29 Sep 2017, at 20:03, JG Perrin <mailto:jper...@lumeris.com> wrote: You will collect in the driver (often the mas

Re: HDFS or NFS as a cache?

2017-09-30 Thread Steve Loughran
something"; specifics aren't covered, but I assume it's DynamoDB-based. -Steve From: Alexander Czech [mailto:alexander.cz...@googlemail.com] Sent: Friday, September 29, 2017 8:15 AM To: user@spark.apache.org Subject: HDFS or NFS as a cache? I ha

Re: HDFS or NFS as a cache?

2017-09-30 Thread Steve Loughran
On 29 Sep 2017, at 15:59, Alexander Czech <mailto:alexander.cz...@googlemail.com> wrote: Yes, I have identified the rename as the problem, which is why I think the extra bandwidth of the larger instances might not help. There is also a consistency issue with S3 because of how the rename work

RE: HDFS or NFS as a cache?

2017-09-29 Thread JG Perrin
You will collect in the driver (often the master) and it will save the data, so for saving, you will not have to set up HDFS. From: Alexander Czech [mailto:alexander.cz...@googlemail.com] Sent: Friday, September 29, 2017 8:15 AM To: user@spark.apache.org Subject: HDFS or NFS as a cache? I have

Re: HDFS or NFS as a cache?

2017-09-29 Thread Alexander Czech
Yes, I have identified the rename as the problem, which is why I think the extra bandwidth of the larger instances might not help. There is also a consistency issue with S3 because of how the rename works, so I will probably lose data. On Fri, Sep 29, 2017 at 4:42 PM, Vadim Semenov wrote: > How

Re: HDFS or NFS as a cache?

2017-09-29 Thread Vadim Semenov
How many files do you produce? I believe it spends a lot of time renaming the files because of the output committer. Also, instead of 5x c3.2xlarge, try using 2x c3.8xlarge, because they have 10GbE and you can get good throughput to S3. On Fri, Sep 29, 2017 at 9:15 AM, Alexander Czech < alex

HDFS or NFS as a cache?

2017-09-29 Thread Alexander Czech
I have a small EC2 cluster with 5 c3.2xlarge nodes and I want to write Parquet files to S3. However, S3 performance is bad, for various reasons, when I write to S3 through the Parquet write method: df.write.parquet('s3a://bucket/parquet') Now I want to set up a small cache for the Parquet output. One o