I think you can collect the results in the driver through the toLocalIterator
method of RDD and save the result from the driver program, rather than
writing it to a file on the local disk and collecting it separately. If
your data is small enough and you have enough cores/memory, try
processing it on the driver.
On 10 Aug 2017, at 09:51, Hemanth Gudela wrote:
Yeah, installing HDFS in our environment is unfortunately going to take a lot of
time (approvals/planning etc.). I will have to live with the local FS for now.
The other option I had already tried is collect(), sending everything to the driver
node. But my data volume is too large for the driver node to handle.
Also, why are you trying to write results locally if you're not using a
distributed file system? Spark is geared towards writing to a distributed file
system. I would suggest trying collect() so the data is sent to the master,
and then doing a write if the result set isn't too big, or
Yes, I have tried with file:/// and the full path, as well as just the full path
without the file:/// prefix.
Spark session has been closed, no luck though ☹
Regards,
Hemanth
From: Femi Anthony
Date: Thursday, 10 August 2017 at 11.06
To: Hemanth Gudela
Is your filePath prefaced with file:/// and the full path, or is it relative?
You might also try calling close() on the Spark context or session at the end of
the program execution to try and ensure that cleanup is completed.
Sent from my iPhone
> On Aug 10, 2017, at 3:58 AM, Hemanth Gudela
Thanks for reply Femi!
I’m writing the file like this -->
myDataFrame.write.mode("overwrite").csv("myFilePath")
There are absolutely no errors/warnings after the write.
The _SUCCESS file is created on the master node, but the problem of _temporary
directories is noticed only on worker nodes.
I know
Normally the *_temporary* directory gets deleted as part of the cleanup
when the write is complete and a _SUCCESS file is created. I suspect that
the writes are not being properly completed. How are you specifying the write?
Any error messages in the logs?
On Thu, Aug 10, 2017 at 3:17 AM, Hemanth Gudela wrote:
Hi,
I’m running Spark in cluster mode with 4 nodes, and trying to write CSV
files to each node’s local path (not HDFS).
I’m using spark.write.csv to write the CSV files.
On master node:
spark.write.csv creates a folder with the CSV file name and writes many files with
the part-r-000n suffix. This is okay for