Hi All,
What is the best way to dynamically determine the partitions of a dataframe
before writing to disk?
1) statically determine them based on the data and use coalesce or repartition
while writing
2) somehow determine the record count for the entire dataframe and divide that
number to determine the partitions - ho
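For option 2, the arithmetic can be sketched as below. This is a minimal sketch, assuming you can afford a count() (or a size estimate); the 128 MB target and the per-row byte estimate are hypothetical knobs, not Spark defaults:

```python
import math

def target_partitions(row_count, est_row_bytes, target_partition_bytes=128 * 1024 * 1024):
    """Estimate how many output partitions keep each output file near the target size."""
    total_bytes = row_count * est_row_bytes
    return max(1, math.ceil(total_bytes / target_partition_bytes))

# e.g. 10M rows at ~200 bytes each is ~2 GB, so ~15 partitions of ~128 MB
n = target_partitions(10_000_000, 200)
# then, before writing (hypothetical dataframe and path):
# df.repartition(n).write.parquet(path)
```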
Hi Hari,
Thanks :) I tried to do it as you said. It works ;)
Hariharan wrote on Mon, May 20, 2019 at 3:54 PM:
> Hi Huizhe,
>
> You can set the "fs.defaultFS" field in core-site.xml to some path on s3.
> That way your spark job will use S3 for all operations that need HDFS.
> Intermediate data will still be store
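For reference, a minimal core-site.xml fragment along the lines Hariharan describes (the bucket name is a placeholder, and the exact scheme — s3, s3a, or s3n — depends on your Hadoop/EMRFS setup):

```xml
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <!-- hypothetical bucket; point the default filesystem at S3 -->
    <value>s3a://my-bucket</value>
  </property>
</configuration>
```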
Hello Spark community,
Please let me know if this is the appropriate place to ask this question – I
will happily move it if not. I haven't been able to find the answer through
the usual outlets.
I am currently implementing two custom readers for our projects (JMS / SQS) and
am experiencing a problem w
Hi,
I am getting the below error when running a sample streaming app. Does
anyone have a resolution for this?
JSON OFFSET {"test":{"0":0,"1":0,"2":0,"3":0,"4":0,"5":0}}
- Herreee
root
|-- key: string (nullable = true)
|-- value: string (nullable = true)
|-- topic: string (nullable = true)
|--
Hi
I finally got all working. Here are the steps (for information, I am on HDP
2.6.5):
- copy the old hive-site.xml into the new spark conf folder
- (optional?) download the jersey-bundle-1.8.jar and put it into the jars folder
- build a tar gz from all the jars and copy that archive to hdfs wit
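A sketch of those steps as shell commands (the paths are illustrative guesses for an HDP 2.6.5 layout, not verbatim from the poster; adjust to your installation):

```shell
# 1. copy the old hive-site.xml into the new Spark conf folder
cp /etc/hive/conf/hive-site.xml /usr/hdp/current/spark2-client/conf/

# 2. (optional?) place jersey-bundle-1.8.jar into the jars folder
cp jersey-bundle-1.8.jar /usr/hdp/current/spark2-client/jars/

# 3. build a tar.gz from all the jars and copy the archive to HDFS
tar -czf spark2-jars.tar.gz -C /usr/hdp/current/spark2-client/jars .
hdfs dfs -put spark2-jars.tar.gz /apps/spark2/
```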
Ok great. I understood the ideology, thanks.
--
Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/
-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org
Yes.
If the job fails repeatedly (4 times in this case), Spark assumes that
there is a problem in the Job and notifies the user. In exchange for this,
the engine can go on to serve other jobs with its available resources.
I would try the following until things improve:
1. Figure out what's wrong
Umm, I am not sure if I got this fully.
Is it a design decision not to have context.stop() right after
awaitTermination throws an exception?
So, the ideology is that if a task fails after n tries (default 4), Spark
should fail fast and let the user know? Is this correct?
As you mentioned there are
Correction: The Driver manages the Tasks, the resource manager serves up
resources to the Driver or Task.
On Tue, May 21, 2019 at 9:11 AM Jason Nerothin wrote:
> The behavior is a deliberate design decision by the Spark team.
>
> If Spark were to "fail fast", it would prevent the system from rec
The behavior is a deliberate design decision by the Spark team.
If Spark were to "fail fast", it would prevent the system from recovering
from many classes of errors that are in principle recoverable (for example
if two otherwise unrelated jobs cause a garbage collection spike on the
same node). C
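For context, the retry count discussed in this thread is controlled by spark.task.maxFailures (default 4: the number of times any particular task may fail before the job is aborted). A config sketch — the application class and jar names are placeholders:

```shell
# raise the task-failure tolerance from the default of 4 to 8
spark-submit \
  --conf spark.task.maxFailures=8 \
  --class com.example.MyApp myapp.jar
```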
When you cache a dataframe, you actually cache a logical plan. That's why
re-creating the dataframe doesn't work: Spark finds out the logical plan is
cached and picks the cached data.
You need to uncache the dataframe, or go back to the SQL way:
df.createTempView("abc")
spark.table("abc").cache()
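In the DataFrame API, uncaching means calling unpersist on the same reference whose plan was cached. A PySpark-flavored sketch, with df and the session assumed from context:

```python
# caching registers the logical plan, so a re-created dataframe with the
# same plan will resolve to the cached data
df.cache()

# drop the cached plan before expecting a fresh computation
df.unpersist()

# or, for the temp-view approach shown above:
# df.createTempView("abc")
# spark.table("abc").cache()
# spark.catalog.uncacheTable("abc")
```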
Hi,
Add writeStream.option("quoteMode", "NONE")
By default the Spark Dataset API assumes that all values MUST BE enclosed
in a quote character (default: ") when writing to CSV files.
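For completeness, a sketch of how that option slots into a hypothetical streaming query (resultDS and the paths are placeholders). Note that on some Spark versions the built-in CSV sink ignores quoteMode and quoting is instead disabled by clearing the "quote" option, so this is worth verifying on 2.2:

```python
(resultDS.writeStream
    .format("csv")
    .option("quoteMode", "NONE")                 # as suggested above
    .option("checkpointLocation", "/tmp/ckpt")   # placeholder path
    .start("/tmp/out"))                          # placeholder path
```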
Akshay Bhardwaj
+91-97111-33849
On Tue, May 21, 2019 at 5:34 PM 杨浩 wrote:
> We use struct streaming 2.2, when sink
We use Structured Streaming 2.2. When sinking as CSV, double quotes are
automatically added around a JSON string. For example, if an element is
>
> {"hello": "world"}
the resulting data in the filesystem will be
> "{\"hello\": \"world\"}"
How can we avoid the added quotes? We only want
> {"hello": "world"}
code like
> resultDS.
> writeStream.
>
Hi All,
I have simply added exception handling to my code in Scala and am now
getting a NoClassDefFoundError. Any leads would be appreciated.
Thanks
Kind Regards,
Sachit Murarka
Ok, I found the reason.
In my QueueStream example, I have a while(true) loop which keeps adding the
RDDs, and my awaitTermination call is after the while loop. Since the while
loop never exits, awaitTermination is never reached and the exceptions are
never reported.
The above was just the problem wit
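The bug pattern described above, as a rough sketch (ssc, rdd_queue, and make_rdd are assumed from the surrounding example, not real names from the repository):

```python
ssc.start()
while True:                       # the loop never exits...
    rdd_queue.append(make_rdd())  # keep feeding the queue stream
    time.sleep(1)
ssc.awaitTermination()            # ...so this line is never reached and
                                  # exceptions from the stream go unreported
```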
Just to add to my previous message.
I am using Spark 2.2.2 standalone cluster manager and deploying the jobs in
cluster mode.
Hi All,
I am a newbie to spark and have a quick question.
I am running Spark 2.3.2 on YARN using HDP 2.8.5 on EMR 5.19.0
Since EMR version is 5.19, dynamic allocation is set to true by default. I
haven't set the min and max executors but saw that by default it starts
with max( initial executors,
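For reference, these are the settings involved (real Spark configuration keys; the values are just examples). When dynamic allocation is enabled, the starting executor count is effectively max(spark.dynamicAllocation.initialExecutors, spark.dynamicAllocation.minExecutors, spark.executor.instances):

```shell
spark-submit \
  --conf spark.dynamicAllocation.enabled=true \
  --conf spark.dynamicAllocation.minExecutors=2 \
  --conf spark.dynamicAllocation.initialExecutors=4 \
  --conf spark.dynamicAllocation.maxExecutors=20 \
  --class com.example.MyApp myapp.jar   # class/jar are placeholders
```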
I was able to reproduce the problem.
In the repository below, I have 2 sample jobs. Both execute 1/0
(ArithmeticException) on the executor side, but in the case of the
NetworkWordCount job, awaitTermination throws the same exception (Job aborted
due to stage failure .) that I can see in the
Hi,
I am looking for a setup that would allow splitting a single Spark
processing job into 2 jobs (operational constraints) without wasting too much
time persisting the data between the two jobs during Spark
checkpoints/writes.
I have a config with a lot of RAM and I'm willing to configure a