*my use case*: We run a Spark cluster on Mesos. Since our Mesos cluster
also hosts other frameworks such as Storm and Cassandra, we have had incidents
where a Spark job over-utilized CPU, causing resource contention with the
other frameworks.
*objective*: run un-modularized Spark application (jar is co
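For the contention problem above, one common mitigation is to cap how much CPU a Spark-on-Mesos job may acquire. A minimal sketch using standard Spark configuration properties (the values are illustrative, not recommendations):

```
# spark-defaults.conf sketch -- values are illustrative
spark.cores.max        8       # hard cap on total cores the job may acquire
spark.mesos.coarse     true    # coarse-grained mode: resources reserved up front
spark.executor.cores   2       # cores per executor
```

The same properties can be passed per job via `spark-submit --conf`.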
Yes, I can access the file using cli.
On Fri, Sep 28, 2018 at 1:24 PM kathleen li wrote:
> The error message is “file not found”
> Are you able to use the following command line to access the file as the
> user who submitted the job?
> hdfs dfs -ls /tmp/sample.pdf
>
> Sent from my iPhone
>
The error message is “file not found”
Are you able to use the following command line to access the file as the user
who submitted the job?
hdfs dfs -ls /tmp/sample.pdf
Sent from my iPhone
> On Sep 28, 2018, at 12:10 PM, Joel D wrote:
>
> I'm trying to extract text from pdf files in hdfs using pdfBox.
I'm trying to extract text from pdf files in hdfs using pdfBox.
However it throws an error:
"Exception in thread "main" org.apache.spark.SparkException: ...
java.io.FileNotFoundException: /nnAlias:8020/tmp/sample.pdf
(No such file or directory)"
What am I missing? Should I be working with P
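A likely cause of the `FileNotFoundException` above is that PDFBox is being handed an HDFS URI as if it were a local `java.io.File`. A sketch of one workaround (assuming PDFBox 2.x; the glob path is illustrative): let Spark read the raw bytes with `binaryFiles` and give PDFBox an `InputStream` instead.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.pdfbox.pdmodel.PDDocument
import org.apache.pdfbox.text.PDFTextStripper

// PDFBox cannot open "hdfs://..." as a local file, which produces
// FileNotFoundException. Read the bytes through Spark instead and
// hand PDFBox an InputStream.
val spark = SparkSession.builder().appName("pdf-text").getOrCreate()
val sc = spark.sparkContext

val texts = sc.binaryFiles("hdfs:///tmp/*.pdf").map { case (path, stream) =>
  val in = stream.open()                     // DataInputStream over the HDFS file
  try {
    val doc = PDDocument.load(in)
    try (path, new PDFTextStripper().getText(doc))
    finally doc.close()
  } finally in.close()
}
```

Each element of `texts` is a `(path, extractedText)` pair, one per PDF file.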
Thanks for the help so far. I tried caching, but the operation seems to be
taking forever. Any tips on how I can speed it up?
Also, I am not sure a case class would work, since different files have
different structures (I am parsing a 1GB file right now but there are a few
different files
Not sure I get what you mean….
I ran the query that you had, and I don't get the same hash as you.
From: Gokula Krishnan D
Date: Friday, September 28, 2018 at 10:40 AM
To: "Thakrar, Jayesh"
Cc: user
Subject: Re: [Spark SQL] why spark sql hash() returns the same hash value
though the keys/
Hello Jayesh,
I have masked the input values with .
Thanks & Regards,
Gokula Krishnan* (Gokul)*
On Wed, Sep 26, 2018 at 2:20 PM Thakrar, Jayesh <
jthak...@conversantmedia.com> wrote:
> Cannot reproduce your situation.
>
> Can you share Spark version?
>
>
>
> Welcome to
>
>
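For anyone trying to reproduce the hash discussion above: Spark SQL's `hash()` function is deterministic (Murmur3 with a fixed seed), so the same literal input produces the same value on any machine running the same implementation. Differing results between two people therefore usually mean the inputs differ, e.g. in whitespace, case, or type. A minimal sketch (column name is illustrative):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.hash

val spark = SparkSession.builder()
  .appName("hash-demo").master("local[*]").getOrCreate()
import spark.implicits._

// hash() is deterministic: identical inputs always give identical values,
// while even a case difference ("ABC" vs "abc") gives different ones.
val df = Seq("ABC", "abc", "ABC").toDF("k")
  .select($"k", hash($"k").as("h"))
df.show()
```

Comparing the `h` column for the two `"ABC"` rows against the `"abc"` row makes the determinism visible.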
Hi,
is there any way to read up on using Spark checkpointing (programmatically) in an in-depth manner?
I have an application where I perform multiple operations on a DStream. To my understanding, the result of those operations creates a new DStream,
which can be used for further operations.
W
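For the checkpointing question above, the programmatic pattern the Spark Streaming docs describe is to build the DStream graph inside a creation function and recover the context with `StreamingContext.getOrCreate`. A sketch (the checkpoint path and batch interval are illustrative):

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Driver-recovery pattern: the whole DStream graph must be defined inside
// the creation function so it can be rebuilt from the checkpoint.
def createContext(): StreamingContext = {
  val conf = new SparkConf().setAppName("checkpointed-app")
  val ssc = new StreamingContext(conf, Seconds(10))
  ssc.checkpoint("hdfs:///checkpoints/app")   // illustrative path
  // define input DStreams and all derived operations here, before returning
  ssc
}

// On a clean start this calls createContext(); after a driver failure it
// rebuilds the context (and DStream graph) from the checkpoint directory.
val ssc = StreamingContext.getOrCreate("hdfs:///checkpoints/app", createContext _)
ssc.start()
ssc.awaitTermination()
```

Operations defined after `getOrCreate` returns are not recovered, which is why the graph must live inside the creation function.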
How do I repartition a Spark DStream Kafka ConsumerRecord RDD? I am getting
uneven sizes from the Kafka topic partitions. We want to repartition the input
RDD based on some logic.
But when I try to apply the repartition I get "object not serializable
(class: org.apache.kafka.clients.consumer.ConsumerRecord)"
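The serialization error above happens because `repartition` shuffles the records, and `ConsumerRecord` is not serializable. One common fix is to extract the needed fields into a serializable shape before the shuffle. A sketch, assuming a `DStream[ConsumerRecord[String, String]]` from `KafkaUtils.createDirectStream` (the helper name and key/value types are illustrative):

```scala
import org.apache.kafka.clients.consumer.ConsumerRecord
import org.apache.spark.streaming.dstream.DStream

// ConsumerRecord is not serializable, so convert to a serializable tuple
// *before* any shuffle such as repartition().
def rebalance(stream: DStream[ConsumerRecord[String, String]],
              numPartitions: Int): DStream[(String, String)] =
  stream
    .map(r => (r.key, r.value))            // serializable (key, value) pairs
    .transform(_.repartition(numPartitions))
```

After the `map`, any shuffle-based logic (custom partitioners, `repartition`, `groupByKey`) operates on plain tuples and serializes cleanly.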
Hi,
sorry, indeed you have to cache the dataset before the groupBy (otherwise
it will be re-loaded from disk each time).
For the case class you can have a look at the accepted answer here:
https://stackoverflow.com/questions/45017556/how-to-convert-a-simple-dataframe-to-a-dataset-spark-scala-with-
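A sketch of the two suggestions above combined, converting the untyped DataFrame of Rows into a typed Dataset via a case class and caching it before the groupBy (the file path, column names, and types are illustrative):

```scala
import org.apache.spark.sql.SparkSession

// Illustrative schema: the case class fields must match the column
// names and types of the source file.
case class Record(word: String, freq: Long)

val spark = SparkSession.builder().appName("typed-ds").getOrCreate()
import spark.implicits._

val ds = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv("hdfs:///tmp/data.csv")   // illustrative path
  .as[Record]                    // typed Dataset instead of DataFrame of Rows
  .cache()                       // cache before groupBy so the source is read once

val counts = ds.groupByKey(_.word).count()
```

The `.cache()` matters most with a non-splittable source like gzip, where every re-read decompresses the whole file again.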
Thanks for the response! I'm not sure caching 'freq' would make sense, since
there are multiple columns in the file and so it will need to be different
for different columns.
Original data format is .gz (gzip).
I am a newbie to Spark, so could you please give a little more details on
the appropri
Hi,
as a first attempt I would try to cache "freq", to be sure that the dataset
is not re-loaded at each iteration later on.
Btw, what's the original data format you are importing from?
I also suspect that an appropriate case class, rather than Row, would help as
well, instead of converting to String