Run Spark on Java 10

2018-09-28 Thread Ben_W
*my use case*: We run a Spark cluster on Mesos. Since our Mesos cluster also hosts other frameworks such as Storm and Cassandra, we have had incidents where a Spark job over-utilized CPU, causing resource contention with the other frameworks. *objective*: run an un-modularized Spark application (jar is co
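One common way to keep a Spark-on-Mesos job from starving co-located frameworks is to cap the cores it may acquire. A minimal sketch, assuming illustrative values and a placeholder master URL (none of these specifics come from the thread):

```scala
// Sketch: capping CPU for a Spark app on Mesos. The master URL, app name,
// and numeric values below are placeholders, not from the original message.
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("capped-app")
  .setMaster("mesos://zk://mesos-master:2181/mesos") // placeholder master URL
  .set("spark.cores.max", "8")      // upper bound on total cores across executors
  .set("spark.executor.cores", "2") // cores per executor
val sc = new SparkContext(conf)
```

`spark.cores.max` bounds the total CPU the application can take from the Mesos offers, which is the usual lever for avoiding contention with other frameworks on the same cluster.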

Re: Text from pdf spark

2018-09-28 Thread Joel D
Yes, I can access the file using the cli. On Fri, Sep 28, 2018 at 1:24 PM kathleen li wrote: > The error message is “file not found” > Are you able to use the following command line to access the file with the > user you submitted the job? > hdfs dfs -ls /tmp/sample.pdf > > Sent from my iPhone > > O

Re: Text from pdf spark

2018-09-28 Thread kathleen li
The error message is “file not found”. Are you able to use the following command line to access the file with the user you submitted the job? hdfs dfs -ls /tmp/sample.pdf Sent from my iPhone > On Sep 28, 2018, at 12:10 PM, Joel D wrote: > > I'm trying to extract text from pdf files in hdfs usin

Text from pdf spark

2018-09-28 Thread Joel D
I'm trying to extract text from pdf files in hdfs using pdfBox. However it throws an error: "Exception in thread "main" org.apache.spark.SparkException: ... java.io.FileNotFoundException: /nnAlias:8020/tmp/sample.pdf (No such file or directory)" What am I missing? Should I be working with P
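A FileNotFoundException like this usually means the worker is trying to open the HDFS path as a local file. One way around it is to let Spark ship the raw bytes and hand them to PDFBox. A sketch, assuming PDFBox 2.x is on the classpath (the glob path is illustrative; `nnAlias:8020` is taken from the error message):

```scala
// Sketch: extract text from PDFs in HDFS. sc.binaryFiles delivers each file's
// bytes to the workers, so PDFBox never touches the local filesystem.
import org.apache.pdfbox.pdmodel.PDDocument
import org.apache.pdfbox.text.PDFTextStripper

val pdfTexts = sc.binaryFiles("hdfs://nnAlias:8020/tmp/*.pdf").map {
  case (path, stream) =>
    val doc = PDDocument.load(stream.toArray()) // parse the raw bytes
    try {
      (path, new PDFTextStripper().getText(doc))
    } finally {
      doc.close() // release PDFBox resources per file
    }
}
```

`binaryFiles` returns `(path, PortableDataStream)` pairs, which is the usual entry point for binary formats that Hadoop input formats don't split.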

Re: Need to convert Dataset to HashMap

2018-09-28 Thread rishmanisation
Thanks for the help so far. I tried caching but the operation seems to be taking forever. Any tips on how I can speed up this operation? Also, I am not sure a case class would work, since different files have different structures (I am parsing a 1GB file right now but there are a few different files

Re: [Spark SQL] why spark sql hash() are returns the same hash value though the keys/expr are not same

2018-09-28 Thread Thakrar, Jayesh
Not sure I get what you mean…. I ran the query that you had – and don’t get the same hash as you. From: Gokula Krishnan D Date: Friday, September 28, 2018 at 10:40 AM To: "Thakrar, Jayesh" Cc: user Subject: Re: [Spark SQL] why spark sql hash() are returns the same hash value though the keys/

Re: [Spark SQL] why spark sql hash() are returns the same hash value though the keys/expr are not same

2018-09-28 Thread Gokula Krishnan D
Hello Jayesh, I have masked the input values with . Thanks & Regards, Gokula Krishnan (Gokul) On Wed, Sep 26, 2018 at 2:20 PM Thakrar, Jayesh < jthak...@conversantmedia.com> wrote: > Cannot reproduce your situation. > > Can you share Spark version? > > > > Welcome to > >
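This is likely the whole explanation: Spark SQL's `hash()` is deterministic (Murmur3-based, seed 42), so once two distinct keys are masked to the same token, they hash identically. A standalone sketch using Scala's stdlib `MurmurHash3` (not Spark's exact implementation, but the same family) to show the determinism:

```scala
// Demonstration: a deterministic hash gives equal outputs for equal inputs.
// If masking maps two different keys to the same string, their hashes collide.
import scala.util.hashing.MurmurHash3

val a = MurmurHash3.stringHash("MASKED", 42) // "MASKED" is a hypothetical mask token
val b = MurmurHash3.stringHash("MASKED", 42)
val c = MurmurHash3.stringHash("other", 42)

println(a == b) // true: same input always yields the same hash
println(a == c) // different inputs generally yield different hashes
```

So to compare `hash()` outputs across expressions, the comparison has to run on the unmasked values.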

Spark checkpointing

2018-09-28 Thread katze maus
Hi, is there any way to read up on using Spark checkpointing (programmatically) in depth? I have an application where I perform multiple operations on a DStream. To my understanding, the result of those operations would create a new DStream, which can be used for further operations. W
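The canonical programmatic pattern for DStream checkpointing is to build the whole DStream graph inside a factory function and recover it with `StreamingContext.getOrCreate`. A sketch, with a placeholder checkpoint directory and batch interval:

```scala
// Sketch: metadata checkpointing plus driver recovery for a DStream app.
// Directory, app name, and interval are illustrative placeholders.
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val checkpointDir = "hdfs:///tmp/spark-checkpoints" // hypothetical path

def createContext(): StreamingContext = {
  val conf = new SparkConf().setAppName("checkpointed-app")
  val ssc = new StreamingContext(conf, Seconds(10))
  ssc.checkpoint(checkpointDir)
  // Define all DStream sources and operations here, before returning ssc:
  // a restarted driver will rebuild this exact graph from the checkpoint.
  ssc
}

val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
ssc.start()
ssc.awaitTermination()
```

The key point is that chained operations on a DStream all become part of one graph; checkpointing persists that graph (plus state for stateful operations), so the factory function must contain every operation, not just the source.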

How to repartition Spark DStream Kafka ConsumerRecord RDD.

2018-09-28 Thread Alchemist
I am getting uneven sizes of Kafka topics. We want to repartition the input RDD based on some logic. But when I try to apply the repartition I am getting "object not serializable (class: org.apache.kafka.clients.consumer.ConsumerRecor
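`ConsumerRecord` is not serializable, and `repartition` shuffles records between executors, which requires serialization. The usual workaround is to project out the serializable fields first. A sketch (the partition count is a placeholder, and `stream` is assumed to come from `KafkaUtils.createDirectStream` elsewhere):

```scala
// Sketch: extract key/value before repartitioning so only serializable
// data crosses the shuffle boundary.
import org.apache.kafka.clients.consumer.ConsumerRecord
import org.apache.spark.streaming.dstream.DStream

def rebalance(stream: DStream[ConsumerRecord[String, String]]): DStream[(String, String)] =
  stream
    .map(r => (r.key, r.value))            // drop the non-serializable wrapper
    .transform(rdd => rdd.repartition(32)) // placeholder partition count
```

An alternative, if the record metadata (offset, partition, timestamp) is needed downstream, is to map into a small serializable case class carrying exactly those fields.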

Re: Need to convert Dataset to HashMap

2018-09-28 Thread Alessandro Solimando
Hi, sorry, indeed you have to cache the dataset before the groupBy (otherwise it will be reloaded from disk each time). For the case class you can have a look at the accepted answer here: https://stackoverflow.com/questions/45017556/how-to-convert-a-simple-dataframe-to-a-dataset-spark-scala-with-

Re: Need to convert Dataset to HashMap

2018-09-28 Thread rishmanisation
Thanks for the response! I'm not sure caching 'freq' would make sense, since there are multiple columns in the file and so it will need to be different for different columns. The original data format is .gz (gzip). I am a newbie to Spark, so could you please give a little more detail on the appropri

Re: Need to convert Dataset to HashMap

2018-09-28 Thread Alessandro Solimando
Hi, as a first attempt I would try to cache "freq", to be sure that the dataset is not re-loaded at each iteration later on. Btw, what's the original data format you are importing from? I also suspect that an appropriate case class rather than Row would help, instead of converting to Stri
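Putting the thread's advice together, one way to build a frequency HashMap from a Dataset is to cache the source, aggregate with a typed groupBy, and collect the (small) result to the driver. A sketch with illustrative file, column, and schema names (the thread's actual schema is not shown):

```scala
// Sketch: cache -> typed aggregation -> collect to a driver-side Map.
// "data.txt" and the Record schema are hypothetical, for illustration only.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("freq-to-map").getOrCreate()
import spark.implicits._

case class Record(col: String) // typed row, instead of working with Row

val ds = spark.read.textFile("data.txt").map(Record(_)).cache() // cache before reuse

val freq: Map[String, Long] = ds
  .groupByKey(_.col) // typed grouping on the chosen column
  .count()           // Dataset[(String, Long)] of value -> frequency
  .collect()
  .toMap
```

Note that `collect().toMap` pulls the aggregated result onto the driver, so this only works when the number of distinct values is small; the heavy lifting (the per-value counting) still happens distributed.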