Re: NoNodeAvailableException (None of the configured nodes are available) error when trying to push data to Elastic from a Spark job

2017-02-03 Thread Anastasios Zouzias
> at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$28.apply(RDD.scala:902)
> at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$28.apply(RDD.scala:902)
> at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1916)
> at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1916)
> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
> at org.apache.spark.scheduler.Task.run(Task.scala:86)
> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)

--
Anastasios Zouzias <a...@zurich.ibm.com>
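
This error usually means the executors cannot reach any of the configured Elasticsearch nodes (wrong host/port, or a firewall between the Spark workers and the ES cluster), since every task opens its own connection. A minimal sketch of the write path, assuming the elasticsearch-spark connector and hypothetical host/index names; the key point is that es.nodes must be reachable from every executor, not only from the driver:

    import org.elasticsearch.spark._

    // Hypothetical settings; es.nodes must resolve from the executors.
    val cfg = Map(
      "es.nodes" -> "es-host-1,es-host-2",
      "es.port"  -> "9200"
    )
    rdd.saveToEs("myindex/mytype", cfg)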

Re: I am not sure why I am getting java.lang.NoClassDefFoundError

2017-02-17 Thread Anastasios Zouzias
> at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
> at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
> at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
> at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
> at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79)
> at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47)
> at org.apache.spark.scheduler.Task.run(Task.scala:86)
> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
> at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)

--
Anastasios Zouzias <a...@zurich.ibm.com>
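
A NoClassDefFoundError raised inside a task almost always means a class was on the compile-time classpath but never shipped to the executors. The usual fix is an assembly ("uber") jar, or declaring the dependency at submit time with --packages/--jars. A sketch of the sbt side with a placeholder artifact; the real dependency is whichever one owns the class named in the error:

    // build.sbt -- Spark itself stays "provided"; everything else gets
    // bundled into the assembly jar that spark-submit ships to executors.
    libraryDependencies ++= Seq(
      "org.apache.spark" %% "spark-core" % "2.0.2" % "provided",
      "com.example"      %% "some-runtime-dep" % "1.0.0"  // hypothetical
    )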

Re: Equally split an RDD partition into two partitions at the same node

2017-01-15 Thread Anastasios Zouzias
> Is there anyone who knows how to implement it, or any hints for it?
>
> Thanks in advance,
> Fei

--
Anastasios Zouzias <a...@zurich.ibm.com>

Re: Equally split an RDD partition into two partitions at the same node

2017-01-15 Thread Anastasios Zouzias
> If I just increase numPartitions to be twice as large, how does
> coalesce(numPartitions: Int, shuffle: Boolean = false) keep the data
> locality? Do I need to define my own Partitioner?
>
> Thanks,
> Fei
>
> On Sun, Jan 15, 2017 at 3:58 AM, Anastasios Zouzias <zouz...@gmail.com>
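
For context: without a shuffle, coalesce can only merge partitions, so asking it for more partitions than the RDD already has leaves the count unchanged. A sketch of the two shuffle-based ways to double the partition count; strict node locality is not preserved, because records get redistributed:

    // Option 1: repartition, which always shuffles.
    val doubled = rdd.repartition(rdd.getNumPartitions * 2)

    // Option 2: the same thing spelled via coalesce; shuffle = true is
    // mandatory here, otherwise the partition count cannot grow.
    val doubled2 = rdd.coalesce(rdd.getNumPartitions * 2, shuffle = true)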

Re: databricks spark-csv: linking coordinates are what?

2016-09-24 Thread Anastasios Zouzias
> …/quick-start.html#self-contained-applications
>
> The above URL does not give me enough information to link spark-csv with Spark.
>
> Question:
> How do I learn how to use the info in the Linking section of the README.md of
> https://github.com/databricks/spark-csv ?

--
Anastasios Zouzias
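
The Linking section only lists Maven coordinates (groupId:artifactId:version). In an sbt build they become a libraryDependencies entry, and at the shell they go to --packages. A sketch with the 1.5.0 coordinates; check the README for the current version:

    // build.sbt
    libraryDependencies += "com.databricks" %% "spark-csv" % "1.5.0"

The same coordinates work interactively: spark-shell --packages com.databricks:spark-csv_2.11:1.5.0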

Re: Large-scale matrix inverse in Spark

2016-09-27 Thread Anastasios Zouzias
>> http://apache-spark-user-list.1001560.n3.nabble.com/Large-scale-matrix-inverse-in-Spark-tp27796.html

--
Anastasios Zouzias <a...@zurich.ibm.com>

Re: Broadcast big dataset

2016-10-01 Thread Anastasios Zouzias
Hey,

Is the driver running out of memory? Try 8g for the driver memory. Speaking of which, how do you estimate that your broadcasted dataset is 500M?

Best,
Anastasios

On 29.09.2016 5:32 AM, "WangJianfei" wrote:
> First, thank you very much!
> My executor memory is…
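
For measuring the payload before broadcasting it, Spark ships a rough estimator; a small sketch, with a hypothetical lookup map as the broadcast value (the driver memory itself is set at submit time, e.g. via --driver-memory 8g):

    import org.apache.spark.util.SizeEstimator

    // Rough in-memory footprint of the object about to be broadcast;
    // the serialized wire size is usually smaller than this.
    val bytes = SizeEstimator.estimate(bigLookupMap)
    println(s"broadcast payload ~ ${bytes / (1024 * 1024)} MB")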

Re: Apache Spark or Spark-Cassandra-Connector doesn't look like it is reading multiple partitions in parallel.

2016-11-26 Thread Anastasios Zouzias
> …the rate of the Spark worker nodes using iftop, and it is about 2.2 KB/s
> (kilobytes per second), which is too low. That tells me it is not reading
> partitions in parallel, or at the very least it is not reading a good
> chunk of data, else it would be in MB/s. Any ideas on how to fix it?

--
Anastasios Zouzias <a...@zurich.ibm.com>
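
A quick first check is whether the connector created enough Spark partitions at all, since one partition means one task and no parallelism. A sketch against the DataStax spark-cassandra-connector with placeholder names; if the count is small, lowering the input split size (spark.cassandra.input.split.size_in_mb in the 1.x/2.0 line; the key name varies by version) yields more, smaller partitions:

    import com.datastax.spark.connector._

    val rdd = sc.cassandraTable("my_keyspace", "my_table")
    // Few partitions => few concurrent tasks, no matter how many executors.
    println(s"partitions = ${rdd.getNumPartitions}")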

Re: Apache Spark SQL is taking forever to count billion rows from Cassandra?

2016-11-24 Thread Anastasios Zouzias
> [Stage … (686 + 2) / 24686] // What are these numbers precisely?
>
> Both of these versions didn't work; Spark keeps running forever, and I
> have been waiting for more than 15 mins with no response. Any ideas on
> what could be wrong and how to fix this?
>
> I am using Spark 2.0.2 and spark-cassandra-connector_2.11-2.0.0-M3.jar.

--
Anastasios Zouzias <a...@zurich.ibm.com>
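
On the bracketed readout: that is Spark's console progress bar, showing completed + running tasks out of the stage total — here 686 finished, 2 running, 24686 overall — so the job is simply crawling through an enormous task count. For a plain row count, the connector can push the work down to Cassandra instead of streaming every row through Spark; a sketch with placeholder names:

    import com.datastax.spark.connector._

    // cassandraCount() lets Cassandra count server-side rather than
    // materializing a billion rows as Spark tasks.
    val n = sc.cassandraTable("my_keyspace", "big_table").cassandraCount()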

Re: Broadcast destroy

2017-01-02 Thread Anastasios Zouzias
> …they be automatically pruned?
>
> Thank you,
>
> Bryan Jeffrey

--
Anastasios Zouzias <a...@zurich.ibm.com>
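
For reference: broadcasts that merely go out of scope are eventually cleaned by Spark's ContextCleaner once the driver-side reference is garbage-collected, but releasing them explicitly is cheap and deterministic. A minimal sketch of the two explicit calls:

    val bc = sc.broadcast(lookupTable)
    // ... use bc.value inside tasks ...

    // Drops the copies cached on executors; the value is re-shipped
    // automatically if the broadcast is used again.
    bc.unpersist()

    // Releases all state on driver and executors; the handle must not
    // be used afterwards.
    bc.destroy()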

Re: Ingesting data in elasticsearch from hdfs using spark, cluster setup and usage

2016-12-23 Thread Anastasios Zouzias
> …thoughts about tuning.
>
> Regards,
> Rohit

--
Anastasios Zouzias <a...@zurich.ibm.com>
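
On the write side, the first knobs are usually elasticsearch-hadoop's bulk-request settings, which are passed per job. A sketch with illustrative values, not recommendations; remember that bulk sizing is per task, so the load on the ES cluster is roughly (tasks in flight) x (batch size):

    import org.elasticsearch.spark._

    val cfg = Map(
      "es.batch.size.entries" -> "1000",  // docs per bulk request
      "es.batch.size.bytes"   -> "1mb"    // or cap by payload size
    )
    rdd.saveToEs("logs/doc", cfg)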

Re: Spark SVD benchmark for dense matrices

2017-08-10 Thread Anastasios Zouzias
Hi Jose,

Just to note that in the Databricks blog they state that they compute the top-5 singular vectors, not all singular values/vectors. Computing all of them is much more computationally intensive.

Cheers,
Anastasios

On 09.08.2017 15:19, "Jose Francisco Saray Villamizar" <jsa...@gmail.com> wrote:
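
For comparison, the top-k computation in MLlib looks as follows; a sketch assuming the matrix rows already sit in an RDD[Vector]:

    import org.apache.spark.mllib.linalg.Vector
    import org.apache.spark.mllib.linalg.distributed.RowMatrix

    // Only the 5 largest singular triplets are computed; k drives the
    // cost, which is why "top-5" and "all singular values" are very
    // different benchmarks.
    val mat = new RowMatrix(rows)  // rows: RDD[Vector]
    val svd = mat.computeSVD(5, computeU = true)
    // svd.s: singular values; svd.U, svd.V: the corresponding vectors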

Re: Slow response on Solr Cloud with Spark

2017-07-19 Thread Anastasios Zouzias
> …missed a trick.
>
> regards,
> Imran

--
Anastasios Zouzias <a...@zurich.ibm.com>

Re: Best Practice for Enum in Spark SQL

2017-05-12 Thread Anastasios Zouzias
> …spark-shell, Spark SQL CLI, and Hive. My questions:
>
> 1) Should I store my Enum type as String, or store it as a numeric
> encoding (aka 1=Car, 2=SUV, 3=Wagon)?
>
> 2) If I choose String, is there any penalty in hard drive space or memory?
>
> Thank you!
>
> Mike

--
Anastasios Zouzias <a...@zurich.ibm.com>
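
One data point on question 2: with a columnar format such as Parquet, repeated strings are dictionary-encoded, so the on-disk penalty of a readable string enum over a numeric code is usually small. A sketch with hypothetical names:

    // Each distinct value ("Car", "SUV", "Wagon") is stored once per
    // column chunk thanks to Parquet's dictionary encoding.
    case class Vehicle(id: Long, kind: String)

    import spark.implicits._
    Seq(Vehicle(1L, "Car"), Vehicle(2L, "SUV"), Vehicle(3L, "Wagon"))
      .toDS()
      .write.parquet("/tmp/vehicles")  // hypothetical path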

Re: Can we access files on Cluster mode

2017-06-25 Thread Anastasios Zouzias
Just to note that in cluster mode the Spark driver might run on any node of the cluster, hence you need to make sure that the file exists on *all* nodes. Push the file to all nodes, or use the client deploy-mode.

Best,
Anastasios

On 24.06.2017 23:24, "Holden Karau" wrote:
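
A third option is letting spark-submit ship the file with --files; every node can then resolve its local copy through SparkFiles. A minimal sketch, assuming the job was submitted with --files /local/path/config.json:

    import org.apache.spark.SparkFiles

    // Absolute path of the shipped copy on whichever node this runs on.
    val localPath = SparkFiles.get("config.json")
    val contents = scala.io.Source.fromFile(localPath).mkString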

Re: KMeans Clustering is not Reproducible

2017-05-22 Thread Anastasios Zouzias
> …through the Spark source code, I guess the cause is the initialization
> method of KMeans, which in turn uses the `takeSample` method, which does
> not seem to be partition-agnostic.
>
> Is this behaviour expected? Is there anything I could do to achieve
> reproducible results?
>
> Best,
> Christoph

--
Anastasios Zouzias <a...@zurich.ibm.com>
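
Since takeSample draws per partition, fixing the seed alone is not enough; the input must also arrive with the same partitioning and ordering on every run. A sketch of both knobs together (DataFrame API, hypothetical column names, a "features" vector column assumed); this improves reproducibility but is my reading of the thread, not a guarantee:

    import org.apache.spark.sql.functions.col
    import org.apache.spark.ml.clustering.KMeans

    // Same seed + same deterministic layout => same initialization sample.
    val stable = data.repartition(64, col("id")).sortWithinPartitions("id")
    val model = new KMeans().setK(10).setSeed(42L).fit(stable)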

Re: compile error: No classtag available while calling RDD.zip()

2017-09-13 Thread Anastasios Zouzias
> Best regards,
> bluejoe

--
Anastasios Zouzias <a...@zurich.ibm.com>
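
For readers hitting the same compile error: RDD.zip takes an implicit ClassTag for the other RDD's element type, so generic code has to propagate one. A minimal sketch:

    import scala.reflect.ClassTag
    import org.apache.spark.rdd.RDD

    // Without the ClassTag context bounds, "No ClassTag available"
    // is exactly what the compiler reports at zip's call site.
    def zipped[A: ClassTag, B: ClassTag](left: RDD[A], right: RDD[B]): RDD[(A, B)] =
      left.zip(right)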

Re: ConcurrentModificationException using Kafka Direct Stream

2017-09-18 Thread Anastasios Zouzias
Hi,

I had a similar issue using 2.1.0, but not with Kafka. Updating to 2.1.1 solved my issue. Can you try with 2.1.1 as well and report back?

Best,
Anastasios

On 17.09.2017 16:48, "HARSH TAKKAR" wrote:
> Hi,
> I am using Spark 2.1.0 with Scala 2.11.8, and while iterating…
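
The version bump itself is a one-line build change; a sketch of the sbt side, assuming the kafka-0-10 integration (whichever Kafka artifact the project uses must move in lockstep with Spark):

    // build.sbt -- keep every Spark artifact on the same version.
    val sparkVersion = "2.1.1"
    libraryDependencies ++= Seq(
      "org.apache.spark" %% "spark-core" % sparkVersion % "provided",
      "org.apache.spark" %% "spark-streaming-kafka-0-10" % sparkVersion
    )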

Re: best spark spatial lib?

2017-10-10 Thread Anastasios Zouzias
> …spatial and logical operators can be combined.
>
> regards,
> Imran

--
Anastasios Zouzias <a...@zurich.ibm.com>

Re: Error - Spark reading from HDFS via dataframes - Java

2017-10-01 Thread Anastasios Zouzias
Hi,

Set the inferSchema option to true in spark-csv. You may also want to set the mode option. See the README below:

https://github.com/databricks/spark-csv/blob/master/README.md

Best,
Anastasios

On 01.10.2017 07:58, "Kanagha Kumar" wrote:
> Hi, I'm trying to read data…
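
Spelled out against the built-in CSV reader of Spark 2.x, which accepts the same options as the spark-csv package; the HDFS path is a placeholder:

    // inferSchema samples the data to pick column types; mode controls
    // what happens to malformed lines.
    val df = spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .option("mode", "DROPMALFORMED")  // or PERMISSIVE / FAILFAST
      .csv("hdfs:///data/input.csv")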

Re: Several Aggregations on a window function

2017-12-18 Thread Anastasios Zouzias
…PM, Julien CHAMP <jch...@tellmeplus.com> wrote:
> It seems interesting; however, Scalding seems to require being used
> outside of Spark?
>
> On Mon, 18 Dec 2017 at 17:15, Anastasios Zouzias <zouz...@gmail.com> wrote:
>
>> Hi Julien,
>>
>> I am…
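
For the underlying question: Spark SQL itself can evaluate several aggregations over one window definition without leaving Spark. A sketch with hypothetical column names; aggregates sharing a window spec are grouped into the same window operator:

    import org.apache.spark.sql.expressions.Window
    import org.apache.spark.sql.functions._

    val w = Window.partitionBy("id").orderBy("ts").rowsBetween(-10, 0)
    val out = df
      .withColumn("sum_v", sum(col("v")).over(w))
      .withColumn("max_v", max(col("v")).over(w))
      .withColumn("avg_v", avg(col("v")).over(w))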

Re: Fastest way to drop useless columns

2018-05-31 Thread Anastasios Zouzias
--
Anastasios Zouzias
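
On the question itself, one cheap approach — reading "useless" as columns with at most one distinct value, which is my interpretation rather than the thread's definition — is a single aggregation pass with approximate distinct counts, followed by one drop:

    import org.apache.spark.sql.functions._

    // One job over the data; exact countDistinct would work too but
    // costs more than the approximate version.
    val counts = df
      .select(df.columns.map(c => approx_count_distinct(col(c)).alias(c)): _*)
      .first()

    val useless = df.columns.filter(c => counts.getAs[Long](c) <= 1L)
    val slim = df.drop(useless: _*)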

Re: Dataframe from 1.5G json (non JSONL)

2018-06-05 Thread Anastasios Zouzias
> …hdf.show(2) or hdf.take(1) is stuck for 1.5 hrs and gives OOM.
>
> Try 3:
> Repartition it before performing an action -- gives OOM.
>
> Try 4:
> Read https://issues.apache.org/jira/browse/SPARK-20980 completely.
> val hdf = spark.read.option("multiLine", true)
>   .schema(sampleSchema).json("/user/tmp/hugedatafile")
> hdf.show(1) or hdf.take(1) gives OOM.
>
> Can anyone help me here?

--
Anastasios Zouzias

Re: Can spark handle this scenario?

2018-02-17 Thread Anastasios Zouzias
> …symbol:
>
> case class Symbol(symbol: String, sector: String)
> case class Tick(symbol: String, sector: String, open: Double, close: Double)
>
> // symbolDs is Dataset[Symbol], pullSymbolFromYahoo returns Dataset[Tick]
>
> symbolDs.map { k =>
>   pullSymbolFromYahoo(k.symbol, k.sector)
> }
>
> This statement cannot compile:
>
> Unable to find encoder for type stored in a Dataset. Primitive types
> (Int, String, etc) and Product types (case classes) are supported by
> importing spark.implicits._ Support for serializing other types will be
> added in future releases.
>
> My questions are:
>
> 1. As you can see, this scenario is not traditional dataset handling such
> as count, sql query... Instead, it is more like a UDF which applies a
> random operation on each record. Is Spark good at handling such a
> scenario?
>
> 2. Regarding the compilation error, any fix? I did not find a
> satisfactory solution online.
>
> Thanks for help!

>> Best Regards,
>> Ayan Guha

--
Anastasios Zouzias <a...@zurich.ibm.com>
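
On question 2: the error is not just a missing import here. pullSymbolFromYahoo returns a Dataset[Tick], so the map would produce a Dataset[Dataset[Tick]], and no encoder exists for a nested Dataset (Datasets cannot be nested). The usual reshape is to return a plain collection per record and flatMap; a sketch with a hypothetical fetcher:

    import org.apache.spark.sql.Dataset
    import spark.implicits._  // encoders for case classes

    // Hypothetical stand-in for pullSymbolFromYahoo that returns plain
    // Scala objects instead of a nested Dataset.
    def fetchTicks(symbol: String, sector: String): Seq[Tick] = ???

    val ticks: Dataset[Tick] =
      symbolDs.flatMap(k => fetchTicks(k.symbol, k.sector))

The case classes also need to live at the top level (not inside a method or REPL block) for the implicit encoders to resolve.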

Re: conflicting version question

2018-10-26 Thread Anastasios Zouzias
Hi Nathan,

You can try to shade the dependency version that you want to use. That said, shading is a tricky technique. Good luck.

https://softwareengineering.stackexchange.com/questions/297276/what-is-a-shaded-java-dependency

See also Elasticsearch's discussion on shading…
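
In sbt the same idea is expressed with sbt-assembly's shade rules; a sketch assuming the conflict is on Guava — swap in whichever package actually clashes:

    // build.sbt -- rewrite the conflicting package into a private
    // namespace inside the assembly jar, so Spark's copy no longer
    // collides with yours.
    assemblyShadeRules in assembly := Seq(
      ShadeRule.rename("com.google.common.**" -> "myshade.guava.@1").inAll
    )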

Re: Packaging kafka certificates in uber jar

2018-12-25 Thread Anastasios Zouzias
> …preferably the Spark one?

--
Anastasios Zouzias
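
One workaround that fits the uber-jar setup (my usual pattern, not an official recipe): Kafka's SSL options want filesystem paths, so extract the truststore from the jar's resources to a temp file wherever the consumer is created, and point ssl.truststore.location at that path:

    import java.io.File
    import java.nio.file.{Files, StandardCopyOption}

    // Copies a classpath resource packed in the uber jar to a local
    // temp file and returns its absolute path.
    def materialize(resource: String): String = {
      val in = getClass.getResourceAsStream(resource)  // e.g. "/kafka.truststore.jks"
      val tmp = File.createTempFile("truststore", ".jks")
      tmp.deleteOnExit()
      Files.copy(in, tmp.toPath, StandardCopyOption.REPLACE_EXISTING)
      tmp.getAbsolutePath
    }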

Re: Handling of watermark in structured streaming

2019-05-14 Thread Anastasios Zouzias
> …correct? And is there any way of bringing "the real time" into the
> calculation of the watermark (short of producing regular dummy messages
> which are then again filtered out)?
>
> CU, Joe

--
Anastasios Zouzias

Re: Spark Structured Streaming | Highly reliable de-duplication strategy

2019-05-01 Thread Anastasios Zouzias
>> …to a KV store for storing checksums, in the case of unwanted failures.
>> How does that guarantee exactly-once with restarts?
>>
>> Any suggestions are highly appreciated.
>>
>> Akshay Bhardwaj

--
Anastasios Zouzias
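
For the common case, Structured Streaming has a built-in alternative to a hand-rolled KV store: watermark-bounded dropDuplicates, whose de-duplication state lives in the checkpoint and therefore survives restarts. A sketch with hypothetical column names:

    // State for keys older than the watermark is evicted, so storage
    // stays bounded; duplicates within the 15-minute window are dropped.
    val deduped = stream
      .withWatermark("eventTime", "15 minutes")
      .dropDuplicates("messageId", "eventTime")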

[Structured Streaming] Robust watermarking calculation with future timestamps

2019-11-13 Thread Anastasios Zouzias
Hi all,

We currently have the following issue with a Spark Structured Streaming (SS) application. The application reads messages from thousands of source systems, stores them in Kafka, and Spark aggregates them using SS and watermarking (15 minutes). The root problem is that a few of the source…
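
Given the subject — a few sources stamping events in the future and dragging the watermark ahead of everyone else — one defensive sketch (not necessarily the solution the thread settled on) is to discard future timestamps before the watermark is computed:

    import org.apache.spark.sql.functions._

    // Events stamped later than "now" can no longer advance the
    // watermark past well-behaved sources; they are dropped up front.
    val guarded = events
      .filter(col("eventTime") <= current_timestamp())
      .withWatermark("eventTime", "15 minutes")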

Re: Looping through a series of telephone numbers

2023-04-02 Thread Anastasios Zouzias
…,
Anastasios Zouzias

On Sat, Apr 1, 2023 at 8:31 PM Philippe de Rochambeau wrote:
> Hello,
> I’m looking for an efficient way in Spark to search for a series of
> telephone numbers, contained in a CSV file, in a data set column.
>
> In pseudo code,
>
> for tel in [tel1, tel2, …, tel40,
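
Spelling out the two usual answers as a sketch (column names and values hypothetical): for a few dozen numbers an isin filter is enough, and for larger lists the numbers become their own DataFrame for a broadcast join:

    import org.apache.spark.sql.functions._
    import spark.implicits._  // assumes a SparkSession named spark

    val tels = Seq("0601020304", "0605060708")  // hypothetical values
    val hits = df.filter(col("telephone").isin(tels: _*))

    // Broadcast-join variant, better once the list stops being tiny.
    val telsDf = tels.toDF("telephone")
    val hits2 = df.join(broadcast(telsDf), Seq("telephone"))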