Re: Anyone has any experience using spark in the banking industry?

2017-01-18 Thread Georg Heiler
Have a look at Mesos together with Myriad, i.e. YARN on Mesos. kant kodali wrote on Wed., 18 Jan 2017 at 22:51: > Does anyone have any experience using Spark in the banking industry? I have > a couple of questions. > > 1. Most banks seem to care about the number of pending

is partitionBy of DataFrameWriter supported in 1.6.x?

2017-01-18 Thread Richard Xin
I found contradictions between the 1.6.0 and 2.1.x documentation. In http://spark.apache.org/docs/1.6.0/api/scala/index.html#org.apache.spark.sql.DataFrameWriter it says: "This is only applicable for Parquet at the moment." in
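
For reference, a minimal sketch of the call in question; in later releases the scaladoc broadens partitionBy to all built-in file-based sources rather than only Parquet. Paths and column names below are illustrative, not from the thread:

    // minimal sketch (Spark 2.x Scala); paths and column names are illustrative
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("partitionBy-example").getOrCreate()
    val df = spark.read.parquet("/input/events")   // assumes columns "year" and "month" exist

    df.write
      .partitionBy("year", "month")   // writes year=.../month=... subdirectories
      .parquet("/output/events")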

Re: Creating UUID using SparksSQL

2017-01-18 Thread Felix Cheung
spark.apache.org/docs/2.0.0/api/python/pyspark.sql.html#pyspark.sql.functions.monotonically_increasing_id ? From: Ninad Shringarpure
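
A minimal Scala sketch of that suggestion, plus a UUID-string alternative via a UDF; the UDF is an illustration rather than a built-in, and monotonically_increasing_id yields unique but non-consecutive values:

    // minimal sketch: unique (non-consecutive) id per row, and a UUID string via a UDF
    import org.apache.spark.sql.functions.{monotonically_increasing_id, udf}

    val withId = df.withColumn("row_id", monotonically_increasing_id())

    // hypothetical UUID column - note a non-deterministic UDF may be re-evaluated on recomputation
    val makeUuid = udf(() => java.util.UUID.randomUUID().toString)
    val withUuid = df.withColumn("uuid", makeUuid())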

Re: what does dapply actually do?

2017-01-18 Thread Felix Cheung
With Spark, the processing is performed lazily. This means nothing much really happens until you call an "action" - collect() is one example. Another way is to write the output in a distributed manner - see write.df() in R. With SparkR dapply(), passing the data from Spark to R to

Re: Spark #cores

2017-01-18 Thread Palash Gupta
Hi, I think I faced the same problem with Spark 2.1.0 when I tried to define the number of executors from SparkConf or the SparkSession builder in a standalone cluster. It always takes all available cores. There are three ways to do it (one of them sketched below): 1. Define spark.executor.cores in conf/spark-defaults.conf and
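
A minimal sketch of the builder-based variant on a standalone cluster; the values are illustrative, and the key point is that without spark.cores.max a standalone application takes every available core by default:

    // minimal sketch (standalone cluster, Spark 2.x); values are illustrative
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("core-limits")
      .config("spark.executor.cores", "1")   // cores per executor
      .config("spark.cores.max", "8")        // cap on total cores for the whole application
      .getOrCreate()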

Re: Spark #cores

2017-01-18 Thread Saliya Ekanayake
Thank you, Daniel and Yong! On Wed, Jan 18, 2017 at 4:56 PM, Daniel Siegmann < dsiegm...@securityscorecard.io> wrote: > I am not too familiar with Spark Standalone, so unfortunately I cannot > give you any definite answer. I do want to clarify something though. > > The properties

Re: Spark Job hangs when one of the nodemanager goes down

2017-01-18 Thread KumarP
Got a pointer from here - https://issues.apache.org/jira/browse/SPARK-17644

[SparkStreaming] SparkStreaming not allowing to do parallelize within a transform operation to generate a new RDD

2017-01-18 Thread Nipun Arora
Hi, I am trying to transform an RDD in a DStream by finding the log message with the maximum timestamp and adding a duplicate copy of it with some modifications. The following is the example code: JavaDStream logMessageWithHB = logMessageMatched.transform(new Function() {
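
For comparison, a minimal Scala sketch of the same idea: the extra record is built on the driver inside transform(), where parallelize is reachable through the RDD's SparkContext. extractTimestamp and withHeartbeat are hypothetical helpers standing in for the poster's logic:

    // minimal sketch (Scala DStream); extractTimestamp and withHeartbeat are hypothetical helpers
    val logMessageWithHB = logMessageMatched.transform { rdd =>
      if (rdd.isEmpty()) rdd
      else {
        val maxLine = rdd.max()(Ordering.by(extractTimestamp))           // record with the latest timestamp
        val extra   = rdd.sparkContext.parallelize(Seq(withHeartbeat(maxLine)))
        rdd.union(extra)                                                 // original records plus the modified copy
      }
    }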

Re: Spark #cores

2017-01-18 Thread Daniel Siegmann
I am not too familiar with Spark Standalone, so unfortunately I cannot give you any definite answer. I do want to clarify something though. The properties spark.sql.shuffle.partitions and spark.default.parallelism affect how your data is split up, which will determine the *total* number of tasks,
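
A minimal sketch of where those two settings live, assuming a SparkSession named spark; the values are illustrative, and they size the stages rather than grant cores:

    // illustrative values only - these split data into tasks, they do not allocate cores
    spark.conf.set("spark.sql.shuffle.partitions", "64")   // DataFrame/Dataset shuffles
    // spark.default.parallelism applies to RDD shuffles and is best set at submit time,
    // e.g. spark-submit --conf spark.default.parallelism=64 ...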

Anyone has any experience using spark in the banking industry?

2017-01-18 Thread kant kodali
Does anyone have any experience using Spark in the banking industry? I have a couple of questions. 1. Most banks seem to care about the number of pending transactions at any given time, and I wonder if this is processing time or event time? I am just trying to understand how this is normally done in the

Spark Job hangs when one of the nodemanager goes down

2017-01-18 Thread KumarP
While running the streaming job on a YARN cluster, one of the nodes was brought down for a firmware upgrade, so we expected Spark to either a) fail the job completely or b) fail that task, move it to another executor, and succeed. But what we noticed was c) the job hangs indefinitely. Stack trace:

Re: Spark vs MongoDB: saving DataFrame to db raises missing database name exception

2017-01-18 Thread Marco Mistroni
Thanks Palash, your suggestion put me on the right track. Reading works fine; however, it seems that when writing, as the SparkSession is not involved, the connector does not know where to write. I had to replace my writing code with this: MongoSpark.save(df.write.option("spark.mongodb.output.uri",
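
A minimal sketch of how that call might look in full; the URI, database, and collection are illustrative placeholders, not the poster's actual values:

    // minimal sketch (mongo-spark connector, Scala); URI/database/collection are placeholders
    import com.mongodb.spark.MongoSpark

    MongoSpark.save(
      df.write
        .option("spark.mongodb.output.uri", "mongodb://host:27017/mydb.mycollection")
        .mode("append"))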

SparkStreaming add Max Line of each RDD throwing an exception

2017-01-18 Thread Nipun Arora
Please note: I have asked the following question in stackoverflow as well http://stackoverflow.com/questions/41729451/adding-to-spark-streaming-dstream-rdd-the-max-line-of-each-rdd I am trying to add to each RDD in a JavaDStream the line with the maximum timestamp, with some modification.

Parsing RDF data with Spark

2017-01-18 Thread Md. Rezaul Karim
Hi All, Is there any way to parse Linked Data in RDF (.n3, .ttl, .nq, .nt) format with Spark? Kind regards, Reza
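
No answer appears in this digest; as a rough starting point, a minimal sketch under the assumption that the data is line-oriented N-Triples (.nt) and that a SparkSession named spark exists - Turtle/N3 generally need a real RDF parser (e.g. Apache Jena) rather than a line split:

    // minimal sketch: naive N-Triples parsing (assumes one triple per line, no multi-line literals)
    val triples = spark.sparkContext.textFile("/data/graph.nt")
      .map(_.trim)
      .filter(line => line.nonEmpty && !line.startsWith("#"))
      .map { line =>
        val parts = line.stripSuffix(".").trim.split("\\s+", 3)   // subject, predicate, object
        (parts(0), parts(1), parts(2))
      }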

Re: New runtime exception after switch to Spark 2.1.0

2017-01-18 Thread mhornbech
For anyone revisiting this at a later point, the issue was that Spark 2.1.0 upgrades netty to version 4.0.42 which is not binary compatible with version 4.0.37 used by version 3.1.0 of the Cassandra Java Driver. The newer version can work with Cassandra, but because of differences in the maven
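
If the resolution was to pin a single Netty version across the classpath, a build.sbt sketch of that approach might look like the following; the artifact and version are assumptions taken from the message above, not the poster's confirmed fix:

    // build.sbt sketch - force one Netty version for the whole classpath (artifact/version assumed)
    dependencyOverrides += "io.netty" % "netty-all" % "4.0.42.Final"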

Re: Spark #cores

2017-01-18 Thread Yong Zhang
Try it first, to see if it indeed changes the parallelism you want to control in the PageRank you are running. Start it with the # of cores you want to give to your job, increasing it when your job fails due to GC OOM. Yong From: Saliya Ekanayake

Re: Spark #cores

2017-01-18 Thread Saliya Ekanayake
So, I should be using spark.sql.shuffle.partitions to control the parallelism? Is there a guide on how to tune this? Thank you, Saliya On Wed, Jan 18, 2017 at 2:01 PM, Yong Zhang wrote: > spark.sql.shuffle.partitions is not only controlling of the Spark SQL, but >

Sorting each partitions and writing to CSVs

2017-01-18 Thread Ivan Gozali
Hello, I have a use case that seems relatively simple to solve using Spark, but can't seem to figure out a sure way to do this. I have a dataset which contains time series data for various users. All I'm looking to do is: - partition this dataset by user ID - sort the time series data for
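
No reply appears in this digest; a minimal sketch of one common approach in Spark 2.x, assuming a DataFrame df with userId and timestamp columns (names and output path are illustrative):

    // minimal sketch (Spark 2.x); column names and output path are illustrative
    import org.apache.spark.sql.functions.col

    df.repartition(col("userId"))                              // group each user's rows into one partition
      .sortWithinPartitions(col("userId"), col("timestamp"))   // sort the time series within each partition
      .write
      .partitionBy("userId")                                   // one output directory per user id
      .csv("/output/by-user")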

what does dapply actually do?

2017-01-18 Thread Xiao Liu1
Hi, I'm really new and trying to learn sparkR. I have defined a relatively complicated user-defined function, and use dapply() to apply the function on a SparkDataFrame. It was very fast. But I am not sure what has actually been done by dapply(). Because when I used collect() to see the output,

Re: Spark / Elasticsearch Error: Maybe ES was overloaded? How to throttle down Spark as it writes to ES

2017-01-18 Thread Russell Jurney
Thanks to you both, I'll check these things out. Russ On Wed, Jan 18, 2017 at 1:07 AM Shiva Ramagopal wrote: > Probably using a queue like RabbitMQ between Spark and ES could help - to > buffer the Spark output when ES can't keep up. > > Some links: > > 1. ES-RabbitMQ River

Creating UUID using SparksSQL

2017-01-18 Thread Ninad Shringarpure
Hi Team, Is there a standard way of generating a unique id for each row from Spark SQL? I am looking for functionality similar to UUID generation in Hive. Let me know if you need any additional information. Thanks, Ninad

Re: Spark #cores

2017-01-18 Thread Yong Zhang
spark.sql.shuffle.partitions controls not only Spark SQL, but also anything implemented on top of the Spark DataFrame API. If you are using the "spark.ml" package, most ML libraries in it are based on DataFrames. So you shouldn't use "spark.default.parallelism", instead of

Re: Spark #cores

2017-01-18 Thread Saliya Ekanayake
The Spark version I am using is 2.10. The language is Scala. This is running in standalone cluster mode. Each worker is able to use all physical CPU cores in the cluster as is the default case. I was using the following parameters to spark-submit --conf spark.executor.cores=1 --conf

Re: Spark #cores

2017-01-18 Thread Palash Gupta
Hi, Can you please share how you are assigning CPU cores, and tell us the Spark version and language you are using? //Palash Sent from Yahoo Mail on Android On Wed, 18 Jan, 2017 at 10:16 pm, Saliya Ekanayake wrote: Thank you for the quick response. No, this is not Spark SQL.

ApacheCon CFP closing soon (11 February)

2017-01-18 Thread Rich Bowen
Hello, fellow Apache enthusiast. Thanks for your participation, and interest in, the projects of the Apache Software Foundation. I wanted to remind you that the Call For Papers (CFP) for ApacheCon North America, and Apache: Big Data North America, closes in less than a month. If you've been

Re: Spark #cores

2017-01-18 Thread Saliya Ekanayake
Thank you for the quick response. No, this is not Spark SQL. I am running the built-in PageRank. On Wed, Jan 18, 2017 at 10:33 AM, wrote: > Are you talking here of Spark SQL? > > If yes, spark.sql.shuffle.partitions needs to be changed. > > > > *From:* Saliya

Triplet vertices differ from vertices

2017-01-18 Thread Lyndon Ollar
Hello all, I have a graph algorithm where I am implementing Louvain Modularity. The implementation has the same structure as PageRank.runWithOptions, in that it keeps the graph in a variable, iterates, and overwrites this var from time to time, and I have followed some of the techniques used there

Re: Assembly for Kafka >= 0.10.0, Spark 2.2.0, Scala 2.11

2017-01-18 Thread Cody Koeninger
Spark 2.2 hasn't been released yet, has it? Python support in Kafka DStreams for 0.10 is probably never coming; there's a JIRA ticket about this. As for "stable", it's hard to say. It was quite a few releases before 0.8 was marked stable, even though it underwent little change. On Wed, Jan 18, 2017 at 2:21 AM,

RE: Spark #cores

2017-01-18 Thread jasbir.sing
Are you talking here of Spark SQL? If yes, spark.sql.shuffle.partitions needs to be changed. From: Saliya Ekanayake [mailto:esal...@gmail.com] Sent: Wednesday, January 18, 2017 8:56 PM To: User Subject: Spark #cores Hi, I am running a Spark application setting the

Spark #cores

2017-01-18 Thread Saliya Ekanayake
Hi, I am running a Spark application setting the number of executor cores to 1 and a default parallelism of 32 over 8 physical nodes. The web UI shows it's running on 200 cores. I can't relate this number to the parameters I've used. How can I control the parallelism in a more deterministic way?

Do jobs fail because of other users of a cluster?

2017-01-18 Thread David Frese
Hello everybody, being quite new to Spark, I am struggling a lot with OutOfMemory exceptions and "GC overhead limit exceeded" failures of my jobs, submitted from a spark-shell with "--master yarn". Playing with --num-executors, --executor-memory and --executor-cores I occasionally get something

Re: Accumulators and Datasets

2017-01-18 Thread Sean Owen
Accumulators aren't related directly to RDDs or Datasets. They're a separate construct. You can imagine updating accumulators in any distributed operation that you see documented for RDDs or Datasets. On Wed, Jan 18, 2017 at 2:16 PM Hanna Mäki wrote: > Hi, > > The
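
A minimal sketch of that point - a LongAccumulator updated inside a Dataset map in Spark 2.x; the data, accumulator name, and local master are illustrative:

    // minimal sketch: an accumulator updated from a Dataset operation (Spark 2.x)
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("acc-example").master("local[*]").getOrCreate()
    import spark.implicits._

    val errors = spark.sparkContext.longAccumulator("parse-errors")
    val ds = Seq("ok", "error", "ok").toDS()
    val upper = ds.map { s =>
      if (s == "error") errors.add(1)   // updated on the executors
      s.toUpperCase
    }
    upper.count()              // an action triggers the computation
    println(errors.value)      // read on the driver, reliable only after the action completes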

Accumulators and Datasets

2017-01-18 Thread Hanna Mäki
Hi, The documentation (http://spark.apache.org/docs/latest/programming-guide.html#accumulators) describes how to use accumulators with RDDs, but I'm wondering if and how I can use accumulators with the Dataset API. BR, Hanna

Re: apache-spark doesn't work correktly with russian alphabet

2017-01-18 Thread Sergey B.
Try to get the encoding right. E.g., if you read from `csv` or other sources, specify the encoding, which is most probably `cp1251`: df = sqlContext.read.csv(filePath, encoding="cp1251") On the Linux CLI the encoding can be found with the `chardet` utility. On Wed, Jan 18, 2017 at 3:53 PM, AlexModestov
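
A minimal Scala equivalent of the same idea, assuming a CSV source, a Spark 2.x SparkSession named spark, and an illustrative file path:

    // minimal sketch (Spark 2.x, Scala); path is a placeholder
    val df = spark.read
      .option("encoding", "cp1251")   // charset of the input files
      .option("header", "true")
      .csv("/data/input.csv")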

apache-spark doesn't work correktly with russian alphabet

2017-01-18 Thread AlexModestov
I want to use Apache Spark for working with text data. There are some Russian symbols, but Apache Spark shows me strings which look like "...\u0413\u041e\u0420\u041e...". What should I do to correct them?

Re: Spark / Elasticsearch Error: Maybe ES was overloaded? How to throttle down Spark as it writes to ES

2017-01-18 Thread Shiva Ramagopal
Probably using a queue like RabbitMQ between Spark and ES could help - to buffer the Spark output when ES can't keep up. Some links: 1. ES-RabbitMQ River - https://github.com/elastic/elasticsearch-river-rabbitmq/blob/master/README.md 2. Using RabbitMQ with ELK -

Assembly for Kafka >= 0.10.0, Spark 2.2.0, Scala 2.11

2017-01-18 Thread Karamba
Hi, I am looking for an assembly for Spark 2.2.0 with Scala 2.11. I can't find one in MVN Repository. Moreover, "org.apache.spark" %% "spark-streaming-kafka-0-10_2.11" % "2.1.0" shows that even sbt does not find one: [error] (*:update) sbt.ResolveException: unresolved dependency:
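
One possible cause of the unresolved dependency (an assumption, since the thread does not resolve it here): with sbt, %% already appends the Scala-version suffix, so combining it with an artifact name that ends in _2.11 requests an artifact that does not exist. A build.sbt sketch:

    // build.sbt sketch - use either %% with the bare name, or % with the suffixed name
    libraryDependencies += "org.apache.spark" %% "spark-streaming-kafka-0-10" % "2.1.0"
    // equivalent: libraryDependencies += "org.apache.spark" % "spark-streaming-kafka-0-10_2.11" % "2.1.0"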