Re: Graphframe Error

2016-07-05 Thread Felix Cheung
This could be the workaround: http://stackoverflow.com/a/36419857 On Tue, Jul 5, 2016 at 5:37 AM -0700, "Arun Patel" > wrote: Thanks Yanbo and Felix. I tried these commands on CDH Quickstart VM and also on "Spark 1.6 pre-built for

Re: spark local dir to HDFS ?

2016-07-05 Thread sri hari kali charan Tummala
Thanks, makes sense. Can anyone answer the question below? http://apache-spark-user-list.1001560.n3.nabble.com/spark-parquet-too-many-small-files-td27264.html Thanks Sri On Tue, Jul 5, 2016 at 8:15 PM, Saisai Shao wrote: > It does not work to configure local dirs on

Re: spark local dir to HDFS ?

2016-07-05 Thread Saisai Shao
It does not work to configure local dirs on HDFS. Local dirs are mainly used for data spill and shuffle data persistence, so it is not suitable to use HDFS. If you hit a capacity problem, you can configure multiple dirs located on different mounted disks. On Wed, Jul 6, 2016 at 9:05 AM, Sri
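As a minimal sketch of this suggestion, assuming two locally mounted disks at /data1 and /data2 (hypothetical paths): spark.local.dir accepts a comma-separated list of local directories.

import org.apache.spark.{SparkConf, SparkContext}
// spill and shuffle files are spread across the listed local directories
val conf = new SparkConf()
  .setAppName("LocalDirsExample")
  .set("spark.local.dir", "/data1/spark_tmp,/data2/spark_tmp")
val sc = new SparkContext(conf)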

Re: spark local dir to HDFS ?

2016-07-05 Thread Sri
Hi, it's a space issue. We are currently using /tmp and at the moment we don't have any mounted location set up yet. Thanks Sri Sent from my iPhone > On 5 Jul 2016, at 17:22, Jeff Zhang wrote: > > Any reason why you want to set this on HDFS? > >> On Tue, Jul 5, 2016 at 3:47

Re: spark local dir to HDFS ?

2016-07-05 Thread Jeff Zhang
Any reason why you want to set this on HDFS? On Tue, Jul 5, 2016 at 3:47 PM, kali.tumm...@gmail.com < kali.tumm...@gmail.com> wrote: > Hi All, > > can I set spark.local.dir to an HDFS location instead of the /tmp folder? > > I tried setting the temp folder to HDFS but it didn't work; can >

RE: Spark MLlib: MultilayerPerceptronClassifier error?

2016-07-05 Thread Shiryaev, Mikhail
Hi Alexander, I used the same example from the MLP user guide but in Java. I modified the example a little bit and there was a nasty bug of mine that I hadn't noticed (an inconsistency between the layers and the real feature count). After fixing that, MLP works in my test. So it was my oversight,

spark local dir to HDFS ?

2016-07-05 Thread kali.tumm...@gmail.com
Hi All, can I set spark.local.dir to an HDFS location instead of the /tmp folder? I tried setting the temp folder to HDFS but it didn't work. Can spark.local.dir write to HDFS? .set("spark.local.dir","hdfs://namednode/spark_tmp/") 16/07/05 15:35:47 ERROR DiskBlockManager: Failed to create local

Re: Bootstrap Action to Install Spark 2.0 on EMR?

2016-07-05 Thread Holden Karau
Just to be clear, Spark 2.0 isn't released yet; there is a preview version for developers to explore and test compatibility with. That being said, Roy Hasson has a blog post discussing using Spark 2.0-preview with EMR -

Working of Streaming Kmeans

2016-07-05 Thread Holden Karau
Hi Biplob, The current Streaming KMeans code only updates the model with data that comes in through training (e.g. trainOn); predictOn does not update the model. Cheers, Holden :) P.S. Traffic on the list might have been a bit slower right now because of the Canada Day and 4th of July weekends, respectively.
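A minimal Scala sketch of this behaviour, assuming trainingData and testData are DStreams of MLlib Vectors (hypothetical names); only trainOn updates the cluster centers:

import org.apache.spark.mllib.clustering.StreamingKMeans
val model = new StreamingKMeans()
  .setK(3)                    // number of clusters (example value)
  .setDecayFactor(1.0)
  .setRandomCenters(2, 0.0)   // 2-dimensional data, zero initial weight
model.trainOn(trainingData)          // updates the model as batches arrive
model.predictOn(testData).print()    // predicts only; leaves the model unchanged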

SnappyData and Structured Streaming

2016-07-05 Thread Benjamin Kim
I recently got a sales email from SnappyData, and after reading the documentation about what they offer, it sounds very similar to what Structured Streaming will offer w/o the underlying in-memory, spill-to-disk, CRUD compliant data storage in SnappyData. I was wondering if Structured Streaming

Re: Spark application doesn't scale to worker nodes

2016-07-05 Thread Mich Talebzadeh
Well, that is what the OP stated: "I have a spark cluster consisting of 4 nodes in a standalone mode"... HTH Dr Mich Talebzadeh LinkedIn * https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

RE: Spark MLlib: MultilayerPerceptronClassifier error?

2016-07-05 Thread Ulanov, Alexander
Hi Mikhail, I have followed the MLP user guide and used the dataset and network configuration you mentioned. MLP was trained without any issues with the default parameters, that is, a block size of 128 and 100 iterations. Source code: scala> import
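For reference, a sketch along the lines of the Spark 1.6 MLP user-guide example (the data path and layer sizes below are the guide's example values, not specific to this thread):

import org.apache.spark.ml.classification.MultilayerPerceptronClassifier
// layers: input features, two hidden layers, output classes
val layers = Array[Int](4, 5, 4, 3)
val trainer = new MultilayerPerceptronClassifier()
  .setLayers(layers)
  .setBlockSize(128)   // default block size
  .setMaxIter(100)     // default max iterations
  .setSeed(1234L)
val data = sqlContext.read.format("libsvm")
  .load("data/mllib/sample_multiclass_classification_data.txt")
val model = trainer.fit(data)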

Re: Spark application doesn't scale to worker nodes

2016-07-05 Thread Michael Segel
Did the OP say he was running a standalone Spark cluster, or on YARN? > On Jul 5, 2016, at 10:22 AM, Mich Talebzadeh > wrote: > > Hi Jakub, > > Any reason why you are running in standalone mode, given that you are > familiar with YARN? > > In theory your

Re: Spark Dataframe validating column names

2016-07-05 Thread Scott W
Hi, yes I tried that. However, I also want to "pin down" the specific event containing invalid characters in the column names (per the parquet spec) and drop it from the df before converting it to parquet. Where I'm having trouble is that my dataframe might have events with different sets of fields,

Re: Spark streaming. Strict discretizing by time

2016-07-05 Thread Cody Koeninger
Test by producing messages into Kafka at a rate comparable to what you expect in production. Test with backpressure turned on; it doesn't require you to specify a fixed limit on the number of messages and will do its best to maintain batch timing. Or you could empirically determine a reasonable

Re: Spark streaming. Strict discretizing by time

2016-07-05 Thread rss rss
Hi, thanks. I know about the possibility of limiting the number of messages. But the problem is I don't know how many messages the system is able to process; it depends on the data. The example is very simple. I need a strict response after a specified time, something like soft real time. In the case of Flink

Re: remove row from data frame

2016-07-05 Thread nihed mbarek
Hi, do multiple filters to keep the data that you need. Regards, On Tue, Jul 5, 2016 at 5:38 PM, pseudo oduesp wrote: > Hi, > how can I remove rows from a data frame based on a condition on some > columns? > thanks > -- M'BAREK Med Nihed, Fedora Ambassador,
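A minimal sketch of that suggestion, assuming a DataFrame df with hypothetical columns age and name:

// "removing" rows is just filtering out the ones you don't want
val cleaned = df.filter(df("age") > 18 && df("name").isNotNull)
// or equivalently with a SQL expression string
val cleaned2 = df.filter("age > 18 AND name IS NOT NULL")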

remove row from data frame

2016-07-05 Thread pseudo oduesp
Hi, how can I remove rows from a data frame based on a condition on some columns? thanks

Re: Spark application doesn't scale to worker nodes

2016-07-05 Thread Mathieu Longtin
From experience, here are the kinds of things that cause the driver to run out of memory: - Way too many partitions (1 and up) - Something like this: data = load_large_data() rdd = sc.parallelize(data) - Any call to rdd.collect() or rdd.take(N) where the resulting data is bigger than driver

Re: Spark application doesn't scale to worker nodes

2016-07-05 Thread Mich Talebzadeh
Hi Jakub, Any reason why you are running in standalone mode, given that you are familiar with YARN? In theory your settings are correct. I checked your environment tab settings and they look correct. I assume you have checked this link http://spark.apache.org/docs/latest/spark-standalone.html

Re: Spark application doesn't scale to worker nodes

2016-07-05 Thread Jakub Stransky
So now that we have clarified that everything is submitted in standalone cluster mode, what is left is that the application (ML pipeline) doesn't take advantage of the full cluster's power but essentially runs just on the master node until resources are exhausted. Why doesn't training an ML Decision Tree scale to the rest

Re: Spark streaming. Strict discretizing by time

2016-07-05 Thread Cody Koeninger
If you're talking about limiting the number of messages per batch to try and keep from exceeding the batch time, see http://spark.apache.org/docs/latest/configuration.html and look for backpressure and maxRatePerPartition. But if you're only seeing zeros after your job runs for a minute, it sounds like
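A hedged example of those settings for the direct Kafka stream (values are placeholders, not recommendations):

import org.apache.spark.SparkConf
val conf = new SparkConf()
  .set("spark.streaming.backpressure.enabled", "true")
  // upper bound per Kafka partition, in messages per second, for the direct stream
  .set("spark.streaming.kafka.maxRatePerPartition", "1000")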

Spark streaming. Strict discretizing by time

2016-07-05 Thread rss rss
Hello, I'm trying to organize processing of messages from Kafka, and there is a typical case when the number of messages in Kafka's queue is more than the Spark app is able to process. But I need a strong time limit to prepare a result for at least a part of the data. Code example:

Re: Read Kafka topic in a Spark batch job

2016-07-05 Thread Cody Koeninger
If it's a batch job, don't use a stream. You have to store the offsets reliably somewhere regardless. So it sounds like your only issue is with identifying offsets per partition? Look at KafkaCluster.scala, methods getEarliestLeaderOffsets / getLatestLeaderOffsets. On Tue, Jul 5, 2016 at 7:40
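A rough sketch of that batch approach with the 0.8 direct API; the broker list, topic and offsets below are assumptions for illustration:

import kafka.serializer.StringDecoder
import org.apache.spark.streaming.kafka.{KafkaUtils, OffsetRange}
val kafkaParams = Map("metadata.broker.list" -> "broker1:9092")
// one OffsetRange per topic-partition: (topic, partition, fromOffset, untilOffset)
val offsetRanges = Array(
  OffsetRange("my-topic", 0, 0L, 1000L),
  OffsetRange("my-topic", 1, 0L, 1000L))
val rdd = KafkaUtils.createRDD[String, String, StringDecoder, StringDecoder](
  sc, kafkaParams, offsetRanges)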

Re: Standalone mode resource allocation questions

2016-07-05 Thread Jacek Laskowski
On Tue, Jul 5, 2016 at 4:18 PM, Jakub Stransky wrote: > 1) Is it possible to configure multiple executors per worker machine? Yes. > Do I understand it correctly that I specify SPARK_WORKER_MEMORY and > SPARK_WORKER_CORES which essentially describes available resources
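A hedged spark-env.sh sketch of those worker-level settings (numbers are examples only); per-executor sizing is then requested by the application, and in recent versions (1.4+, if I recall) a standalone worker can run more than one executor of the same app when spark.executor.cores is set below the worker's core count:

# spark-env.sh on each worker machine (example values)
export SPARK_WORKER_CORES=8       # cores this worker may hand out
export SPARK_WORKER_MEMORY=24g    # memory this worker may hand out
# on the application side (example values)
# --conf spark.executor.cores=2 --conf spark.executor.memory=6g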

Standalone mode resource allocation questions

2016-07-05 Thread Jakub Stransky
Hello, I went through the Spark documentation and several posts from Cloudera etc., and as my background is heavily in Hadoop/YARN, there is still a little confusion. Could someone more experienced clarify please? What I am trying to achieve: - Running a cluster in standalone mode, version 1.6.1

Having issues of passing properties to Spark in 1.5 in comparison to 1.2

2016-07-05 Thread Nkechi Achara
After using Spark 1.2 for quite a long time, I have realised that you can no longer pass Spark configuration to the driver via --conf on the command line (or in my case a shell script). I am thinking about using system properties and picking the config up using the following bit of code: def
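A minimal sketch of that system-property approach, with a hypothetical property name my.app.env (passed e.g. as a -D JVM option):

import org.apache.spark.SparkConf
// read -Dmy.app.env=... from the driver JVM's system properties
def propOrDefault(key: String, default: String): String =
  sys.props.getOrElse(key, default)
val env = propOrDefault("my.app.env", "dev")
val conf = new SparkConf().setAppName(s"myApp-$env")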

Re: Spark Dataframe validating column names

2016-07-05 Thread Jacek Laskowski
Hi, What do you think of using df.columns (or df.schema) to get the column names and process them appropriately? Regards, Jacek Laskowski https://medium.com/@jaceklaskowski/ Mastering Apache Spark http://bit.ly/mastering-apache-spark Follow me at https://twitter.com/jaceklaskowski On Tue,
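A hedged sketch of that idea for the parquet use case, assuming the character set below is what the parquet spec rejects in column names:

// characters parquet rejects in column names (assumed set)
val invalid = " ,;{}()\n\t=".toSet
val badCols = df.columns.filter(_.exists(invalid.contains))
// either drop the offending columns before writing ...
val cleaned = badCols.foldLeft(df)((d, c) => d.drop(c))
// ... or rename them with withColumnRenamed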

Re: Read Kafka topic in a Spark batch job

2016-07-05 Thread Bruckwald Tamás
Thanks for your answer. Unfortunately I'm bound to Kafka 0.8.2.1. --Bruckwald nihed mbarek wrote: > Hi, are you using a new version of Kafka? If yes, since 0.9 the auto.offset.reset > parameter takes: earliest: automatically reset the offset to the earliest > offset; latest: automatically

Re: Graphframe Error

2016-07-05 Thread Arun Patel
Thanks Yanbo and Felix. I tried these commands on CDH Quickstart VM and also on "Spark 1.6 pre-built for Hadoop" version. I am still not able to get it working. Not sure what I am missing. Attaching the logs. On Mon, Jul 4, 2016 at 5:33 AM, Felix Cheung wrote:

Re: Read Kafka topic in a Spark batch job

2016-07-05 Thread nihed mbarek
Hi, are you using a new version of Kafka? If yes, since 0.9 the auto.offset.reset parameter takes: - earliest: automatically reset the offset to the earliest offset - latest: automatically reset the offset to the latest offset - none: throw an exception to the consumer if no previous offset

Read Kafka topic in a Spark batch job

2016-07-05 Thread Bruckwald Tamás
Hello, I'm writing a Spark (v1.6.0) batch job which reads from a Kafka topic. For this I can use org.apache.spark.streaming.kafka.KafkaUtils#createRDD; however, I need to set the offsets for all the partitions and also need to store them somewhere (ZK? HDFS?) to know where to start the next

Spark MLlib: network intensive algorithms

2016-07-05 Thread mshiryae
Hi, I have a question regarding ML algorithms. What are the most network-intensive algorithms in Spark MLlib? I have already looked at ALS (as pointed out here: https://databricks.com/blog/2014/07/23/scalable-collaborative-filtering-with-spark-mllib.html ALS is pretty communication and computation

StreamingKmeans Spark doesn't work at all

2016-07-05 Thread Biplob Biswas
Hi, I implemented the StreamingKMeans example provided on the Spark website, but in Java. The full implementation is here: http://pastebin.com/CJQfWNvk But I am not getting anything in the output except occasional timestamps like the one below: ---

Dataframe sort

2016-07-05 Thread tan shai
Hi, I need to sort a dataframe and retrieve the bounds of each partition. dataframe.sort() uses range partitioning in the physical plan, and I need to retrieve the partition bounds. Many thanks for your help.
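One hedged way to get per-partition bounds after the sort, assuming a numeric column named key (hypothetical); since the data is range-partitioned and sorted, the first and last rows of each partition are its bounds:

val sorted = df.sort("key")
val bounds = sorted.rdd.mapPartitionsWithIndex { (idx, rows) =>
  if (!rows.hasNext) Iterator.empty
  else {
    val keys = rows.map(_.getAs[Long]("key"))
    val first = keys.next()            // lower bound of this partition
    var last = first
    keys.foreach(k => last = k)        // upper bound of this partition
    Iterator((idx, first, last))
  }
}.collect()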

Re: Deploying ML Pipeline Model

2016-07-05 Thread Nick Pentreath
It all depends on your latency requirements and volume. 100s of queries per minute, with an acceptable latency of up to a few seconds? Yes, you could use Spark for serving, especially if you're smart about caching results (and I don't mean just Spark caching, but caching recommendation results for

Re: Deploying ML Pipeline Model

2016-07-05 Thread Nick Pentreath
Sean is correct - we now use jpmml-model (which is actually BSD 3-clause, where old jpmml was A2L, but either work) On Fri, 1 Jul 2016 at 21:40 Sean Owen wrote: > (The more core JPMML libs are Apache 2; OpenScoring is AGPL. We use > JPMML in Spark and couldn't otherwise

pyspark: dataframe.take is slow

2016-07-05 Thread immerrr again
Hi all! I'm having a strange issue with pyspark 1.6.1. I have a dataframe, df = sqlContext.read.parquet('/path/to/data') whose "df.take(10)" is really slow, apparently scanning the whole dataset to take the first ten rows. "df.first()" works fast, as does "df.rdd.take(10)". I have found

Re: Enforcing shuffle hash join

2016-07-05 Thread ??????
You can try setting "spark.shuffle.manager" to "hash". This is the meaning of the parameter: Implementation to use for shuffling data. There are two implementations available: sort and hash. Sort-based shuffle is more memory-efficient and is the default option starting in 1.2. --
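For reference, a hedged example of that setting (applies to Spark 1.x; note that spark.shuffle.manager selects the shuffle file implementation, which is a separate knob from the SQL join strategy discussed in this thread):

import org.apache.spark.SparkConf
val conf = new SparkConf().set("spark.shuffle.manager", "hash")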

Re: java.io.FileNotFoundException

2016-07-05 Thread Jacek Laskowski
On Tue, Jul 5, 2016 at 2:16 AM, kishore kumar wrote: > 2016-07-04 05:11:53,972 [dispatcher-event-loop-0] ERROR > org.apache.spark.scheduler.LiveListenerBus - Dropping SparkListenerEvent > because no remaining room in event queue. This likely means one of the

Re: Is that possible to launch spark streaming application on yarn with only one machine?

2016-07-05 Thread Yu Wei
Hi Deng, Thanks for the help. Actually I need to pay more attention to memory usage. I found the root cause of my problem: it seemed to be in the Spark Streaming MQTTUtils module. When I use "localhost" in the brokerURL, it doesn't work. After changing it to "127.0.0.1", it works now. Thanks
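A minimal Scala sketch of the MQTTUtils usage in question, with the IP-based broker URL; the port, topic name and the ssc StreamingContext are assumptions for illustration:

import org.apache.spark.streaming.mqtt.MQTTUtils
// "localhost" reportedly failed here; the loopback IP worked
val brokerUrl = "tcp://127.0.0.1:1883"
val lines = MQTTUtils.createStream(ssc, brokerUrl, "sensor/topic")
lines.print()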

How Spark HA works

2016-07-05 Thread Akmal Abbasov
Hi,  I'm trying to understand how Spark HA works. I'm using Spark 1.6.1 and Zookeeper 3.4.6. I've add the following line to $SPARK_HOME/conf/spark-env.sh export SPARK_DAEMON_JAVA_OPTS="-Dspark.deploy.recoveryMode=ZOOKEEPER -Dspark.deploy.zookeeper.url=zk1:2181,zk2:2181,zk3:2181

Re: Is that possible to launch spark streaming application on yarn with only one machine?

2016-07-05 Thread Deng Ching-Mallete
Hi Jared, You can launch a Spark application even with just a single node in YARN, provided that the node has enough resources to run the job. It might also be good to note that when YARN calculates the memory allocation for the driver and the executors, there is an additional memory overhead
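As a hedged illustration of that overhead: with the Spark 1.x default of max(384 MB, 10% of executor memory) for spark.yarn.executor.memoryOverhead, requesting --executor-memory 4g makes YARN allocate roughly 4096 MB + 410 MB, about 4.4 GB per executor container, which is easy to overlook when sizing a single-node cluster.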

Re: How to Create a Database in Spark SQL

2016-07-05 Thread Mich Talebzadeh
It should work: spark-sql> create database somedb; OK Time taken: 2.694 seconds spark-sql> show databases; OK accounts asehadoop default iqhadoop mytable_db oraclehadoop somedb test twitterdb Time taken: 1.277 seconds, Fetched 9 row(s) spark-sql> use somedb; OK Time taken: 0.059 seconds

How to Create a Database in Spark SQL

2016-07-05 Thread lokeshyadav
Hi, I am very new to Spark SQL and I have a very basic question: how do I create a database (or multiple databases) in Spark SQL? I am executing the SQL from the spark-sql CLI. A query like in Hive, create database sample_db, does not work here. I have Hadoop 2.7 and Spark 1.6 installed on my system.

Re: Enforcing shuffle hash join

2016-07-05 Thread Lalitha MV
By setting preferSortMergeJoin to false, it still only picks between merge join and broadcast join; it does not pick shuffle hash join, depending on autoBroadcastJoinThreshold's value. I went through SparkStrategies, and it doesn't look like there is a direct, clean way to enforce it. On Mon, Jul 4,
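For reference, a hedged sketch of the two settings this thread mentions, assuming a Spark 2.0 SparkSession named spark (values are illustrative only, and as noted above they do not by themselves force a shuffle hash join):

spark.conf.set("spark.sql.join.preferSortMergeJoin", "false")
// disable broadcast joins so the planner cannot pick them
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "-1")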

Is that possible to launch spark streaming application on yarn with only one machine?

2016-07-05 Thread Yu Wei
Hi guys, I set up a pseudo Hadoop/YARN cluster on my laptop. I wrote a simple Spark Streaming program as below to receive messages with MQTTUtils. conf = new SparkConf().setAppName("Monitor"); jssc = new JavaStreamingContext(conf, Durations.seconds(1)); JavaReceiverInputDStream inputDS =