Re: Windowed Operations

2015-06-01 Thread DMiner
I'm running into the same issue. Any updates on this?

Create dataframe from saved objectfile RDD

2015-06-01 Thread bipin
Hi, what is the method to create a DataFrame from an RDD that was saved as an objectfile? I don't have a Java object, but a StructType that I want to use as the schema for the DataFrame. How can I load the objectfile without the object class? I tried retrieving it as Row: val myrdd =
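A minimal sketch of one way to do this, assuming the objectfile was originally saved from an RDD[Row]; the path and field names here are hypothetical:

    import org.apache.spark.sql.Row
    import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

    // The StructType to use as the schema (field names are placeholders)
    val schema = StructType(Seq(
      StructField("id", IntegerType, nullable = false),
      StructField("name", StringType, nullable = true)))

    // Read the saved objectfile back as rows, then attach the schema
    val myrdd = sc.objectFile[Row]("/path/to/objectfile")
    val df = sqlContext.createDataFrame(myrdd, schema)

If the objectfile was saved from an RDD of some other type, it would have to be mapped into Row objects before createDataFrame is called.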

Re: SparkSQL can't read S3 path for hive external table

2015-06-01 Thread Michael Armbrust
This sounds like a problem that was fixed in Spark 1.3.1. https://issues.apache.org/jira/browse/SPARK-6351 On Mon, Jun 1, 2015 at 5:44 PM, Akhil Das ak...@sigmoidanalytics.com wrote: This thread

Re: Exception writing on two Cassandra tables NoHostAvailableException: All host(s) tried for query failed (no host was tried)

2015-06-01 Thread Helena Edelson
Hi Antonio, First, what version of the Spark Cassandra Connector are you using? You are using Spark 1.3.1, which the Cassandra connector today supports in builds from the master branch only - the release with public artifacts supporting Spark 1.3.1 is coming soon ;) Please see

Re: Windows of windowed streams not displaying the expected results

2015-06-01 Thread DMiner
Yes, I'm hitting this issue too, and wanted to check whether you fixed it or found another solution for the same goal.

Don't understand job scheduling within an Application

2015-06-01 Thread bit1...@163.com
Hi Sparks, the following is copied from the Spark online documentation http://spark.apache.org/docs/latest/job-scheduling.html. Basically, I have two questions about it: 1. If two jobs in an application have dependencies, that is, one job depends on the result of the other job, then I think they will

Cassandra example

2015-06-01 Thread Yasemin Kaya
Hi, I want to write my RDD to a Cassandra database, and I took an example from this site http://www.datastax.com/dev/blog/accessing-cassandra-from-spark-in-java. I added it to my project but I'm getting errors. Here is my project in a gist https://gist.github.com/yaseminn/aba86dad9a3e6d6a03dc. Errors:

Re: SparkSQL can't read S3 path for hive external table

2015-06-01 Thread Akhil Das
This thread http://stackoverflow.com/questions/24048729/how-to-read-input-from-s3-in-a-spark-streaming-ec2-cluster-application covers various methods of accessing S3 from Spark; it might help you. Thanks Best Regards On Sun, May 24, 2015 at 8:03 AM, ogoh oke...@gmail.com wrote: Hello, I am

RE: FW: Websphere MQ as a data source for Apache Spark Streaming

2015-06-01 Thread Chaudhary, Umesh
Thanks for your suggestion. Yes, via dstream.saveAsTextFiles(). I was making the mistake of writing StorageLevel.NULL while overriding the storageLevel method in my custom receiver. When I changed it to StorageLevel.MEMORY_AND_DISK_2(), data started being saved to disk. Now it's running without any
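For reference, in the Scala API the storage level is passed to the Receiver constructor; a minimal sketch of a custom receiver using MEMORY_AND_DISK_2, where the MQ details are placeholders:

    import org.apache.spark.storage.StorageLevel
    import org.apache.spark.streaming.receiver.Receiver

    // Hedged sketch: the storage level goes into the Receiver constructor
    class MQReceiver(queueName: String)
        extends Receiver[String](StorageLevel.MEMORY_AND_DISK_2) {

      def onStart(): Unit = {
        // connect to the queue and call store(message) for each message received
      }

      def onStop(): Unit = {
        // release the queue connection
      }
    }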

RE: Spark Executor Memory Usage

2015-06-01 Thread HuS . Andy
#1 I'm not sure I got your point. As I understand it, Xmx is not turned into physical memory as soon as the process starts; it is first reserved as virtual memory, and if the heap needs more, physical memory will gradually increase up to the max heap. #2 Physical memory contains not only the heap, but

Re: RDD boundaries and triggering processing using tags in the data

2015-06-01 Thread Akhil Das
Maybe you can make use of the window operations https://spark.apache.org/docs/1.2.0/streaming-programming-guide.html#window-operations. Another approach would be to keep your incoming data in an HBase/Redis/Cassandra kind of database, and then whenever you need to average it, you just query the
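As a rough illustration of the windowing idea, assuming a DStream of (key, value) pairs named events; all names here are placeholders:

    import org.apache.spark.streaming.Seconds

    // Sliding 30-second window, recomputed every 10 seconds:
    // carry (sum, count) pairs so a per-key average can be derived
    val sumsAndCounts = events
      .map { case (k, v) => (k, (v, 1L)) }
      .reduceByKeyAndWindow(
        (a: (Double, Long), b: (Double, Long)) => (a._1 + b._1, a._2 + b._2),
        Seconds(30), Seconds(10))
    val averages = sumsAndCounts.mapValues { case (sum, n) => sum / n }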

Re: Cassandra example

2015-06-01 Thread Akhil Das
Here's more detailed documentation https://github.com/datastax/spark-cassandra-connector from DataStax. You can also shoot an email directly to their mailing list http://groups.google.com/a/lists.datastax.com/forum/#!forum/spark-connector-user since it's more related to their code. Thanks Best
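The core write path in the connector's Scala API is very small; a hedged sketch, where the keyspace, table, and columns are hypothetical (the Java API the thread uses wraps the same call):

    import com.datastax.spark.connector._

    // Write an RDD of tuples to an existing Cassandra table;
    // the column names must match the table definition
    val rows = sc.parallelize(Seq((1, "first"), (2, "second")))
    rows.saveToCassandra("my_keyspace", "my_table", SomeColumns("id", "value"))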

Streaming K-medoids

2015-06-01 Thread Marko Dinic
Hello everyone, I have an idea and I would like to get validation from the community about it. In Mahout there is an implementation of Streaming K-means. I'm interested in your opinion: would it make sense to make a similar implementation of streaming K-medoids? K-medoids has even bigger

Re: Streaming K-medoids

2015-06-01 Thread Erik Erlandson
I haven't given any thought to streaming it, but in case it's useful I do have a k-medoids implementation for Spark: http://silex.freevariable.com/latest/api/#com.redhat.et.silex.cluster.KMedoids Also a blog post about multi-threading it:

Event Logging to HDFS on Standalone Cluster In Progress

2015-06-01 Thread Richard Marscher
Hi, In Spark 1.3.0 I've enabled event logging to write to an existing HDFS folder on a Standalone cluster. This is generally working; all the logs are being written. However, from the Master Web UI, the vast majority of completed applications are labeled as not having a history:
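For anyone following along, the 1.3-era event-log settings in conf/spark-defaults.conf look roughly like this; the HDFS path is a placeholder:

    # Enable event logging and point it at a pre-existing HDFS folder
    spark.eventLog.enabled  true
    spark.eventLog.dir      hdfs://namenode:8020/spark-events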

Re: Spark stages very slow to complete

2015-06-01 Thread ayan guha
Would you mind posting the code? On 2 Jun 2015 00:53, Karlson ksonsp...@siberie.de wrote: Hi, In all (pyspark) Spark jobs that become somewhat more involved, I am experiencing the issue that some stages take a very long time to complete, and sometimes don't complete at all. This clearly correlates

RE: Don't understand schedule jobs within an Application

2015-06-01 Thread yana
1. Yes, if two tasks depend on each other they can't be parallelized. 2. Imagine something like a web-application driver. You only get to have one SparkContext, but now you want to run many concurrent jobs. They have nothing to do with each other, so there is no reason to keep them sequential. Hope this helps
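A minimal sketch of the second point, assuming a shared SparkContext with the FAIR scheduler enabled; the app name and paths are placeholders:

    import org.apache.spark.{SparkConf, SparkContext}

    val sc = new SparkContext(new SparkConf()
      .setAppName("web-driver")
      .set("spark.scheduler.mode", "FAIR"))

    // Two unrelated jobs submitted from separate threads can run concurrently
    new Thread(new Runnable {
      def run(): Unit = println(sc.textFile("/data/a").count())
    }).start()
    new Thread(new Runnable {
      def run(): Unit = println(sc.textFile("/data/b").count())
    }).start()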

Spark stages very slow to complete

2015-06-01 Thread Karlson
Hi, In all (pyspark) Spark jobs that become somewhat more involved, I am experiencing the issue that some stages take a very long time to complete, and sometimes don't complete at all. This clearly correlates with the size of my input data. Looking at the stage details for one such stage, I am

Re: FAILED_TO_UNCOMPRESS(5) errors when fetching shuffle data with sort-based shuffle

2015-06-01 Thread ๏̯͡๏
I am seeing the same issue with Spark 1.3.1, when reading a sequence file stored in SequenceFile format (SEQorg.apache.hadoop.io.Textorg.apache.hadoop.io.Text'org.apache.hadoop.io.compress.GzipCodec?v? ) All I do is sc.sequenceFile(dwTable, classOf[Text],

java.io.IOException: FAILED_TO_UNCOMPRESS(5)

2015-06-01 Thread ๏̯͡๏
Any suggestions? I'm using Spark 1.3.1 to read a sequence file stored in SequenceFile format (SEQorg.apache.hadoop.io.Textorg.apache.hadoop.io.Text'org.apache.hadoop.io.compress.GzipCodec?v? ) with this code and settings: sc.sequenceFile(dwTable, classOf[Text], classOf[Text]).partitionBy(new
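For context, a hedged sketch of the read path being described; dwTable is the thread's own variable, and the string copies are a common precaution because Hadoop reuses Writable instances:

    import org.apache.hadoop.io.Text

    // Read a gzip-compressed SequenceFile of Text/Text pairs and
    // materialize plain strings before any caching or shuffling
    val pairs = sc.sequenceFile(dwTable, classOf[Text], classOf[Text])
      .map { case (k, v) => (k.toString, v.toString) }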

Re: Exception writing on two Cassandra tables NoHostAvailableException: All host(s) tried for query failed (no host was tried)

2015-06-01 Thread Antonio Giambanco
:D Very happy, Helena. I'll check tomorrow morning. A G On 1 Jun 2015, at 19:45, Helena Edelson helena.edel...@datastax.com wrote: Hi Antonio, It's your lucky day ;) We just released Spark Cassandra Connector 1.3.0-M1 for Spark 1.3 and the DataSources API. Give it a little

RE: Need some Cassandra integration help

2015-06-01 Thread Mohammed Guller
Hi Yana, Not sure whether you already solved this issue. As far as I know, the DataFrame support in Spark Cassandra connector was added in version 1.3. The first milestone release of SCC v1.3 was just announced. Mohammed From: Yana Kadiyska [mailto:yana.kadiy...@gmail.com] Sent: Tuesday, May

Re: SparkSQL can't read S3 path for hive external table

2015-06-01 Thread Okehee Goh
Thanks, Michael and Akhil. Yes, it worked with Spark 1.3.1 along with AWS EMR AMI 3.7. Sorry I didn't update the status. On Mon, Jun 1, 2015 at 5:17 AM, Michael Armbrust mich...@databricks.com wrote: This sounds like a problem that was fixed in Spark 1.3.1.

flatMap output on disk / flatMap memory overhead

2015-06-01 Thread octavian.ganea
Hi, Is there any way to force the output RDD of a flatMap op to be stored in both memory and disk as it is computed? My RAM would not be able to fit the entire output of the flatMap, so it really needs to start using disk after the RAM gets full. I didn't find any way to force this. Also, what
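What is being described here sounds like the MEMORY_AND_DISK storage level; a hedged sketch, where the input RDD and function are placeholders:

    import org.apache.spark.storage.StorageLevel

    // Spill partitions that don't fit in RAM to disk instead of recomputing them
    val expanded = input.flatMap(expand).persist(StorageLevel.MEMORY_AND_DISK)
    expanded.count() // the first action materializes it, spilling as needed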

Re: Exception writing on two Cassandra tables NoHostAvailableException: All host(s) tried for query failed (no host was tried)

2015-06-01 Thread Helena Edelson
Hi Antonio, It's your lucky day ;) We just released Spark Cassandra Connector 1.3.0-M1 for Spark 1.3 and the DataSources API. Give it a little while to propagate to http://search.maven.org/#search%7Cga%7C1%7Cspark-cassandra-connector

RE: Anybody using Spark SQL JDBC server with DSE Cassandra?

2015-06-01 Thread Mohammed Guller
Nobody using Spark SQL JDBC/Thrift server with DSE Cassandra? Mohammed From: Mohammed Guller [mailto:moham...@glassbeam.com] Sent: Friday, May 29, 2015 11:49 AM To: user@spark.apache.org Subject: Anybody using Spark SQL JDBC server with DSE Cassandra? Hi - We have successfully integrated Spark

Re: Exception writing on two Cassandra tables NoHostAvailableException: All host(s) tried for query failed (no host was tried)

2015-06-01 Thread Antonio Giambanco
Hi Helena, thanks for answering me... I didn't realize it could be the connector version; unfortunately, I haven't tried it yet. I know Scala is better, but I'm using Drools and I'm forced to use Java. In my project I'm using spark-cassandra-connector-java_2.10. From Cassandra I have only this log: INFO

RE: Migrate Relational to Distributed

2015-06-01 Thread Mohammed Guller
Brant, You should be able to migrate most of your existing SQL code to Spark SQL, but remember that Spark SQL does not yet support the full ANSI standard. So you may need to rewrite some of your existing queries. Another thing to keep in mind is that Spark SQL is not real-time. The response
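For a flavor of what migrated code looks like against the 1.3-era API; the source path, table, and column names here are hypothetical:

    // Expose a DataFrame to Spark SQL, then query it with familiar syntax
    val orders = sqlContext.jsonFile("/data/orders.json")
    orders.registerTempTable("orders")
    val totals = sqlContext.sql(
      "SELECT customer_id, SUM(amount) AS total FROM orders GROUP BY customer_id")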

Re: union and reduceByKey wrong shuffle?

2015-06-01 Thread Igor Berman
Switching to simple POJOs instead of Avro for Spark serialization solved the problem (I mean reading Avro from S3 and then mapping each Avro object to its serializable POJO counterpart with the same fields; the POJO is registered with Kryo). Any thought on where to look for a
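For readers wondering what "registered with Kryo" amounts to, a hedged sketch; EventPojo stands in for the serializable counterpart class:

    import org.apache.spark.SparkConf

    // Stand-in for the POJO counterpart with the same fields as the Avro type
    case class EventPojo(id: Long, name: String)

    val conf = new SparkConf()
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      .registerKryoClasses(Array(classOf[EventPojo]))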

Dataframe random permutation?

2015-06-01 Thread Cesar Flores
I would like to know the best approach to randomly permute a DataFrame. I have tried df.sample(false, 1.0, x).show(100), where x is the seed. However, it gives the same result no matter the value of x (it only gives different values when the fraction is smaller than 1.0). I have

Re: Event Logging to HDFS on Standalone Cluster In Progress

2015-06-01 Thread Richard Marscher
Ah, apologies, I found an existing issue and fix has already gone out for this in 1.3.1 and up: https://issues.apache.org/jira/browse/SPARK-6036. On Mon, Jun 1, 2015 at 3:39 PM, Richard Marscher rmarsc...@localytics.com wrote: It looks like it is possibly a race condition between removing the

Re: Restricting the number of iterations in Mllib Kmeans

2015-06-01 Thread Joseph Bradley
Hi Suman Meethu, Apologies---I was wrong about KMeans supporting an initial set of centroids! JIRA created: https://issues.apache.org/jira/browse/SPARK-8018 If you're interested in submitting a PR, please do! Thanks, Joseph On Mon, Jun 1, 2015 at 2:25 AM, MEETHU MATHEW meethu2...@yahoo.co.in

SparkSQL's performance gets degraded depending on the number of partitions of Hive tables... is it normal?

2015-06-01 Thread ogoh
Hello, I posted this question a while back but am posting it again to get your attention. I am using SparkSQL 1.3.1 and Hive 0.13.1 on AWS YARN (tested under both 1.3.0 and 1.3.1). My Hive table is partitioned. I noticed that the query response time gets worse depending on the number of partitions

Re: Event Logging to HDFS on Standalone Cluster In Progress

2015-06-01 Thread Richard Marscher
It looks like it is possibly a race condition between removing the IN_PROGRESS suffix and building the history UI for the application. `AppClient` sends an `UnregisterApplication(appId)` message to the `Master` actor, which triggers the process to look for the app's eventLogs. If they are suffixed with

Spark 1.3.1 On Mesos Issues.

2015-06-01 Thread John Omernik
All - I am facing an odd issue and I am not really sure where to go for support at this point. I am running MapR, which complicates things as it relates to Mesos; however, this HAS worked in the past with no issues, so I am stumped here. So for starters, here is what I am trying to run. This is a

Re: union and reduceByKey wrong shuffle?

2015-06-01 Thread Josh Rosen
How much work is it to produce a small standalone reproduction? Can you create an Avro file with some mock data, maybe 10 or so records, then reproduce this locally? On Mon, Jun 1, 2015 at 12:31 PM, Igor Berman igor.ber...@gmail.com wrote: Switching to simple POJOs instead of Avro for

map - reduce only with disk

2015-06-01 Thread octavian.ganea
Dear all, Does anyone know how I can force Spark to use only the disk when doing a simple flatMap(..).groupByKey.reduce(_ + _)? Thank you!
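A hedged sketch of the closest knob, assuming the flatMap produces key/value pairs (all names are placeholders); note that the shuffle itself still writes to local disk regardless:

    import org.apache.spark.storage.StorageLevel

    // Keep the intermediate RDD on disk only, never in memory
    val result = data.flatMap(explode)
      .persist(StorageLevel.DISK_ONLY)
      .groupByKey()
      .mapValues(_.sum)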

Re: Dataframe random permutation?

2015-06-01 Thread Peter Rudenko
Hi Cesar, try: hc.createDataFrame(df.rdd.coalesce(NUM_PARTITIONS, shuffle = true), df.schema). It's a bit inefficient, but it should shuffle the whole DataFrame. Thanks, Peter Rudenko On 2015-06-01 22:49, Cesar Flores wrote: I would like to know the best approach to randomly
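An alternative sketch that yields a genuinely random row order, at the cost of a sort; it assumes a 1.3-era SQLContext named sqlContext:

    import scala.util.Random

    // Attach a random key to each row, sort by it, then drop the key
    val permuted = sqlContext.createDataFrame(
      df.rdd.map(row => (Random.nextDouble(), row)).sortByKey().values,
      df.schema)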

Re: PySpark with OpenCV causes python worker to crash

2015-06-01 Thread Davies Liu
Could you run the single-threaded version on a worker machine to make sure that OpenCV is installed and configured correctly? On Sat, May 30, 2015 at 6:29 AM, Sam Stoelinga sammiest...@gmail.com wrote: I've verified the issue lies within Spark running OpenCV code and not within the sequence file

Spark 1.3.1 bundle does not build - unresolved dependency

2015-06-01 Thread Stephen Boesch
I downloaded the 1.3.1 distro tarball:
    $ ll ../spark-1.3.1.tar.gz
    -rw-r-@ 1 steve staff 8500861 Apr 23 09:58 ../spark-1.3.1.tar.gz
However, the build on it is failing with an unresolved dependency: *configuration not public*
    $ build/sbt assembly -Dhadoop.version=2.5.2 -Pyarn -Phadoop-2.4

How to monitor Spark Streaming from Kafka?

2015-06-01 Thread dgoldenberg
Hi, What are some of the good/adopted approaches to monitoring Spark Streaming from Kafka? I see that there are things like http://quantifind.github.io/KafkaOffsetMonitor, for example. Do they all assume that Receiver-based streaming is used? Then there is the note that one disadvantage of this approach

Re: Spark updateStateByKey fails with class leak when using case classes - resend

2015-06-01 Thread Tathagata Das
Interesting, only in local[*]! In the GitHub repo you pointed to, what is the main that you were running? TD On Mon, May 25, 2015 at 9:23 AM, rsearle eggsea...@verizon.net wrote: Further experimentation indicates these problems only occur when master is local[*]. There are no issues if a

Re: java.io.IOException: FAILED_TO_UNCOMPRESS(5)

2015-06-01 Thread Andrew Or
Hi Deepak, This is a notorious bug that is being tracked at https://issues.apache.org/jira/browse/SPARK-4105. We have fixed one source of this bug (it turns out Snappy had a bug in buffer reuse that caused data corruption). There are other known sources that are being addressed in outstanding

Re: java.io.IOException: FAILED_TO_UNCOMPRESS(5)

2015-06-01 Thread Josh Rosen
If you can't run a patched Spark version, then you could also consider using LZF compression instead, since that codec isn't affected by this bug. On Mon, Jun 1, 2015 at 3:32 PM, Andrew Or and...@databricks.com wrote: Hi Deepak, This is a notorious bug that is being tracked at
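For reference, switching codecs is a one-line configuration change, set before the SparkContext is created:

    import org.apache.spark.SparkConf

    // Use LZF instead of the default Snappy for shuffle/IO compression
    val conf = new SparkConf().set("spark.io.compression.codec", "lzf")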

Re: How to monitor Spark Streaming from Kafka?

2015-06-01 Thread Cody Koeninger
KafkaCluster.scala in the spark/external/kafka project has a bunch of API code, including code for updating Kafka-managed ZK offsets. Look at setConsumerOffsets. Unfortunately all of that code is private, but you can either write your own, copy it, or do what I do (sed out private[spark] and

Re: How to monitor Spark Streaming from Kafka?

2015-06-01 Thread Tathagata Das
In the receiver-less direct approach, there is no concept of a consumer group, as we don't use the Kafka high-level consumer (which uses ZK). Instead, Spark Streaming manages offsets on its own, giving tighter guarantees. If you want to monitor the progress of the processing of offsets, you will have to
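A hedged sketch of what monitoring the direct approach looks like; directStream is a placeholder for a stream created with KafkaUtils.createDirectStream:

    import org.apache.spark.streaming.kafka.HasOffsetRanges

    // Each batch RDD from the direct stream carries its Kafka offset ranges;
    // log or persist them to track consumption progress
    directStream.foreachRDD { rdd =>
      val ranges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
      ranges.foreach { r =>
        println(s"${r.topic}-${r.partition}: ${r.fromOffset} -> ${r.untilOffset}")
      }
    }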

Re: Spark 1.3.1 On Mesos Issues.

2015-06-01 Thread Dean Wampler
It would be nice to see the code for the MapR FS Java API, but my google-fu failed me (assuming it's open source)... So, shooting in the dark ;) there are a few things I would check, if you haven't already: 1. Could there be 1.2 versions of some Spark jars that get picked up at run time (but

Re: How to monitor Spark Streaming from Kafka?

2015-06-01 Thread Otis Gospodnetic
I think you can use SPM - http://sematext.com/spm - it will give you all Spark and all Kafka metrics, including offsets broken down by topic, etc. out of the box. I see more and more people using it to monitor various components in data processing pipelines, a la

Re: How to monitor Spark Streaming from Kafka?

2015-06-01 Thread Dmitry Goldenberg
Thank you, Tathagata, Cody, Otis. - Dmitry On Mon, Jun 1, 2015 at 6:57 PM, Otis Gospodnetic otis.gospodne...@gmail.com wrote: I think you can use SPM - http://sematext.com/spm - it will give you all Spark and all Kafka metrics, including offsets broken down by topic, etc. out of the box.

HDFS Rest Service not available

2015-06-01 Thread Su She
Hello All, A bit scared I did something stupid... I killed a few PIDs that were listening on ports 2183 (Kafka) and 4042 (Spark app); some of the PIDs didn't even seem to be stopped, as they are still running when I do lsof -i:[port number]. I'm not sure if the problem started after or before I did

Building Spark for Hadoop 2.6.0

2015-06-01 Thread Mulugeta Mammo
Does this build Spark for hadoop version 2.6.0? build/mvn -Pyarn -Phadoop-2.6 -Dhadoop.version=2.6.0 -DskipTests clean package Thanks!

Re: java.io.IOException: FAILED_TO_UNCOMPRESS(5)

2015-06-01 Thread ๏̯͡๏
Hello Josh, Are you suggesting storing the source data with LZF compression and using the same Spark code as is? Currently it's stored in sequence file format and compressed with GZIP. First line of the data: (SEQorg.apache.hadoop.io.Textorg.apache.hadoop.io.Text'

Re: GroupBy on RDD returns empty collection

2015-06-01 Thread Malte
I just ran the same app with limited data on my personal machine - no error. Seems to be a Mesos issue. Will investigate further. If anyone knows anything, let me know :)

Re: Best strategy for Pandas - Spark

2015-06-01 Thread Davies Liu
The second one sounds reasonable, I think. On Thu, Apr 30, 2015 at 1:42 AM, Olivier Girardot o.girar...@lateral-thoughts.com wrote: Hi everyone, Let's assume I have a complex workflow of more than 10 datasources as input - 20 computations (some creating intermediary datasets and some merging

Re: deos randomSplit return a copy or a reference to the original rdd? [Python]

2015-06-01 Thread Davies Liu
No, all of the RDDs (including those returned from randomSplit()) are read-only. On Mon, Apr 27, 2015 at 11:28 AM, Pagliari, Roberto rpagli...@appcomsci.com wrote: Suppose I have something like the code below for idx in xrange(0, 10): train_test_split =
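A tiny Scala illustration of the same point (the Python API behaves the same way; the weights and seeds are arbitrary):

    // randomSplit returns new, read-only RDDs; the parent is unchanged
    // and can be split again with a different seed
    val Array(train, test) = data.randomSplit(Array(0.8, 0.2), seed = 42L)
    val Array(train2, test2) = data.randomSplit(Array(0.8, 0.2), seed = 43L)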

Spark 1.3.0: how to let Spark history load old records?

2015-06-01 Thread Haopu Wang
When I start the Spark master process, the old records are not shown in the monitoring UI. How can I show the old records? Thank you very much!
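A hedged sketch of the usual setup for this: enable event logging and point the history server (sbin/start-history-server.sh) at the same directory; the HDFS paths are placeholders:

    # conf/spark-defaults.conf
    spark.eventLog.enabled           true
    spark.eventLog.dir               hdfs://namenode:8020/spark-events
    spark.history.fs.logDirectory    hdfs://namenode:8020/spark-events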

Re: Building Spark for Hadoop 2.6.0

2015-06-01 Thread Ted Yu
Looks good. -Dhadoop.version is not needed because the profile already defines it:
    <profile>
      <id>hadoop-2.6</id>
      <properties>
        <hadoop.version>2.6.0</hadoop.version>
On Mon, Jun 1, 2015 at 5:51 PM, Mulugeta Mammo mulugeta.abe...@gmail.com wrote: Does this build Spark for hadoop