Spark 1.5.2 + Hive 1.0.0 in Amazon EMR 4.2.0

2015-11-30 Thread Daniel Lopes
Hi, I get this error when trying to write a Spark DataFrame to a Hive table stored as TextFile: sqlContext.sql('INSERT OVERWRITE TABLE analytics.client_view_stock *(hive table)* SELECT * FROM client_view_stock *(spark temp table)*') Error: 15/11/30 21:40:14 INFO latency: StatusCode=[404],

Re: Help with type check

2015-11-30 Thread Jakob Odersky
Hi Eyal, what you're seeing is not a Spark issue; it is related to boxed types. I assume 'b' in your code is some kind of Java buffer, where b.getDouble() returns an instance of java.lang.Double and not a scala.Double. Hence muCouch is an Array[java.lang.Double], an array containing boxed
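
A minimal sketch of the fix being described, assuming the data arrives as boxed java.lang.Double values (the names here are only illustrative):

    // Boxed array, e.g. as handed back by a Java API such as a Couchbase buffer
    val boxed: Array[java.lang.Double] = Array(1.0, 2.5, 3.7)

    // Explicitly unbox each element to obtain a plain scala.Double array
    val unboxed: Array[Double] = boxed.map(_.doubleValue)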

RE: Can't start master on Windows 7

2015-11-30 Thread Tim Barthram
Hi Jacek, To run a Spark master on my Windows box, I've created a .bat file with contents something like: .\bin\spark-class.cmd org.apache.spark.deploy.master.Master --host For the worker: .\bin\spark-class.cmd org.apache.spark.deploy.worker.Worker spark://:7077 To wrap these in services,
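
For reference, a hedged sketch of what such .bat files might contain, with <master-host> standing in for your own hostname (a placeholder, not taken from the original mail):

    rem master.bat
    .\bin\spark-class.cmd org.apache.spark.deploy.master.Master --host <master-host>

    rem worker.bat
    .\bin\spark-class.cmd org.apache.spark.deploy.worker.Worker spark://<master-host>:7077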

PySpark failing on a mid-sized broadcast

2015-11-30 Thread ameyc
So I'm running PySpark 1.3.1 on Amazon EMR on a fairly beefy cluster (20 nodes with 32 cores and 64 GB of memory each), and my parallelism, executor.instances, executor.cores, and executor memory settings are also fairly reasonable (600, 20, 30, 48 GB). However, my job invariably fails

dfs.blocksize is not applicable to some cases

2015-11-30 Thread Jung
Hello, I use Spark 1.4.1 and Hadoop 2.2.0. It may be a stupid question, but I cannot understand why the Hadoop option "dfs.blocksize" sometimes doesn't affect the number of blocks. When I run the script below, val BLOCK_SIZE = 1024 * 1024 * 512 // set to 512MB, hadoop default is 128MB

Re: Spark Streaming on mesos

2015-11-30 Thread Iulian Dragoș
Hi, Latency isn't as big an issue as it sounds. Did you try it out and fail to meet some performance metric? In short, the *Mesos* executor on a given slave is going to be long-running (consuming memory, but no CPUs). Each Spark task will be scheduled using Mesos CPU resources, but they don't suffer

Re: Can't start master on Windows 7

2015-11-30 Thread Jacek Laskowski
On Fri, Nov 27, 2015 at 4:27 PM, Shuo Wang wrote: > I am trying to use the start-master.sh script on windows 7. From http://spark.apache.org/docs/latest/spark-standalone.html: "Note: The launch scripts do not currently support Windows. To run a Spark cluster on Windows,

Logs of Custom Receiver

2015-11-30 Thread Matthias Niehoff
Hi, I've built a custom receiver and deployed an application using this receiver to a cluster. When I run the application locally I see the log output of my logger in stdout/stderr, but when I run it on the cluster I don't see the log output in stdout/stderr. I just see the logs in the constructor

Re: Debug Spark

2015-11-30 Thread Jacek Laskowski
Hi, Yes, that's possible -- I'm doing it every day in local and standalone modes. Just use SPARK_PRINT_LAUNCH_COMMAND=1 before any Spark command, e.g. spark-submit or spark-shell, to know the command used to start it: $ SPARK_PRINT_LAUNCH_COMMAND=1 ./bin/spark-shell SPARK_PRINT_LAUNCH_COMMAND

Re: How to get a single available message from kafka (case where OffsetRange.fromOffset == OffsetRange.untilOffset)

2015-11-30 Thread Cody Koeninger
If you had exactly 1 message in the 0th topicpartition, to read it you would use OffsetRange("topicname", 0, 0, 1). Kafka's simple shell consumer in that case would print next offset = 1, so trying to consume OffsetRange("topicname", 0, 1, 2) instead shouldn't be expected to work. On Sat,
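
A small sketch of reading that single message with the batch Kafka API, assuming the topic is "topicname" and a hypothetical broker address:

    import kafka.serializer.StringDecoder
    import org.apache.spark.streaming.kafka.{KafkaUtils, OffsetRange}

    val kafkaParams = Map("metadata.broker.list" -> "broker1:9092") // placeholder broker
    // Exactly one message in partition 0: fromOffset = 0, untilOffset = 1 (exclusive)
    val ranges = Array(OffsetRange("topicname", 0, 0, 1))

    val rdd = KafkaUtils.createRDD[String, String, StringDecoder, StringDecoder](
      sc, kafkaParams, ranges)
    rdd.collect().foreach { case (_, value) => println(value) }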

Re: Spark directStream with Kafka and process the lost messages.

2015-11-30 Thread Guillermo Ortiz
Then... something is wrong in my code ;), thanks. 2015-11-30 16:46 GMT+01:00 Cody Koeninger : > Starting from the checkpoint using getOrCreate should be sufficient if all > you need is at-least-once semantics

Spark directStream with Kafka and process the lost messages.

2015-11-30 Thread Guillermo Ortiz
Hello, I have Spark and Kafka with directStream. I'm trying to ensure that if Spark dies, it can process all those messages when it starts again. The offsets are stored in checkpoints, but I don't know how to tell Spark to start from that point. I saw that there's another createDirectStream method with a

Re: Spark Streaming on mesos

2015-11-30 Thread Renjie Liu
Hi, Iulian: Are you sure that it'll be a long-running process in fine-grained mode? I think you have a misunderstanding about it. An executor will be launched for some tasks, but it is not a long-running process. When a group of tasks finishes, it will get shut down. On Mon, Nov 30, 2015 at 6:25 PM

Re: Permanent RDD growing with Kafka DirectStream

2015-11-30 Thread Cody Koeninger
Can you post the relevant code? On Fri, Nov 27, 2015 at 4:25 AM, u...@moosheimer.com wrote: > Hi, > > we have some strange behavior with KafkaUtils DirectStream and the size of > the MapPartitionsRDDs. > > We use a permanent direct stream where we consume about 8.500 json >

Re: Spark directStream with Kafka and process the lost messages.

2015-11-30 Thread Cody Koeninger
Starting from the checkpoint using getOrCreate should be sufficient if all you need is at-least-once semantics http://spark.apache.org/docs/latest/streaming-programming-guide.html#checkpointing On Mon, Nov 30, 2015 at 9:38 AM, Guillermo Ortiz wrote: > Hello, > > I have
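
A rough sketch of that recovery pattern, with a hypothetical checkpoint directory; the important part is that the whole stream setup happens inside the creating function:

    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val checkpointDir = "hdfs:///tmp/streaming-checkpoint" // hypothetical path

    def createContext(): StreamingContext = {
      val ssc = new StreamingContext(sc, Seconds(10))
      ssc.checkpoint(checkpointDir)
      // ... build the Kafka direct stream and wire up all processing here ...
      ssc
    }

    // On restart this rebuilds the context (including stored offsets) from the
    // checkpoint if it exists; otherwise it calls createContext for a fresh one.
    val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
    ssc.start()
    ssc.awaitTermination()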

Re: Logs of Custom Receiver

2015-11-30 Thread Ted Yu
Have you seen this thread? http://search-hadoop.com/m/q3RTtEor1vYWbsW=RE+Configuring+Log4J+Spark+1+5+on+EMR+4+1+ which mentioned SPARK-11105 FYI 2015-11-30 9:00 GMT-08:00 Matthias Niehoff : > Hi, > > I've built a customer receiver and deployed an application

Re: SparkException: Failed to get broadcast_10_piece0

2015-11-30 Thread Spark Newbie
Pinging again ... On Wed, Nov 25, 2015 at 4:19 PM, Ted Yu wrote: > Which Spark release are you using ? > > Please take a look at: > https://issues.apache.org/jira/browse/SPARK-5594 > > Cheers > > On Wed, Nov 25, 2015 at 3:59 PM, Spark Newbie >

Obtaining Job Id for query submitted via Spark Thrift Server

2015-11-30 Thread Jagrut Sharma
Is there a way to get the Job Id for a query submitted via the Spark Thrift Server? This would allow checking the status of that specific job via the History Server. Currently, I'm getting status of all jobs, and then filtering the results. Looking for a more efficient approach. Test environment

Re: how to using local repository in spark[dev]

2015-11-30 Thread Jacek Laskowski
Hi, Maven uses its own repo as does sbt. To cross the repo boundaries use the following: resolvers += Resolver.mavenLocal in your build.sbt or any other build definition as described in http://www.scala-sbt.org/0.13/tutorial/Library-Dependencies.html#Resolvers. You did it so let's give the
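
For completeness, a minimal build.sbt along those lines (project name and versions are only examples):

    name := "my-spark-app"
    scalaVersion := "2.10.5"

    // Make artifacts installed into the local Maven repo (~/.m2) visible to sbt
    resolvers += Resolver.mavenLocal

    libraryDependencies += "org.apache.spark" %% "spark-core" % "1.5.2-SNAPSHOT"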

Re: Spark on yarn vs spark standalone

2015-11-30 Thread Jacek Laskowski
Hi, My understanding of Spark on YARN, and even Spark in general, is very limited, so keep that in mind. I'm not sure why you're comparing yarn-cluster and Spark standalone. In yarn-cluster a driver runs on a node in the YARN cluster, while Spark standalone keeps the driver on the machine you launched a

Re: Spark, Windows 7 python shell non-reachable ip address

2015-11-30 Thread Jacek Laskowski
Hi, I'd call it a known issue on Windows with no solution other than using SPARK_LOCAL_HOSTNAME or SPARK_LOCAL_IP before starting pyshell to *work around it*. I wish I had access to Win7 to work on it longer and find a decent solution (not a workaround). If you have Scala REPL, execute
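
The workaround amounts to setting one of those variables before launching the shell, e.g. (values are placeholders):

    rem Windows cmd, before starting the PySpark shell
    set SPARK_LOCAL_HOSTNAME=localhost
    bin\pyspark.cmd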

Re: dfs.blocksize is not applicable to some cases

2015-11-30 Thread Ted Yu
I am not an expert in Parquet. Looking at PARQUET-166, it seems that parquet.block.size should be lower than dfs.blocksize. Have you tried Spark 1.5.2 to see if the problem persists? Cheers On Mon, Nov 30, 2015 at 1:55 AM, Jung wrote: > Hello, > I use Spark 1.4.1 and Hadoop
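
A hedged illustration of keeping the Parquet row group size at or below the HDFS block size when writing (df and the output path are hypothetical):

    // Keep the Parquet row group (parquet.block.size) no larger than dfs.blocksize
    sc.hadoopConfiguration.setInt("dfs.blocksize", 512 * 1024 * 1024)
    sc.hadoopConfiguration.setInt("parquet.block.size", 128 * 1024 * 1024)

    df.write.parquet("/user/hive/warehouse/partition_test")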

Help with type check

2015-11-30 Thread Eyal Sharon
Hi, I have a problem inferring where the type bug is here. I have this code fragment; it parses JSON to Array[Double]: *val muCouch = { val e = input.filter( _.id=="mu")(0).content() val b = e.getArray("feature_mean") for (i <- 0 to e.getInt("features") ) yield

Re: Spark on yarn vs spark standalone

2015-11-30 Thread Jacek Laskowski
Hi Mark, I said I've only managed to develop a limited understanding of how Spark works in the different deploy modes ;-) But somehow I thought that cluster mode in Spark standalone is not supported. I think I've seen a JIRA with a change quite recently where that, or something similar, was said. Can't

Re: Spark on yarn vs spark standalone

2015-11-30 Thread Mark Hamstra
Standalone mode also supports running the driver on a cluster node. See "cluster" mode in http://spark.apache.org/docs/latest/spark-standalone.html#launching-spark-applications . Also, http://spark.apache.org/docs/latest/spark-standalone.html#high-availability On Mon, Nov 30, 2015 at 9:47 AM,
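
For reference, submitting in standalone cluster mode looks roughly like this (host, class, and jar path are placeholders):

    ./bin/spark-submit \
      --master spark://<master-host>:7077 \
      --deploy-mode cluster \
      --class com.example.MyApp \
      /path/to/my-app.jar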

Re: In yarn-client mode, is it the driver or application master that issue commands to executors?

2015-11-30 Thread Jacek Laskowski
On Fri, Nov 27, 2015 at 12:12 PM, Nisrina Luthfiyati < nisrina.luthfiy...@gmail.com> wrote: > Hi all, > I'm trying to understand how yarn-client mode works and found these two > diagrams: > > > > > In the first diagram, it looks like the driver running in client directly > communicates with

Re: spark.cleaner.ttl for 1.4.1

2015-11-30 Thread Josh Rosen
AFAIK the ContextCleaner should perform all of the cleaning *as long as garbage collection is performed frequently enough on the driver*. See https://issues.apache.org/jira/browse/SPARK-7689 and https://github.com/apache/spark/pull/6220#issuecomment-102950055 for discussion of this technicality.

Re: question about combining small parquet files

2015-11-30 Thread Nezih Yigitbasi
This looks interesting, thanks Ruslan. But compaction with Hive is as simple as an insert overwrite statement, since Hive supports CombineFileInputFormat; is it possible to do the same with Spark? On Thu, Nov 26, 2015 at 9:47 AM, Ruslan Dautkhanov wrote: > An interesting

Re: Kafka Direct does not recover automatically when the Kafka Stream gets messed up?

2015-11-30 Thread swetha kasireddy
Hi Cody, What if the offsets that are tracked are not present in Kafka? How do I skip those offsets and go to the next offset? Also, would specifying rebalance.backoff.ms be of any help? Thanks, Swetha On Thu, Nov 12, 2015 at 9:07 AM, Cody Koeninger wrote: > To be blunt, if

Re: Kafka Direct does not recover automatically when the Kafka Stream gets messed up?

2015-11-30 Thread Cody Koeninger
You'd need to get the earliest or latest available offsets from kafka, whichever is most appropriate for your situation. The KafkaRDD will use the value of refresh.leader.backoff.ms, so you can try adjusting that to get a longer sleep before retrying the task. On Mon, Nov 30, 2015 at 1:50 PM,

spark rdd grouping

2015-11-30 Thread Rajat Kumar
Hi, I have a JavaPairRDD rdd1. I want to group rdd1 by key but preserve the partitions of the original RDD to avoid a shuffle, since I know all identical keys are already in the same partition. The PairRDD is basically constructed using the Kafka streaming low-level consumer, which has all records with
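
One way to do this, sketched under the assumption that every key really does live entirely in one partition, is to group inside each partition with mapPartitions instead of groupByKey (illustrative Scala rather than the original JavaPairRDD code):

    // rdd1: RDD[(String, String)] where all records sharing a key sit in one partition
    val grouped = rdd1.mapPartitions({ iter =>
      iter.toSeq
        .groupBy { case (key, _) => key }
        .iterator
        .map { case (key, kvs) => (key, kvs.map { case (_, value) => value }) }
    }, preservesPartitioning = true)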

Re: Spark 1.5.2 + Hive 1.0.0 in Amazon EMR 4.2.0

2015-11-30 Thread Ted Yu
Please see the following, which went into Hive 1.1: HIVE-8839 Support "alter table .. add/replace columns cascade" FYI On Mon, Nov 30, 2015 at 2:14 PM, Daniel Lopes wrote: > Hi, > > I get this error when trying to write Spark DataFrame to Hive Table Stored > as TextFile

Re: Grid search with Random Forest

2015-11-30 Thread Joseph Bradley
It should work with 1.5+. On Thu, Nov 26, 2015 at 12:53 PM, Ndjido Ardo Bar wrote: > > Hi folks, > > Does anyone know whether the Grid Search capability is enabled since the > issue spark-9011 of version 1.4.0 ? I'm getting the "rawPredictionCol > column doesn't exist" when

RE: capture video with spark streaming

2015-11-30 Thread Young, Matthew T
Unless it’s a network camera with the ability to request specific frame numbers for read, the answer is that you will just read from the camera as you normally would without Spark, inside of a foreachRDD(), and parallelize the result out for processing once you have it in a collection in the

Spark Streaming Specify Kafka Partition

2015-11-30 Thread Alan Braithwaite
Is there any mechanism in the kafka streaming source to specify the exact partition id that we want a streaming job to consume from? If not, is there a workaround besides writing our own custom receiver? Thanks, - Alan
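
One hedged workaround with the direct (receiver-less) API is to pass fromOffsets for only the partition you care about (broker, topic, and offset below are placeholders):

    import kafka.common.TopicAndPartition
    import kafka.message.MessageAndMetadata
    import kafka.serializer.StringDecoder
    import org.apache.spark.streaming.kafka.KafkaUtils

    val kafkaParams = Map("metadata.broker.list" -> "broker1:9092") // placeholder
    // Consume only partition 3 of "mytopic", starting at offset 0
    val fromOffsets = Map(TopicAndPartition("mytopic", 3) -> 0L)

    val stream = KafkaUtils.createDirectStream[
        String, String, StringDecoder, StringDecoder, (String, String)](
      ssc, kafkaParams, fromOffsets,
      (mmd: MessageAndMetadata[String, String]) => (mmd.key, mmd.message))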

ORC predicate pushdown in HQL

2015-11-30 Thread Tejas Patil
Hi, I am trying to use predicate pushdown in ORC and I was expecting that it would be used when one queries an ORC table in Hive. But based on the query plan generated, I don't see that happening. Am I missing some configuration, or is this not supported at all? PS: I have tried the
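
One thing worth checking (a hedged suggestion, not a confirmed fix for this thread): the Spark SQL ORC pushdown flag is off by default in this release line, so it may need to be enabled explicitly before pushed filters show up in the plan:

    // spark.sql.orc.filterPushdown defaults to false in Spark 1.4/1.5
    sqlContext.setConf("spark.sql.orc.filterPushdown", "true")

    val df = sqlContext.sql("SELECT * FROM orc_table WHERE id = 42") // hypothetical table
    df.explain(true) // inspect the physical plan for pushed-down filters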

Re: ORC predicate pushdown in HQL

2015-11-30 Thread Ted Yu
This is related: SPARK-11087 FYI On Mon, Nov 30, 2015 at 5:56 PM, Tejas Patil wrote: > Hi, > I am trying to use predicate pushdown in ORC and I was expecting that it > would be used when one tries to query an ORC table in Hive. But based on > the query plan generated,

Re: Grid search with Random Forest

2015-11-30 Thread Ndjido Ardo BAR
Hi Joseph, Yes, Random Forest supports Grid Search on Spark 1.5+. But I'm getting a "rawPredictionCol field does not exist" exception on Spark 1.5.2 for the Gradient Boosted Trees classifier. Ardo On Tue, 1 Dec 2015 at 01:34, Joseph Bradley wrote: > It should work with

Re: dfs.blocksize is not applicable to some cases

2015-11-30 Thread Jung
Yes, I can reproduce it in Spark 1.5.2. These are the results. 1. First case (1 block): 221.1 M /user/hive/warehouse/partition_test/part-r-0-b0e5ecd3-75a3-4c92-94ec-59353d08067a.gz.parquet 221.1 M

Recovery for Spark Streaming Kafka Direct in case of issues with Kafka

2015-11-30 Thread SRK
Hi, So, our Streaming Job fails with the following errors. If you see the errors below, they are all related to Kafka losing offsets and OffsetOutOfRangeException. What are the options we have other than fixing Kafka? We would like to do something like the following. How can we achieve 1 and 2

Re: Kafka Direct does not recover automatically when the Kafka Stream gets messed up?

2015-11-30 Thread swetha kasireddy
So, our Streaming Job fails with the following errors. If you see the errors (highlighted in blue below), they are all related to Kafka losing offsets and OffsetOutOfRangeException. What are the options we have other than fixing Kafka? We would like to do something like the following. How can we

capture video with spark streaming

2015-11-30 Thread Lavallen Pablo
Hello! Can anyone please guide me on how to capture video from a camera with Spark Streaming? Any article or book to recommend? Thank you

Spark streaming job hangs

2015-11-30 Thread Cassa L
Hi, I am reading data from Kafka into Spark. It runs fine for some time but then hangs forever with the following output. I don't see any errors in the logs. How do I debug this? 2015-12-01 06:04:30,697 [dag-scheduler-event-loop] INFO (Logging.scala:59) - Adding task set 19.0 with 4 tasks 2015-12-01

Re: question about combining small parquet files

2015-11-30 Thread Sabarish Sasidharan
You could use the number of input files to determine the number of output partitions. This assumes your input file sizes are deterministic. Otherwise, you could also persist the RDD and then determine its size using the APIs. Regards Sab On 26-Nov-2015 11:13 pm, "Nezih Yigitbasi"
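
A rough sketch of the first suggestion, assuming roughly uniform input file sizes and a hypothetical target of packing about five small input files into each output partition:

    import org.apache.hadoop.fs.{FileSystem, Path}

    val inputPath = "/data/small_files" // hypothetical path
    val fs = FileSystem.get(sc.hadoopConfiguration)
    val numInputFiles = fs.listStatus(new Path(inputPath)).count(_.isFile)

    // Derive the number of output partitions from the number of input files
    val numOutputPartitions = math.max(1, numInputFiles / 5)

    sqlContext.read.parquet(inputPath)
      .coalesce(numOutputPartitions)
      .write.parquet("/data/compacted") // hypothetical output path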

SparkPi running slower with more cores on each worker

2015-11-30 Thread yiskylee
Hi, I was running the SparkPi example code and studying the performance difference when using different numbers of cores per worker. I change the number of cores by using start-slave.sh -c CORES on the worker machine for distributed computation. I also use spark-submit --master local[CORES] for the

Re: Grid search with Random Forest

2015-11-30 Thread Benjamin Fradet
Hi Ndjido, This is because GBTClassifier doesn't yet have a rawPredictionCol like the RandomForestClassifier has. Cf: http://spark.apache.org/docs/latest/ml-ensembles.html#output-columns-predictions-1 On 1 Dec 2015 3:57 a.m., "Ndjido Ardo BAR" wrote: > Hi Joseph, > > Yes

Re: Grid search with Random Forest

2015-11-30 Thread Ndjido Ardo BAR
Hi Benjamin, Thanks, the documentation you sent is clear. Is there any other way to perform a Grid Search with GBT? Ndjido On Tue, 1 Dec 2015 at 08:32, Benjamin Fradet wrote: > Hi Ndjido, > > This is because GBTClassifier doesn't yet have a rawPredictionCol like >
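
One possible workaround (a sketch only, not verified against 1.5.2) is to drive the grid search with an evaluator that needs only the prediction column rather than rawPredictionCol:

    import org.apache.spark.ml.classification.GBTClassifier
    import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator
    import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}

    val gbt = new GBTClassifier().setLabelCol("label").setFeaturesCol("features")

    val grid = new ParamGridBuilder()
      .addGrid(gbt.maxDepth, Array(3, 5))
      .addGrid(gbt.maxIter, Array(10, 20))
      .build()

    // Uses predictionCol, so it does not touch the missing rawPredictionCol
    val evaluator = new MulticlassClassificationEvaluator()
      .setLabelCol("label")
      .setMetricName("precision")

    val cv = new CrossValidator()
      .setEstimator(gbt)
      .setEvaluator(evaluator)
      .setEstimatorParamMaps(grid)
      .setNumFolds(3)

    val model = cv.fit(trainingData) // trainingData: a DataFrame with label/features columns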

Re: Unable to use "Batch Start Time" on worker nodes.

2015-11-30 Thread Abhishek Anand
Thanks TD !! I think this should solve my purpose. On Sun, Nov 29, 2015 at 6:17 PM, Tathagata Das wrote: > You can get the batch start (the expected, not the exact time when the > jobs are submitted) from DStream operation "transform". There is a version > of transform
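
The variant in question is the transform overload that also receives the batch Time; a small sketch, assuming an existing DStream called stream:

    import org.apache.spark.streaming.Time

    // Tag every record with the (scheduled) batch start time in milliseconds
    val withBatchTime = stream.transform { (rdd, batchTime: Time) =>
      rdd.map(record => (batchTime.milliseconds, record))
    }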

Re: How to work with a joined rdd in pyspark?

2015-11-30 Thread arnalone
Ahhh, I get it, thanks!! I did not know that we can use a "double index". I used x[0] to point to shows, x[1][0] to point to channels, and x[1][1] to point to views. I feel terribly noob. Thank you all :) -- View this message in context:

Persisting closed sessions to external store inside updateStateByKey

2015-11-30 Thread Anthony Brew
Hi, I'm working on storing what is effectively a session once it receives its close event, using Spark Streaming's updateStateByKey. private static Function2 COLLECTED_SESSION = (newItems, current) -> { SessionUpdate returnValue = current.orNull();
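
For comparison, a hedged Scala sketch of the same idea (the original thread uses the Java API): keep a session in state until its close event arrives, persist it to the external store, then drop it from the state by returning None.

    // Hypothetical session type and sink; checkpointing must be enabled for stateful ops
    case class Session(events: Seq[String], closed: Boolean)
    def saveToExternalStore(s: Session): Unit = ??? // placeholder for the real sink

    val updateFunc = (newEvents: Seq[String], state: Option[Session]) => {
      val current = state.getOrElse(Session(Seq.empty, closed = false))
      val updated = current.copy(
        events = current.events ++ newEvents,
        closed = current.closed || newEvents.contains("close"))
      if (updated.closed) { saveToExternalStore(updated); None } // evict closed sessions
      else Some(updated)
    }

    // eventsByKey: DStream[(String, String)] of (sessionId, event)
    val sessions = eventsByKey.updateStateByKey(updateFunc)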

Spark DStream Data stored out of order in Cassandra

2015-11-30 Thread Prateek .
Hi, I have a time-critical Spark application which takes sensor data from a Kafka stream, stores it in a case class, applies transformations, and then stores it in a Cassandra schema. The data needs to be stored in the schema in FIFO order. The order is maintained in the Kafka queue, but I am observing,

No documentation for how to write custom Transformer in ml pipeline ?

2015-11-30 Thread Jeff Zhang
Although writing a custom UnaryTransformer is not difficult, writing a non-UnaryTransformer is a little tricky (you have to check the source code). And I don't find any documentation about how to write a custom Transformer in an ml pipeline, even though writing a custom Transformer is a very basic requirement. Is
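
For reference, a bare-bones non-Unary Transformer against the 1.5 ml API might look roughly like this (a sketch pieced together from the source, with hypothetical column names, not official documentation):

    import org.apache.spark.ml.Transformer
    import org.apache.spark.ml.param.ParamMap
    import org.apache.spark.ml.util.Identifiable
    import org.apache.spark.sql.DataFrame
    import org.apache.spark.sql.functions.col
    import org.apache.spark.sql.types.{DoubleType, StructField, StructType}

    // Adds a column "ab_sum" holding the sum of two existing numeric columns "a" and "b"
    class SumColumns(override val uid: String) extends Transformer {
      def this() = this(Identifiable.randomUID("sumColumns"))

      override def transform(df: DataFrame): DataFrame =
        df.withColumn("ab_sum", col("a") + col("b"))

      override def transformSchema(schema: StructType): StructType =
        StructType(schema.fields :+ StructField("ab_sum", DoubleType, nullable = true))

      override def copy(extra: ParamMap): SumColumns = defaultCopy(extra)
    }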

Re: PySpark failing on a mid-sized broadcast

2015-11-30 Thread ameyc
BTW, my spark.python.worker.reuse setting is set to "true". -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/PySpark-failing-on-a-mid-sized-broadcast-tp25520p25521.html Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: Spark DStream Data stored out of order in Cassandra

2015-11-30 Thread Gerard Maas
Spark Streaming will consume and process data in parallel, so the order of the output will depend not only on the order of the input but also on the time it takes each task to process. Different options, like repartitions, sorts, and shuffles at the Spark level, will also affect ordering, so the