Re: Spark - Eclipse IDE - Maven

2015-07-24 Thread Anas Sherwani
Can you explain the issue? Further, in which language do you want to code? There are number of blogs to create a simple a maven project in Eclipse, and they are pretty simple and straightforward. -- View this message in context:

Re: How to restart Twitter spark stream

2015-07-24 Thread Akhil Das
Yes, that is correct, sorry for confusing you. But i guess this is what you are looking for, let me know if that doesn't help: val filtered_statuses = stream.transform(rdd ={ //Instead of hardcoding, you can fetch these from a MySQL or a file or whatever val sampleHashTags =

Re: Resume checkpoint failed with Spark Streaming Kafka via createDirectStream under heavy reprocessing

2015-07-24 Thread Nicolas Phung
Hello, I manage to read all my data back with skipping offset that contains a corrupt message. I have one more question regarding messageHandler method vs dstream.foreachRDD.map vs dstream.map.foreachRDD best practices. I'm using a function to read the serialized message from kafka and convert it

Re: Partition parquet data by ENUM column

2015-07-24 Thread Cheng Lian
Could you please provide the full stack trace of the exception? And what's the Git commit hash of the version you were using? Cheng On 7/24/15 6:37 AM, Jerry Lam wrote: Hi Cheng, I ran into issues related to ENUM when I tried to use Filter push down. I'm using Spark 1.5.0 (which contains

RE: SparkR Supported Types - Please add bigint

2015-07-24 Thread Sun, Rui
printSchema calls StructField. buildFormattedString() to output schema information. buildFormattedString() use DataType.typeName as string representation of the data type. LongType. typeName = long LongType.simpleString = bigint I am not sure about the difference of these two type name

Re: Using Dataframe write with newHdoopApi

2015-07-24 Thread Akhil Das
1. Yes you can, have a look at the EsOutputFormat https://github.com/elastic/elasticsearch-hadoop/blob/master/mr/src/main/java/org/elasticsearch/hadoop/mr/EsOutputFormat.java 2. I'm not quiet sure about that, you could ask the ES people about it. Thanks Best Regards On Thu, Jul 23, 2015 at 11:58

[POWERED BY] Please add our organization

2015-07-24 Thread Baxter, James
Hi there, Details below. Organisation: Woodside URL: http://www.woodside.com.au/ Components: Spark Core 1.31/1/41 and Spark SQL Use Case: Spark is being used for near real time predictive analysis over millions of equipment sensor readings and within our Data Integration processes for data

Re: What if request cores are not satisfied

2015-07-24 Thread Akhil Das
I guess it would wait for sometime and throw up something like this: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient memory Thanks Best Regards On Thu, Jul 23, 2015 at 7:53 AM, bit1...@163.com bit1...@163.com wrote:

Re: Spark - Eclipse IDE - Maven

2015-07-24 Thread Siva Reddy
I want to program in scala for spark. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Eclipse-IDE-Maven-tp23977p23981.html Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: Performance issue with Spak's foreachpartition method

2015-07-24 Thread Bagavath
Try using insert instead of merge. Typically we use insert append to do bulk inserts to oracle. On Thu, Jul 23, 2015 at 1:12 AM, diplomatic Guru diplomaticg...@gmail.com wrote: Thanks Robin for your reply. I'm pretty sure that writing to Oracle is taking longer as when writing to HDFS is

Encryption on RDDs or in-memory on Apache Spark

2015-07-24 Thread IASIB1
I am currently working on the latest version of Apache Spark (1.4.1), pre-built package for Hadoop 2.6+. Is there any feature in Spark/Hadoop to encrypt RDDs or in-memory (similarly to Altibase's HDB: http://altibase.com/in-memory-database-computing-solutions/security/

Re: writing/reading multiple Parquet files: Failed to merge incompatible data types StringType and StructType

2015-07-24 Thread Cheng Lian
I don’t think this is a bug either. For an empty JSON array |[]|, there’s simply no way to infer its actual data type, and in this case Spark SQL just tries to fill in the “safest” type, which is |StringType|, because basically you can cast any data type to |StringType|. In general, schema

Re: problems running Spark on a firewalled remote YARN cluster via SOCKS proxy

2015-07-24 Thread Rok Roskar
Hi Akhil, the namenode is definitely configured correctly, otherwise the job would not start at all. It registers with YARN and starts up, but once the nodes try to communicate to each other it fails. Note that a hadoop MR job using the identical configuration executes without any problems. The

Re: Partition parquet data by ENUM column

2015-07-24 Thread Cheng Lian
Your guess is partly right. Firstly, Spark SQL doesn’t have an equivalent data type to Parquet BINARY (ENUM), and always falls back to normal StringType. So in your case, Spark SQL just see a StringType, which maps to Parquet BINARY (UTF8), but the underlying data type is BINARY (ENUM).

How do I query a DSE Cassandra table using Spark Job Server

2015-07-24 Thread rsaggere
I am trying to use the spark job server to run a query against a DSE Cassandra table. Can any one please help me with instructions? I am *NOT* a Java person. What I have done so far: 1. I have a single node DataStax Cassandra ver 4.6 running on a Centos 6.6 VM 2. Tested dse spark to query tables

Spark Accumulator Issue - java.io.IOException: java.lang.StackOverflowError

2015-07-24 Thread Jadhav Shweta
Hi, I am trying one transformation by calling scala method this scala method returns MutableList[AvroObject] def processRecords(id: String, list1: Iterable[(String, GenericRecord)]): scala.collection.mutable.MutableList[AvroObject]  Hence, the output of transaformation is

ERROR SparkUI: Failed to bind SparkUI java.net.BindException: Address already in use: Service 'SparkUI' failed after 16 retries!

2015-07-24 Thread Joji John
HI, I am getting this error for some of spark applications. I have multiple spark applications running in parallel. Is there a limit in the number of spark applications that I can run in parallel. ERROR SparkUI: Failed to bind SparkUI java.net.BindException: Address already in use: Service

Re: Facing problem in Oracle VM Virtual Box

2015-07-24 Thread Ajay Singal
Hi Chintan, This is more of Oracle VirtualBox virtualization issue than Spark issue. VT-x is hardware assisted virtualization and it is required by Oracle VirtualBox for all (64 bits) guests. The error message indicates that either your processor does not support VT-x (but your VM is

Spark: configuration file 'metrics.properties'

2015-07-24 Thread allonsy
Hi, Spark's configuration file (useful to retrieve metrics), namely //conf/metrics.properties/, states what follows: Within an instance, a source specifies a particular set of grouped metrics. there are two kinds of sources: 1. Spark internal sources, like /MasterSource/, /WorkerSource/, etc,

Re: ERROR SparkUI: Failed to bind SparkUI java.net.BindException: Address already in use: Service 'SparkUI' failed after 16 retries!

2015-07-24 Thread Ajay Singal
Hi Joji, To my knowledge, Spark does not offer any such function. I agree, defining a function to find an open (random) port would be a good option. However, in order to invoke the corresponding SparkUI one needs to know this port number. Thanks, Ajay On Fri, Jul 24, 2015 at 10:19 AM, Joji

New consumer - offset one gets in poll is not offset one is supposed to commit

2015-07-24 Thread Stevo Slavić
Hello Apache Kafka community, Say there is only one topic with single partition and a single message on it. Result of calling a poll with new consumer will return ConsumerRecord for that message and it will have offset of 0. After processing message, current KafkaConsumer implementation expects

Re: What else is need to setup native support of BLAS/LAPACK with Spark?

2015-07-24 Thread Arun Ahuja
Thanks for the additional info, I tried to follow that and went ahead and directly added netlib to my application POM/JAR - that should be sufficient to make it work? And that is at least definietely on the executor class path? Still got the same warning, so not sure where else to take it. Thanks

Re: ERROR SparkUI: Failed to bind SparkUI java.net.BindException: Address already in use: Service 'SparkUI' failed after 16 retries!

2015-07-24 Thread Joji John
Thanks Ajay. The way we wrote our spark application is that we have a generic python code, multiple instances of which can be called using different parameters. Does spark offer any function to bind it to a available port? I guess the other option is to define a function to find open port

Re: New consumer - offset one gets in poll is not offset one is supposed to commit

2015-07-24 Thread Cody Koeninger
Well... there are only 2 hard problems in computer science: naming things, cache invalidation, and off-by-one errors. The direct stream implementation isn't asking you to commit anything. It's asking you to provide a starting point for the stream on startup. Because offset ranges are inclusive

Parquet writing gets progressively slower

2015-07-24 Thread Michael Kelly
Hi, We are converting some csv log files to parquet but the job is getting progressively slower the more files we add to the parquet folder. The parquet files are being written to s3, we are using a spark standalone cluster running on ec2 and the spark version is 1.4.1. The parquet files are

Re: spark-submit and spark-shell behaviors mismatch.

2015-07-24 Thread Yana Kadiyska
that is pretty odd -- toMap not being there would be from scala...but what is even weirder is that toMap is positively executed on the driver machine, which is the same when you do spark-shell and spark-submit...does it work if you run with --master local[*]? Also, you can try to put a set -x in

Re: How to restart Twitter spark stream

2015-07-24 Thread Zoran Jeremic
Hi Akhil, That's exactly what I needed. You saved my day :) Thanks a lot, Best, Zoran On Fri, Jul 24, 2015 at 12:28 AM, Akhil Das ak...@sigmoidanalytics.com wrote: Yes, that is correct, sorry for confusing you. But i guess this is what you are looking for, let me know if that doesn't help:

Re: Programmatically launch several hundred Spark Streams in parallel

2015-07-24 Thread Brandon White
THanks. Sorry the last section was supposed be streams.par.foreach { nameAndStream = nameAndStream._2.foreachRDD { rdd = df = sqlContext.jsonRDD(rdd) df.insertInto(stream._1) } } ssc.start() On Fri, Jul 24, 2015 at 10:39 AM, Dean Wampler deanwamp...@gmail.com wrote: You don't

How to maintain Windows of data along with maintaining session state using UpdateStateByKey

2015-07-24 Thread swetha
Hi, We have a requirement to maintain the user session state and to maintain/update the metrics for minute, day and hour granularities for a user session in our Streaming job. How to maintain the metrics over a window along with maintaining the session state using updateStateByKey in Spark

Re: New consumer - offset one gets in poll is not offset one is supposed to commit

2015-07-24 Thread Stevo Slavić
Sorry, wrong ML. On Fri, Jul 24, 2015 at 7:07 PM, Cody Koeninger c...@koeninger.org wrote: Are you intending to be mailing the spark list or the kafka list? On Fri, Jul 24, 2015 at 11:56 AM, Stevo Slavić ssla...@gmail.com wrote: Hello Cody, I'm not sure we're talking about same thing.

Re: Programmatically launch several hundred Spark Streams in parallel

2015-07-24 Thread Dean Wampler
You don't need the par (parallel) versions of the Scala collections, actually, Recall that you are building a pipeline in the driver, but it doesn't start running cluster tasks until ssc.start() is called, at which point Spark will figure out the task parallelism. In fact, you might as well do the

Re: New consumer - offset one gets in poll is not offset one is supposed to commit

2015-07-24 Thread Stevo Slavić
Hello Cody, I'm not sure we're talking about same thing. Since you're mentioning streams I guess you were referring to current high level consumer, while I'm talking about new yet unreleased high level consumer. Poll I was referring to is

Re: New consumer - offset one gets in poll is not offset one is supposed to commit

2015-07-24 Thread Cody Koeninger
Are you intending to be mailing the spark list or the kafka list? On Fri, Jul 24, 2015 at 11:56 AM, Stevo Slavić ssla...@gmail.com wrote: Hello Cody, I'm not sure we're talking about same thing. Since you're mentioning streams I guess you were referring to current high level consumer,

Re: Writing binary files in Spark

2015-07-24 Thread Oren Shpigel
Sorry, I didn't mention I'm using the Python API, which doesn't have the saveAsObjectFiles method. Is there any alternative from Python? And also, I want to write the raw bytes of my object into files on disk, and not using some Serialization format to be read back into Spark. Is it possible? Any

Re: Mesos + Spark

2015-07-24 Thread boci
Thanks, but something is not clear... I have the mesos cluster. - I want to submit my application and scheduled with chronos. - For cluster mode I need a dispatcher, this is another container (machine in the real world)? What will this do? It's needed when I using chronos? - How can I access to my

Re: Mesos + Spark

2015-07-24 Thread Dean Wampler
When running Spark in Mesos cluster mode, the driver program runs in one of the cluster nodes, like the other Spark processes that are spawned. You won't need a special node for this purpose. I'm not very familiar with Chronos, but its UI or the regular Mesos UI should show you where the driver is

Re: ERROR SparkUI: Failed to bind SparkUI java.net.BindException: Address already in use: Service 'SparkUI' failed after 16 retries!

2015-07-24 Thread Ajay Singal
Hi Jodi, I guess, there is no hard limit on number of Spark applications running in parallel. However, you need to ensure that you do not use the same (e.g., default) port numbers for each application. In your specific case, for example, if you try using default SparkUI port 4040 for more than

Broadcast HashMap much slower than Array

2015-07-24 Thread huanglr
Hi, When I try to broadcast a hashmap, it runs much slower than the same data broadcast in array. It hangs in SparkContext: Created broadcast 0 for few secondes (30s), while an array does not. The broadcast dataset is about 1G. best! huanglr

50% performance decrease when using local file vs hdfs

2015-07-24 Thread Tom Hubregtsen
Hi, When running two experiments with the same application, we see a 50% performance difference between using HDFS and files on disk, both using the textFile/saveAsTextFile call. Almost all performance loss is in Stage 1. Input (in Stage 0): The file is read in using val input =

spark classpath issue duplicate jar with diff versions

2015-07-24 Thread Shushant Arora
Hi I am running a spark stream app on yarn and using apache httpasyncclient 4.1 This client Jar internally has a dependency on jar http-core4.4.1.jar. This jar's( http-core .jar) old version i.e. httpcore-4.2.5.jar is also present in class path and has higher priority in classpath(coming earlier

Fwd: want to contribute to apache spark

2015-07-24 Thread shashank kapoor
Hi guys, I am new to apache spark, I wanted to start contributing to this project. But before that I need to understand the basic coding flow here. I read How to contribute to apache spark but I couldn't find any way to start reading the code and start understanding Code Flow. Can anyone tell me

Small-cluster deployment modes

2015-07-24 Thread Edmon Begoli
Hey folks, I am wanting to setup a single machine or a small cluster machine to run our Spark based exploration lab. Does anyone have suggestions or metrics on feasibility of running Spark standalone on a good size RAM machine (64GB) with SSDs without resource manager. I expect on or two users

Re: want to contribute to apache spark

2015-07-24 Thread Joseph Bradley
Please checkout the Spark source from Github, and look here: https://github.com/apache/spark/tree/master/examples/src/main On Fri, Jul 24, 2015 at 8:43 PM, Chintan Bhatt chintanbhatt...@charusat.ac.in wrote: Hi. Can I know how to get such folder/code for spark implementation? On Sat,

Re: Resume checkpoint failed with Spark Streaming Kafka via createDirectStream under heavy reprocessing

2015-07-24 Thread Cody Koeninger
It's really a question of whether you need access to the MessageAndMetadata, or just the key / value from the message. If you just need the key/value, dstream map is fine. In your case, since you need to be able to control a possible failure when deserializing the message from the

Performance questions regarding Spark 1.3 standalone mode

2015-07-24 Thread Khaled Ammar
Hi all, I have a standalone spark cluster setup on EC2 machines. I did the setup manually without the ec2 scripts. I have two questions about Spark/GraphX performance: 1) When I run the PageRank example, the storage tab does not show that all RDDs are cached. Only one RDD is 100% cached, but the

getting Error while Running SparkPi program

2015-07-24 Thread Jeetendra Gangele
while running below getting the error un yarn log can anybody hit this issue ./bin/spark-submit --class org.apache.spark.examples.SparkPi --master yarn-cluster lib/spark-examples-1.4.1-hadoop2.6.0.jar 10 2015-07-24 12:06:10,846 ERROR [RMCommunicator Allocator]

Re: Mesos + Spark

2015-07-24 Thread Dean Wampler
You can certainly start jobs without Chronos, but to automatically restart finished jobs or to run jobs at specific times or periods, you'll want something like Chronos. dean Dean Wampler, Ph.D. Author: Programming Scala, 2nd Edition http://shop.oreilly.com/product/0636920033073.do (O'Reilly)

Re: want to contribute to apache spark

2015-07-24 Thread Joseph Bradley
I'd recommend starting with a few of the code examples to get a sense of Spark usage (in the examples/ folder when you check out the code). Then, you can work through the Spark methods they call, tracing as deep as needed to understand the component you are interested in. You can also find an