Re: Kafka Version Update 0.8.2 status?

2015-02-10 Thread Sean Owen
Yes, did you see the PR for SPARK-2808? https://github.com/apache/spark/pull/3631/files It requires more than just changing the version. On Tue, Feb 10, 2015 at 3:11 PM, Ted Yu yuzhih...@gmail.com wrote: Compiling Spark master branch against Kafka 0.8.2, I got: [WARNING]

Re: hadoopConfiguration for StreamingContext

2015-02-10 Thread Marc Limotte
Thanks, Akhil. I had high hopes for #2, but tried all and no luck. I was looking at the source and found something interesting. The Stack Trace (below) directs me to FileInputDStream.scala (line 141). This is version 1.1.1, btw. Line 141 has: private def fs: FileSystem = { if (fs_ ==

Re: Kafka Version Update 0.8.2 status?

2015-02-10 Thread Ted Yu
Compiling Spark master branch against Kafka 0.8.2, I got: [WARNING] /home/hbase/spark/external/kafka/src/main/scala/org/apache/spark/streaming/kafka/DirectKafkaInputDStream.scala:62: no valid targets for annotation on value ssc_ - it is discarded unused. You may specify targets with

Re: Kafka Version Update 0.8.2 status?

2015-02-10 Thread Cody Koeninger
That PR hasn't been updated since the new Kafka streaming stuff (including KafkaCluster) got merged to master; it will require more changes than what's in there currently. On Tue, Feb 10, 2015 at 9:25 AM, Sean Owen so...@cloudera.com wrote: Yes, did you see the PR for SPARK-2808?

Re: hadoopConfiguration for StreamingContext

2015-02-10 Thread Akhil Das
Try the following: 1. Set the access key and secret key in the sparkContext: ssc.sparkContext.hadoopConfiguration.set(AWS_ACCESS_KEY_ID,yourAccessKey) ssc.sparkContext.hadoopConfiguration.set(AWS_SECRET_ACCESS_KEY,yourSecretKey) 2. Set the access key and secret key in the environment before
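For reference, a minimal sketch of option 1, assuming an existing StreamingContext named ssc and s3n:// paths; the exact Hadoop configuration keys (fs.s3.* vs fs.s3n.*) depend on the filesystem scheme in use:

    // placeholders pulled from the environment purely for illustration
    val yourAccessKey = sys.env("AWS_ACCESS_KEY_ID")
    val yourSecretKey = sys.env("AWS_SECRET_ACCESS_KEY")

    // set the credentials on the SparkContext wrapped by the StreamingContext
    ssc.sparkContext.hadoopConfiguration.set("fs.s3n.awsAccessKeyId", yourAccessKey)
    ssc.sparkContext.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey", yourSecretKey)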

Bug in ElasticSearch and Spark SQL: Using SQL to query out data from JSON documents is totally wrong!

2015-02-10 Thread Aris
I'm using ElasticSearch with elasticsearch-spark-BUILD-SNAPSHOT and Spark/SparkSQL 1.2.0, from Costin Leau's advice. I want to query ElasticSearch for a bunch of JSON documents from within SparkSQL, and then use a SQL query to simply query for a column, which is actually a JSON key -- normal

org.apache.spark.SparkException: Job aborted due to stage failure: All masters are unresponsive! Giving up.

2015-02-10 Thread lakewood
Hi, I'm new to Spark. I have built a small Spark on YARN cluster, which contains 1 master (20GB RAM, 8 cores) and 3 workers (4GB RAM, 4 cores each). When trying to run the command sc.parallelize(1 to 1000).count() through $SPARK_HOME/bin/spark-shell, sometimes the command can submit a job successfully, sometimes

Re: [spark sql] add file doesn't work

2015-02-10 Thread wangzhenhua (G)
[Additional info] I was using the master branch of 9 Feb 2015, the latest commit in git info is: commit 0793ee1b4dea1f4b0df749e8ad7c1ab70b512faf Author: Sandy Ryza sa...@cloudera.com Date: Mon Feb 9 10:12:12 2015 + SPARK-2149. [MLLIB] Univariate kernel density estimation Author:

Re: SparkSQL + Tableau Connector

2015-02-10 Thread Todd Nist
Hi Silvio, So the Initial SQL is executing now; I did not have the *, but added that and it worked fine. FWIW the * is not needed for the parquet files: create temporary table test using org.apache.spark.sql.json options (path '/data/out/*') ; cache table test; select count(1) from test;

Why does spark write huge file into temporary local disk even without on-disk persist or checkpoint?

2015-02-10 Thread Peng Cheng
I'm running a small job on a cluster with 15G of mem and 8G of disk per machine. The job always get into a deadlock where the last error message is: java.io.IOException: No space left on device at java.io.FileOutputStream.writeBytes(Native Method) at

Something about the cluster upgrade

2015-02-10 Thread qiaou
Hi, I need to upgrade my Spark cluster from version 1.1.0 to 1.2.1. Is there a convenient way to do this, something like ./start-dfs.sh -upgrade in Hadoop? Best wishes, THX -- qiaou Sent using Sparrow (http://www.sparrowmailapp.com/?sig)

RE: hadoopConfiguration for StreamingContext

2015-02-10 Thread Andrew Lee
It looks like this is related to the underlying Hadoop configuration. Try to deploy the Hadoop configuration with your job with --files and --driver-class-path, or to the default /etc/hadoop/conf core-site.xml. If that is not an option (depending on how your Hadoop cluster is set up), then hard
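A hedged sketch of that kind of submit invocation (the class name, jar, and paths are illustrative, not from the original thread):

    spark-submit \
      --master yarn-cluster \
      --class com.example.MyStreamingApp \
      --files /etc/hadoop/conf/core-site.xml \
      --driver-class-path /etc/hadoop/conf \
      my-streaming-app.jar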

Re: Can we execute create table and load data commands against Hive inside HiveContext?

2015-02-10 Thread Yin Huai
org.apache.hadoop.hive.ql.security.authorization.plugin.sqlstd.SQLStdConfOnlyAuthorizerFactory was introduced in Hive 0.14 and Spark SQL only supports Hive 0.12 and 0.13.1. Can you change the setting of hive.security.authorization.manager to someone accepted by 0.12 or 0.13.1? On Thu, Feb 5, 2015

Re: Resource allocation in yarn-cluster mode

2015-02-10 Thread Sandy Ryza
Hi Zsolt, spark.executor.memory, spark.executor.cores, and spark.executor.instances are only honored when launching through spark-submit. Marcelo is working on a Spark launcher (SPARK-4924) that will enable using these programmatically. That's correct that the error comes up when

Re: How to broadcast a variable read from a file in yarn-cluster mode?

2015-02-10 Thread Sandy Ryza
Is the SparkContext you're using the same one that the StreamingContext wraps? If not, I don't think using two is supported. -Sandy On Tue, Feb 10, 2015 at 9:58 AM, Jon Gregg jonrgr...@gmail.com wrote: I'm still getting an error. Here's my code, which works successfully when tested using

Re: Exception when trying to use EShadoop connector and writing rdd to ES

2015-02-10 Thread Costin Leau
First off, I'd recommend using the latest es-hadoop beta (2.1.0.Beta3) or, even better, the dev build [1]. Second, use the native Java/Scala API [2], since the configuration and performance are both easier. Third, when you are using JSON input, tell es-hadoop/Spark that; the connector can work

Re: Exception when trying to use EShadoop connector and writing rdd to ES

2015-02-10 Thread shahid ashraf
Hi Costin, I upgraded the es-hadoop connector, and at this point I can't use Scala, but I am still getting the same error. On Tue, Feb 10, 2015 at 10:34 PM, Costin Leau costin.l...@gmail.com wrote: Hi shahid, I've sent the reply to the group - for some reason I replied to your address instead of the

Re: Resource allocation in yarn-cluster mode

2015-02-10 Thread Zsolt Tóth
One more question: Is there a reason why Spark throws an error when requesting too much memory instead of capping it to the maximum value (as YARN would do by default)? Thanks! 2015-02-10 17:32 GMT+01:00 Zsolt Tóth toth.zsolt@gmail.com: Hi, I'm using Spark in yarn-cluster mode and submit

does updateStateByKey return Seq() ordered?

2015-02-10 Thread Adrian Mocanu
I was looking at the updateStateByKey documentation. It passes in a Seq of values that have the same key. I would like to know if there is any ordering to these values. My feeling is that there is no ordering, but maybe it does preserve RDD ordering. Example: RDD[ (a,2), (a,3),

Re: How to broadcast a variable read from a file in yarn-cluster mode?

2015-02-10 Thread Jon Gregg
I'm still getting an error. Here's my code, which works successfully when tested using spark-shell: val badIPs = sc.textFile(/user/sb/badfullIPs.csv).collect val badIpSet = badIPs.toSet val badIPsBC = sc.broadcast(badIpSet) The job looks OK from my end: 15/02/07 18:59:58

Re: OutofMemoryError: Java heap space

2015-02-10 Thread Kelvin Chu
Since the stacktrace shows Kryo is being used, maybe you could also try increasing spark.kryoserializer.buffer.max.mb. Hope this helps. Kelvin On Tue, Feb 10, 2015 at 1:26 AM, Akhil Das ak...@sigmoidanalytics.com wrote: You could try increasing the driver memory. Also, can you be more specific
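A minimal sketch of that setting (the value is illustrative; spark.kryoserializer.buffer.max.mb is the Spark 1.x property name):

    import org.apache.spark.{SparkConf, SparkContext}

    // raise the maximum Kryo buffer so large objects can be serialized; 256 MB is illustrative
    val conf = new SparkConf()
      .setAppName("kryo-buffer-example")
      .set("spark.kryoserializer.buffer.max.mb", "256")
    val sc = new SparkContext(conf)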

Resource allocation in yarn-cluster mode

2015-02-10 Thread Zsolt Tóth
Hi, I'm using Spark in yarn-cluster mode and submit the jobs programmatically from the client in Java. I ran into a few issues when I tried to set the resource allocation properties. 1. It looks like setting spark.executor.memory, spark.executor.cores and spark.executor.instances has no effect

Re: Spark SQL - Column name including a colon in a SELECT clause

2015-02-10 Thread Yin Huai
Can you try using backticks to quote the field name? Like `f:price`. On Tue, Feb 10, 2015 at 5:47 AM, presence2001 neil.andra...@thefilter.com wrote: Hi list, I have some data with a field name of f:price (it's actually part of a JSON structure loaded from ElasticSearch via
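A sketch of the suggestion, assuming the data has been registered as a temporary table named docs (the table name is hypothetical):

    // backticks let Spark SQL parse a column name containing a colon
    val prices = sqlContext.sql("SELECT `f:price` FROM docs")
    prices.collect().foreach(println)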

[POWERED BY] Radius Intelligence

2015-02-10 Thread Alexis Roos
Also long overdue given our usage of Spark. Radius Intelligence: URL: radius.com Description: Spark, MLlib. Using Scala, Spark and MLlib for the Radius marketing and sales intelligence platform, including data aggregation, data processing, data clustering, data analysis and predictive modeling of

Re: Executor Lost with StorageLevel.MEMORY_AND_DISK_SER

2015-02-10 Thread Marius Soutier
Unfortunately no. I just removed the persist statements to get the job to run, but now it sometimes fails with Job aborted due to stage failure: Task 162 in stage 2.1 failed 4 times, most recent failure: Lost task 162.3 in stage 2.1 (TID 1105, xxx.compute.internal):

Re: ImportError: No module named pyspark, when running pi.py

2015-02-10 Thread Felix C
Agree. PySpark would call spark-submit. Check out the command line there. --- Original Message --- From: Mohit Singh mohit1...@gmail.com Sent: February 9, 2015 11:26 PM To: Ashish Kumar ashish.ku...@innovaccer.com Cc: user@spark.apache.org Subject: Re: ImportError: No module named pyspark, when

Re: OutofMemoryError: Java heap space

2015-02-10 Thread Akhil Das
You could try increasing the driver memory. Also, can you be more specific about the data volume? Thanks Best Regards On Mon, Feb 9, 2015 at 3:30 PM, Yifan LI iamyifa...@gmail.com wrote: Hi, I just found the following errors during computation(graphx), anyone has ideas on this? thanks so

Re: Shuffle write increases in spark 1.2

2015-02-10 Thread chris
Hello, as the original message never got accepted to the mailinglist, I quote it here completely: Kevin Jung wrote Hi all, The size of shuffle write showing in spark web UI is much different when I execute same spark job on same input data(100GB) in both spark 1.1 and spark 1.2. At the

Re: Shuffle write increases in spark 1.2

2015-02-10 Thread chris
Hello, as the original message from Kevin Jung never got accepted to the mailinglist, I quote it here completely: Kevin Jung wrote Hi all, The size of shuffle write showing in spark web UI is much different when I execute same spark job on same input data(100GB) in both spark 1.1 and spark

Kafka Version Update 0.8.2 status?

2015-02-10 Thread critikaled
When can we expect the latest kafka and scala 2.11 support in spark streaming? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Kafka-Version-Update-0-8-2-status-tp21573.html Sent from the Apache Spark User List mailing list archive at Nabble.com.

How to efficiently utilize all cores?

2015-02-10 Thread matha.harika
Hi, I have a cluster setup with three slaves, 4 cores each (12 cores in total). When I try to run multiple applications, using 4 cores each, only the first application runs (with 2, 1, and 1 cores used on the corresponding slaves). Every other application goes into the WAIT state. Following the solution

Re: Custom streaming receiver slow on YARN

2015-02-10 Thread Akhil Das
Not quite sure, but one assumption would be that you don't have sufficient memory to hold that much data, so the process gets busy with garbage collection; that could be the reason it works when you set MEMORY_AND_DISK_SER_2. Thanks Best Regards On Mon, Feb 9, 2015 at 8:38 PM, Jong Wook

map distribuited matrix (rowMatrix)

2015-02-10 Thread Donbeo
I have a RowMatrix x and I would like to apply a function to each element of x. I was thinking of something like x.map(u => exp(-u*u)). How can I do something like that? -- View this message in context:
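One possible approach, sketched under the assumption that an elementwise transform of the rows is acceptable: map over the RowMatrix's underlying RDD[Vector] and wrap the result in a new RowMatrix:

    import org.apache.spark.mllib.linalg.{Vector, Vectors}
    import org.apache.spark.mllib.linalg.distributed.RowMatrix
    import scala.math.exp

    // apply u => exp(-u*u) to every element of x: RowMatrix
    def elementwise(x: RowMatrix): RowMatrix = {
      val mapped = x.rows.map { v: Vector =>
        Vectors.dense(v.toArray.map(u => exp(-u * u)))
      }
      new RowMatrix(mapped)
    }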

Re: OutofMemoryError: Java heap space

2015-02-10 Thread Yifan LI
Hi Akhil, Excuse me, I am trying a random-walk algorithm over a not-that-large graph (~1GB raw dataset, with ~5 million vertices and ~60 million edges) on a cluster of 20 machines. And, the property of each vertex in the graph is a hash map, whose size will increase dramatically

Re: How to efficiently utilize all cores?

2015-02-10 Thread Akhil Das
You can look at http://spark.apache.org/docs/1.2.0/job-scheduling.html I would go with mesos http://spark.apache.org/docs/1.2.0/running-on-mesos.html Thanks Best Regards On Tue, Feb 10, 2015 at 2:59 PM, matha.harika matha.har...@gmail.com wrote: Hi, I have a cluster setup with three slaves,

Re: Spark nature of file split

2015-02-10 Thread Brad
Have you been able to confirm this behaviour since posting? Have you tried this out on multiple workers and viewed their memory consumption? I'm new to Spark and don't have a cluster to play with at present, and want to do similar loading from NFS files. My understanding is that calls to

Re: OutofMemoryError: Java heap space

2015-02-10 Thread Yifan LI
Yes, I have read it, and am trying to find some way to do that… Thanks :) Best, Yifan LI On 10 Feb 2015, at 12:06, Akhil Das ak...@sigmoidanalytics.com wrote: Did you have a chance to look at this doc http://spark.apache.org/docs/1.2.0/tuning.html

Re: OutofMemoryError: Java heap space

2015-02-10 Thread Akhil Das
Did you have a chance to look at this doc http://spark.apache.org/docs/1.2.0/tuning.html Thanks Best Regards On Tue, Feb 10, 2015 at 4:13 PM, Yifan LI iamyifa...@gmail.com wrote: Hi Akhil, Excuse me, I am trying a random-walk algorithm over a not that large graph(~1GB raw dataset, including

Re: Kafka Version Update 0.8.2 status?

2015-02-10 Thread Sean Owen
I believe Kafka 0.8.2 is imminent for master / 1.4.0 -- see SPARK-2808. Anecdotally I have used Spark 1.2 with Kafka 0.8.2 without problem already. On Feb 10, 2015 10:53 AM, critikaled isasmani@gmail.com wrote: When can we expect the latest kafka and scala 2.11 support in spark streaming?

Spark SQL - Column name including a colon in a SELECT clause

2015-02-10 Thread presence2001
Hi list, I have some data with a field name of f:price (it's actually part of a JSON structure loaded from ElasticSearch via elasticsearch-hadoop connector, but I don't think that's significant here). I'm struggling to figure out how to express that in a Spark SQL SELECT statement without

Spark streaming job throwing ClassNotFound exception when recovering from checkpointing

2015-02-10 Thread Conor Fennell
I am getting the following error when I kill the spark driver and restart the job: 15/02/10 17:31:05 INFO CheckpointReader: Attempting to load checkpoint from file hdfs://hdfs-namenode.vagrant:9000/reporting/SampleJob$0.1.0/checkpoint-142358910.bk 15/02/10 17:31:05 WARN CheckpointReader:

Re: Re: Exception when trying to use EShadoop connector and writing rdd to ES

2015-02-10 Thread Costin Leau
What's the signature of your RDD? It looks to be a List which can't be mapped automatically to a document - you are probably thinking of a tuple or better yet a PairRDD. Convert your RDDList to a PairRDD and use that instead. This is a guess - a gist with a simple test/code would make it easier

FYI: Prof John Canny is giving a talk on Machine Learning at the limit in SF Big Analytics Meetup

2015-02-10 Thread Chester Chen
Just in case you are in San Francisco, we are having a meetup by Prof John Canny http://www.meetup.com/SF-Big-Analytics/events/220427049/ Chester

Re: Similar code in Java

2015-02-10 Thread Ted Yu
Please take a look at: examples/scala-2.10/src/main/java/org/apache/spark/examples/streaming/JavaDirectKafkaWordCount.java which was checked in yesterday. On Sat, Feb 7, 2015 at 10:53 AM, Eduardo Costa Alfaia e.costaalf...@unibs.it wrote: Hi Ted, I’ve seen the codes, I am using

Re: How to broadcast a variable read from a file in yarn-cluster mode?

2015-02-10 Thread Jon Gregg
They're separate in my code, how can I combine them? Here's what I have: val sparkConf = new SparkConf() val ssc = new StreamingContext(sparkConf, Seconds(bucketSecs)) val sc = new SparkContext() On Tue, Feb 10, 2015 at 1:02 PM, Sandy Ryza sandy.r...@cloudera.com wrote: Is

pyspark: Java null pointer exception when accessing broadcast variables

2015-02-10 Thread rok
I'm trying to use a broadcasted dictionary inside a map function and am consistently getting Java null pointer exceptions. This is inside an IPython session connected to a standalone spark cluster. I seem to recall being able to do this before but at the moment I am at a loss as to what to try

Re: How to broadcast a variable read from a file in yarn-cluster mode?

2015-02-10 Thread Sandy Ryza
You should be able to replace that second line with val sc = ssc.sparkContext On Tue, Feb 10, 2015 at 10:04 AM, Jon Gregg jonrgr...@gmail.com wrote: They're separate in my code, how can I combine them? Here's what I have: val sparkConf = new SparkConf() val ssc = new
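Putting the pieces of this thread together, a sketch of the corrected setup (bucketSecs and the file path come from the earlier messages; the value of bucketSecs shown here is illustrative):

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val bucketSecs = 60   // illustrative batch interval
    val sparkConf = new SparkConf()
    val ssc = new StreamingContext(sparkConf, Seconds(bucketSecs))

    // reuse the StreamingContext's SparkContext instead of constructing a second one
    val sc = ssc.sparkContext
    val badIPs = sc.textFile("/user/sb/badfullIPs.csv").collect()
    val badIPsBC = sc.broadcast(badIPs.toSet)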

Re: How to broadcast a variable read from a file in yarn-cluster mode?

2015-02-10 Thread Jon Gregg
OK that worked and getting close here ... the job ran successfully for a bit and I got output for the first couple buckets before getting a java.lang.Exception: Could not compute split, block input-0-1423593163000 not found error. So I bumped up the memory at the command line from 2 gb to 5 gb,

Exception when trying to use EShadoop connector and writing rdd to ES

2015-02-10 Thread shahid
INFO scheduler.TaskSetManager: Starting task 2.1 in stage 2.0 (TID 9, ip-10-80-98-118.ec2.internal, PROCESS_LOCAL, 1025 bytes) 15/02/10 15:54:08 INFO scheduler.TaskSetManager: Lost task 1.0 in stage 2.0 (TID 6) on executor ip-10-80-15-145.ec2.internal: org.apache.spark.SparkException (Data of type

Why can't Spark find the classes in this Jar?

2015-02-10 Thread Abe Handler
I am new to spark. I am trying to compile and run a spark application that requires classes from an (external) jar file on my local machine. If I open the jar (on ~/Desktop) I can see the missing class in the local jar but when I run spark I get NoClassDefFoundError:

Open file limit settings for Spark on Yarn job

2015-02-10 Thread Arun Luthra
Hi, I'm running Spark on YARN from an edge node, and the tasks run on the Data Nodes. My job fails with the Too many open files error once it gets to groupByKey(). Alternatively I can make it fail immediately if I repartition the data when I create the RDD. Where do I need to make sure that

Re: Spark on very small files, appropriate use case?

2015-02-10 Thread Davies Liu
Spark is a framework for doing things in parallel very easily; it will definitely help your case. def read_file(path): lines = open(path).readlines() # bzip2 return lines filesRDD = sc.parallelize(path_to_files, N) lines = filesRDD.flatMap(read_file) Then you could do other transforms on

spark sql registerFunction with 1.2.1

2015-02-10 Thread Mohnish Kodnani
Hi, I am trying a very simple registerFunction and it is giving me errors. I have a parquet file which I register as temp table. Then I define a UDF. def toSeconds(timestamp: Long): Long = timestamp/10 sqlContext.registerFunction(toSeconds, toSeconds _) val result = sqlContext.sql(select

Re: Spark on very small files, appropriate use case?

2015-02-10 Thread Kelvin Chu
I had a similar use case before. I found: 1. textFile() produced one partition per file. It can result in many partitions. I found that calling coalesce() without shuffle helped. 2. If you used persist(), count() will do the I/O and put the result into the cache. Transformations later did computation out
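A minimal sketch of points 1 and 2 (the path and partition count are illustrative):

    // textFile() yields roughly one partition per small input file
    val lines = sc.textFile("/mnt/nfs/logs/*.bz2")

    // collapse the partitions without a shuffle, then cache and materialize
    val compacted = lines.coalesce(16, shuffle = false)
    compacted.persist()
    compacted.count()   // triggers the I/O and fills the cache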

SparkSQL + Tableau Connector

2015-02-10 Thread Todd Nist
Hi, I'm trying to understand how and what the Tableau connector to SparkSQL is able to access. My understanding is it needs to connect to the thriftserver and I am not sure how or if it exposes parquet, json, schemaRDDs, or does it only expose schemas defined in the metastore / hive. For

Spark on very small files, appropriate use case?

2015-02-10 Thread soupacabana
Hi all, I have the following use case: One job consists of reading from 500-2000 small bzipped logs that are on an NFS. (Small means the zipped logs are between 0 and 100KB; the average file size is 20KB.) We read the log lines, do some transformations, and write them to one output file. When we

Re: Can spark job server be used to visualize streaming data?

2015-02-10 Thread Kelvin Chu
Hi Su, Out of the box, no. But, I know people integrate it with Spark Streaming to do real-time visualization. It will take some work though. Kelvin On Mon, Feb 9, 2015 at 5:04 PM, Su She suhsheka...@gmail.com wrote: Hello Everyone, I was reading this blog post:

spark python exception

2015-02-10 Thread Kane Kim
sometimes I'm getting this exception: Traceback (most recent call last): File /opt/spark-1.2.0-bin-hadoop2.4/python/pyspark/daemon.py, line 162, in manager code = worker(sock) File /opt/spark-1.2.0-bin-hadoop2.4/python/pyspark/daemon.py, line 64, in worker outfile.flush() IOError:

Re: hadoopConfiguration for StreamingContext

2015-02-10 Thread Marc Limotte
Looks like the latest version 1.2.1 actually does use the configured hadoop conf. I tested it out and that does resolve my problem. thanks, marc On Tue, Feb 10, 2015 at 10:57 AM, Marc Limotte mslimo...@gmail.com wrote: Thanks, Akhil. I had high hopes for #2, but tried all and no luck. I

Error while querying hive table from spark shell

2015-02-10 Thread kundan kumar
Hi, I am getting the following error when I am trying to query a Hive table from the spark shell. I have placed my hive-site.xml in the spark/conf directory. Please suggest how to resolve this error. scala sqlContext.sql(select count(*) from offers_new).collect().foreach(println) 15/02/11 01:48:01

Re: Beginner in Spark

2015-02-10 Thread Akhil Das
You may also go through these posts https://docs.sigmoidanalytics.com/index.php/Spark_Installation Thanks Best Regards On Fri, Feb 6, 2015 at 9:39 PM, King sami kgsam...@gmail.com wrote: Hi, I'm new in Spark, I'd like to install Spark with Scala. The aim is to build a data processing system

Re: org.apache.spark.SparkException: Job aborted due to stage failure: All masters are unresponsive! Giving up.

2015-02-10 Thread Akhil Das
See this answer by Josh http://stackoverflow.com/questions/26692658/cant-connect-from-application-to-the-standalone-cluster You may also find this post useful http://mail-archives.apache.org/mod_mbox/spark-user/201407.mbox/%3c7a889b1c-aa14-4cf2-8375-37f9cf827...@gmail.com%3E Thanks Best Regards

Naive Bayes model fails after a few predictions

2015-02-10 Thread rkgurram
Hi, I have built a Sentiment Analyzer using the Naive Bayes model. The model works fine, learning from a list of 200 movie reviews and predicting correctly with an accuracy of close to 77% to 80%. After a while of predicting I get the following stacktrace... By the way, I have only one

R: Datastore HDFS vs Cassandra

2015-02-10 Thread Paolo Platter
Hi Mike, I developed a solution with Cassandra and Spark, using DSE. The main difficulty is with Cassandra: you need to understand its data model and its query patterns very well. Cassandra has better performance than HDFS and it has DR and stronger availability. HDFS is a filesystem, Cassandra

Re: Beginner in Spark

2015-02-10 Thread prabeesh k
Refer this blog http://blog.prabeeshk.com/blog/2014/10/31/install-apache-spark-on-ubuntu-14-dot-04/ for step by step installation of Spark on Ubuntu On 7 February 2015 at 03:12, Matei Zaharia matei.zaha...@gmail.com wrote: You don't need HDFS or virtual machines to run Spark. You can just

Re: Which version to use for shuffle service if I'm going to run multiple versions of Spark

2015-02-10 Thread Reynold Xin
I think we made the binary protocol compatible across all versions, so you should be fine with using any one of them. 1.2.1 is probably the best since it is the most recent stable release. On Tue, Feb 10, 2015 at 8:43 PM, Jianshi Huang jianshi.hu...@gmail.com wrote: Hi, I need to use

Re: spark, reading from s3

2015-02-10 Thread Akhil Das
It's the clock skew, actually; you can either use NTP to maintain an accurate system clock or adjust your system time to match the AWS one. You can do it as: telnet s3.amazonaws.com 80 GET / HTTP/1.0 Thanks Best Regards On Wed, Feb 11, 2015 at 6:43 AM,

Re: SparkSQL + Tableau Connector

2015-02-10 Thread Arush Kharbanda
I am a little confused here: why do you want to create the tables in Hive? You want to create the tables in spark-sql, right? If you are not able to find the same tables through Tableau, then thrift is connecting to a different metastore than your spark-shell. One way to specify a metastore to

Re: Error while querying hive table from spark shell

2015-02-10 Thread Arush Kharbanda
It seems that the HDFS path for the table doesn't contain any files/data. Does the metastore contain the right path for the HDFS data? You can find the HDFS path in TBLS in your metastore. On Wed, Feb 11, 2015 at 12:20 PM, kundan kumar iitr.kun...@gmail.com wrote: Hi , I am getting the following

Re: SparkSQL + Tableau Connector

2015-02-10 Thread Arush Kharbanda
BTW what tableau connector are you using? On Wed, Feb 11, 2015 at 12:55 PM, Arush Kharbanda ar...@sigmoidanalytics.com wrote: I am a little confused here, why do you want to create the tables in hive. You want to create the tables in spark-sql, right? If you are not able to find the same

Which version to use for shuffle service if I'm going to run multiple versions of Spark

2015-02-10 Thread Jianshi Huang
Hi, I need to use branch-1.2 and sometimes master builds of Spark for my project. However the officially supported Spark version by our Hadoop admin is only 1.2.0. So, my question is which version/build of spark-yarn-shuffle.jar should I use that works for all four versions? (1.2.0, 1.2.1,

Re: can we insert and update with spark sql

2015-02-10 Thread Michael Armbrust
You should look at https://github.com/amplab/spark-indexedrdd On Tue, Feb 10, 2015 at 2:27 PM, Debasish Das debasish.da...@gmail.com wrote: Hi Michael, I want to cache a RDD and define get() and set() operators on it. Basically like memcached. Is it possible to build a memcached like

Re: Will Spark serialize an entire Object or just the method referred in an object?

2015-02-10 Thread Marcelo Vanzin
Hi Yitong, It's not as simple as that. In your very simple example, the only things referenced by the closure are (i) the input arguments and (ii) a Scala object. So there are no external references to serialize in that case, just the closure instance itself - see, there is still something being

Re: can we insert and update with spark sql

2015-02-10 Thread Debasish Das
Hi Michael, I want to cache a RDD and define get() and set() operators on it. Basically like memcached. Is it possible to build a memcached like distributed cache using Spark SQL ? If not what do you suggest we should use for such operations... Thanks. Deb On Fri, Jul 18, 2014 at 1:00 PM,

Re: Beginner in Spark

2015-02-10 Thread King sami
2015-02-06 17:28 GMT+00:00 King sami kgsam...@gmail.com: The purpose is to build a data processing system for door events. An event will describe a door unlocking with a badge system. This event will differentiate unlocking by somebody from the inside and by somebody from the outside.

Re: pyspark: Java null pointer exception when accessing broadcast variables

2015-02-10 Thread Rok Roskar
I get this in the driver log: java.lang.NullPointerException at org.apache.spark.api.python.PythonRDD$.writeUTF(PythonRDD.scala:590) at org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1$$anonfun$apply$mcV$sp$3.apply(PythonRDD.scala:233) at

Re: ZeroMQ and pyspark.streaming

2015-02-10 Thread Arush Kharbanda
No, the ZeroMQ API is not supported in Python as of now. On 5 Feb 2015 21:27, Sasha Kacanski skacan...@gmail.com wrote: Does pyspark support zeroMQ? I see that Java does it, but I am not sure for Python? regards -- Aleksandar Kacanski

Re: SparkSQL + Tableau Connector

2015-02-10 Thread Arush Kharbanda
1. Can the connector fetch or query schemaRDD's saved to Parquet or JSON files? NO 2. Do I need to do something to expose these via hive / metastore other than creating a table in hive? Create a table in spark sql to expose via spark sql 3. Does the thriftserver need to be configured to expose

Re: spark sql registerFunction with 1.2.1

2015-02-10 Thread Michael Armbrust
The simple SQL parser doesn't yet support UDFs. Try using a HiveContext. On Tue, Feb 10, 2015 at 1:44 PM, Mohnish Kodnani mohnish.kodn...@gmail.com wrote: Hi, I am trying a very simple registerFunction and it is giving me errors. I have a parquet file which I register as temp table. Then I
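A sketch of the suggestion, reusing the toSeconds function from the question (the divisor, table, and column names are illustrative since the original was truncated):

    import org.apache.spark.sql.hive.HiveContext

    val hiveContext = new HiveContext(sc)   // sc is an existing SparkContext
    def toSeconds(timestamp: Long): Long = timestamp / 1000
    hiveContext.registerFunction("toSeconds", toSeconds _)

    // assumes the parquet data was registered as a temp table named logs with a ts column
    hiveContext.sql("SELECT toSeconds(ts) FROM logs").collect().foreach(println)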

Re: SparkSQL + Tableau Connector

2015-02-10 Thread Silvio Fiorito
Hi Todd, What you could do is run some SparkSQL commands immediately after the Thrift server starts up. Or does Tableau have some init SQL commands you could run? You can actually load data using SQL, such as: create temporary table people using org.apache.spark.sql.json options (path
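As a sketch, the same kind of temporary-table registration can also be issued programmatically (the path and table name mirror the examples later in this thread):

    // register a directory of JSON files as a temporary table, cache it, and query it
    sqlContext.sql("create temporary table people using org.apache.spark.sql.json options (path 'examples/src/main/resources/json/*')")
    sqlContext.sql("cache table people")
    sqlContext.sql("select count(1) from people").collect().foreach(println)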

Spark Summit East - March 18-19 - NYC

2015-02-10 Thread Scott walent
The inaugural Spark Summit East, an event to bring the Apache Spark community together, will be in New York City on March 18, 2015. We are excited about the growth of Spark and to bring the event to the east coast. At Spark Summit East you can look forward to hearing from Matei Zaharia,

hadoopConfiguration for StreamingContext

2015-02-10 Thread Marc Limotte
I see that SparkContext has a hadoopConfiguration() method, which can be used like this sample I found: sc.hadoopConfiguration().set(fs.s3.awsAccessKeyId, XX); sc.hadoopConfiguration().set(fs.s3.awsSecretAccessKey, XX); But StreamingContext doesn't have the same thing. I want to

Re: SparkSQL + Tableau Connector

2015-02-10 Thread Silvio Fiorito
Todd, I just tried it in bin/spark-sql shell. I created a folder json and just put 2 copies of the same people.json file This is what I ran: spark-sql create temporary table people using org.apache.spark.sql.json options (path 'examples/src/main/resources/json/*')

Re: pyspark: Java null pointer exception when accessing broadcast variables

2015-02-10 Thread Davies Liu
It's brave to broadcast 8G of pickled data; it will take more than 15G of memory for each Python worker. How much memory do you have in the executor and driver? Do you see any other exceptions in the driver and executors? Something related to serialization in the JVM. On Tue, Feb 10, 2015 at 2:16 PM, Rok Roskar

Re: SparkSQL + Tableau Connector

2015-02-10 Thread Todd Nist
Arush, As for #2 do you mean something like this from the docs: // sc is an existing SparkContext.val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc) sqlContext.sql(CREATE TABLE IF NOT EXISTS src (key INT, value STRING))sqlContext.sql(LOAD DATA LOCAL INPATH

Spark installation

2015-02-10 Thread King sami
Hi, I'm new to Spark. I want to install it on my local machine (Ubuntu 12.04). Could you please help me install Spark step by step on my machine and run some Scala programs? Thanks

Re: can we insert and update with spark sql

2015-02-10 Thread Debasish Das
Thanks...this is what I was looking for... It will be great if Ankur can give brief details about it...Basically how does it contrast with memcached for example... On Tue, Feb 10, 2015 at 2:32 PM, Michael Armbrust mich...@databricks.com wrote: You should look at

Re: SparkSQL + Tableau Connector

2015-02-10 Thread Todd Nist
Arush, Thank you will take a look at that approach in the morning. I sort of figured the answer to #1 was NO and that I would need to do 2 and 3 thanks for clarifying it for me. -Todd On Tue, Feb 10, 2015 at 5:24 PM, Arush Kharbanda ar...@sigmoidanalytics.com wrote: 1. Can the connector

Re: can we insert and update with spark sql

2015-02-10 Thread Debasish Das
Also I wanted to run get() and set() from mapPartitions (from spark workers and not master)... To be able to do that I think I have to create a separate spark context for the cache... But I am not sure how SparkContext from job1 can access SparkContext from job2 ! On Tue, Feb 10, 2015 at 3:25

Re: Spark installation

2015-02-10 Thread Mohit Singh
For a local machine, I don't think there is anything to install. Just unzip, go to $SPARK_DIR/bin/spark-shell, and that will open up a REPL. On Tue, Feb 10, 2015 at 3:25 PM, King sami kgsam...@gmail.com wrote: Hi, I'm new in Spark. I want to install it on my local machine (Ubunti 12.04) Could

Re: SparkSQL + Tableau Connector

2015-02-10 Thread Todd Nist
Hi Silvio, Ah, I like that; there is a section in Tableau for Initial SQL to be executed upon connecting, and this would fit well there. I guess I will need to issue a collect(), coalesce(1,true).saveAsTextFile(...) or use repartition(1), as the file currently is being broken into multiple parts.
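A sketch of the single-file write mentioned above (result and the output path are illustrative):

    // collapse to a single partition (with a shuffle) so only one part file is written
    result.coalesce(1, shuffle = true).saveAsTextFile("/data/out/single")
    // or, equivalently: result.repartition(1).saveAsTextFile("/data/out/single")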

Re: Open file limit settings for Spark on Yarn job

2015-02-10 Thread Sandy Ryza
Hi Arun, The limit for the YARN user on the cluster nodes should be all that matters. What version of Spark are you using? If you can turn on sort-based shuffle it should solve this problem. -Sandy On Tue, Feb 10, 2015 at 1:16 PM, Arun Luthra arun.lut...@gmail.com wrote: Hi, I'm running

Re: Writable serialization from InputFormat losing fields

2015-02-10 Thread Corey Nolet
I am able to get around the problem by doing a map and getting the Event out of the EventWritable before I do my collect. I think I'll do that for now. On Tue, Feb 10, 2015 at 6:04 PM, Corey Nolet cjno...@gmail.com wrote: I am using an input format to load data from Accumulo [1] in to a Spark

spark, reading from s3

2015-02-10 Thread Kane Kim
I'm getting this warning when using s3 input: 15/02/11 00:58:37 WARN RestStorageService: Adjusted time offset in response to RequestTimeTooSkewed error. Local machine and S3 server disagree on the time by approximately 0 seconds. Retrying connection. After that there are tons of 403/forbidden