Re: SparkSQL schemaRDD & MapPartitions calls - performance issues - columnar formats?

2015-01-15 Thread Nathan McCarthy
Thanks Cheng! Is there any API I can get access to (e.g. ParquetTableScan) which would allow me to load up the low level/baseRDD of just RDD[Row] so I could avoid the defensive copy (maybe lose out on columnar storage etc.). We have parts of our pipeline using SparkSQL/SchemaRDDs and others us

MatchError in JsonRDD.toLong

2015-01-15 Thread Tobias Pfeiffer
Hi, I am experiencing a weird error that suddenly popped up in my unit tests. I have a couple of HDFS files in JSON format and my test is basically creating a JsonRDD and then issuing a very simple SQL query over it. This used to work fine, but now suddenly I get: 15:58:49.039 [Executor task laun

Re: Low Level Kafka Consumer for Spark

2015-01-15 Thread Akhil Das
There was a simple example which you can run after changing a few lines of configuration. Thanks Best Regards On Fri, Jan 16, 2015 at 12:23 PM, Dibyendu Bhattacharya < dibyendu.bhattach...

UserGroupInformation: PriviledgedActionException as

2015-01-15 Thread spraveen
Hi, When I try to run a program on a remote Spark machine I get the exception below: 15/01/16 11:14:39 ERROR UserGroupInformation: PriviledgedActionException as:user1 (auth:SIMPLE) cause:java.util.concurrent.TimeoutException: Futures timed out after [30 seconds] Exception in thread

RE: Spark SQL Custom Predicate Pushdown

2015-01-15 Thread Cheng, Hao
The Data Source API probably works for this purpose. It supports column pruning and predicate push-down: https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/sources/interfaces.scala Examples can also be found in the unit tests: https://github.com/apache/sp

Re: Low Level Kafka Consumer for Spark

2015-01-15 Thread Dibyendu Bhattacharya
Hi Kidong, Just now I tested the Low Level Consumer with Spark 1.2 and I did not see any issue with the Receiver.Store method. It is able to fetch messages from Kafka. Can you cross-check other configurations in your setup like Kafka broker IP, topic name, zk host details, consumer id etc. Dib On

Re: PySpark saveAsTextFile gzip

2015-01-15 Thread Akhil Das
You can use saveAsNewAPIHadoopFile. You can use it for compressing your output; here's a sample code to use the AP
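For reference, a minimal sketch of compressed text output shown here in Scala rather than PySpark, assuming the sc from spark-shell and a placeholder output path:

    import org.apache.hadoop.io.compress.GzipCodec

    val rdd = sc.parallelize(Seq("line1", "line2"))
    // Each part file is written gzip-compressed (part-00000.gz, ...)
    rdd.saveAsTextFile("/tmp/output-gz", classOf[GzipCodec])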

Re: How to query Spark master for cluster status ?

2015-01-15 Thread Akhil Das
There's a JSON endpoint in the web UI (running on port 8080): http://masterip:8080/json/ Thanks Best Regards On Thu, Jan 15, 2015 at 6:30 PM, Shing Hing Man wrote: > Hi, > > I am using Spark 1.2. The Spark master UI has a status. > Is there a web service on the Spark master that retur
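A quick sketch of polling that endpoint from Scala; the host and port are placeholders for your standalone master:

    import scala.io.Source

    // Returns a JSON document describing workers, cores, memory and running applications
    val status = Source.fromURL("http://masterip:8080/json/").mkString
    println(status)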

Re: OutOfMemory error in Spark Core

2015-01-15 Thread Akhil Das
Did you try increasing the parallelism? Thanks Best Regards On Fri, Jan 16, 2015 at 10:41 AM, Anand Mohan wrote: > We have our Analytics App built on Spark 1.1 Core, Parquet, Avro and Spray. > We are using Kryo serializer for the Avro objects read from Parquet and we > are using our custom Kryo

Re: Low Level Kafka Consumer for Spark

2015-01-15 Thread Dibyendu Bhattacharya
Hi Kidong, No, I have not tried with Spark 1.2 yet. I will try this out and let you know how it goes. By the way, did any change to the Receiver Store method happen in Spark 1.2? Regards, Dibyendu On Fri, Jan 16, 2015 at 11:25 AM, mykidong wrote: > Hi Dibyendu, > > I am using k

Re: Low Level Kafka Consumer for Spark

2015-01-15 Thread mykidong
Hi Dibyendu, I am using kafka 0.8.1.1 and spark 1.2.0. After modifying these versions in your pom, I rebuilt your code. But I have not got any messages from ssc.receiverStream(new KafkaReceiver(_props, i)). I have found that in your code all the messages are retrieved correctly, but _receiver.

Spark SQL Custom Predicate Pushdown

2015-01-15 Thread Corey Nolet
I have document storage services in Accumulo that I'd like to expose to Spark SQL. I am able to push down predicate logic to Accumulo to have it perform only the seeks necessary on each tablet server to grab the results being asked for. I'm interested in using Spark SQL to push those predicates do

OutOfMemory error in Spark Core

2015-01-15 Thread Anand Mohan
We have our Analytics App built on Spark 1.1 Core, Parquet, Avro and Spray. We are using Kryo serializer for the Avro objects read from Parquet and we are using our custom Kryo registrator (along the lines of ADAM

Re: Some tasks are taking long time

2015-01-15 Thread Ajay Srivastava
Thanks Nicos. GC does not contribute much to the execution time of the task. I will debug it further today. Regards, Ajay On Thursday, January 15, 2015 11:55 PM, Nicos wrote: Ajay, Unless we are dealing with some synchronization/conditional variable bug in Spark, try this per tuning

Re: Accumulators

2015-01-15 Thread Imran Rashid
Your understanding is basically correct. Each task creates its own local accumulator, and just those results get merged together. However, there are some performance limitations to be aware of. First you need enough memory on the executors to build up whatever those intermediate results are.
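A minimal sketch of that per-task-then-merge behavior, assuming the sc from spark-shell:

    val counter = sc.accumulator(0)            // created on the driver
    sc.parallelize(1 to 100, 4).foreach { i =>
      if (i % 10 == 0) counter += 1            // each task updates its own local copy
    }
    println(counter.value)                     // local copies are merged back on the driver: prints 10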

Re: Testing if an RDD is empty?

2015-01-15 Thread Sean Owen
How about checking whether take(1).length == 0? If I read the code correctly, this will only examine the first partition, at least. On Fri, Jan 16, 2015 at 4:12 AM, Tobias Pfeiffer wrote: > Hi, > > On Fri, Jan 16, 2015 at 7:31 AM, freedafeng wrote: >> >> I myself saw many times that my app threw
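A minimal sketch of that check (Spark 1.2 has no built-in RDD.isEmpty; the helper name here is made up):

    import org.apache.spark.rdd.RDD

    // Only materializes at most one element from the first non-empty partition
    def isRddEmpty[T](rdd: RDD[T]): Boolean = rdd.take(1).isEmpty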

Re: Executor vs Mapper in Hadoop

2015-01-15 Thread Sean Owen
An executor is specific to a Spark application, just as a mapper is specific to a MapReduce job. So a machine will usually be running many executors, and each is a JVM. A Mapper is single-threaded; an executor can run many tasks (possibly from different jobs within the application) at once. Yes, 5

Re: Testing if an RDD is empty?

2015-01-15 Thread Tobias Pfeiffer
Hi, On Fri, Jan 16, 2015 at 7:31 AM, freedafeng wrote: > > I myself saw many times that my app threw out exceptions because an empty > RDD cannot be saved. This is not big issue, but annoying. Having a cheap > solution testing if an RDD is empty would be nice if there is no such thing > available

Re: How to force parallel processing of RDD using multiple thread

2015-01-15 Thread Sean Owen
Check the number of partitions in your input. It may be much less than the available parallelism of your small cluster. For example, input that lives in just 1 partition will spawn just 1 task. Beyond that parallelism just happens. You can see the parallelism of each operation in the Spark UI. On
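A quick sketch of checking and raising input parallelism, assuming the sc from spark-shell; the path and partition count are placeholders:

    val lines = sc.textFile("hdfs:///data/input")
    println(lines.partitions.size)        // the number of tasks the first stage will get
    val wider = lines.repartition(16)     // force more partitions if the input had too few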

Re: Anyway to make RDD preserve input directories structures?

2015-01-15 Thread Sean Owen
Maybe you are saying you already do this, but it's perfectly possible to process as many RDDs as you like in parallel on the driver. That may allow your current approach to eat up as much parallelism as you like. I'm not sure if that's what you are describing with "submit multi applications" but yo

RE: using hiveContext to select a nested Map-data-type from an AVROmodel+parquet file

2015-01-15 Thread Cheng, Hao
Hi, BB Ideally you can do the query like: select key, value.percent from mytable_data lateral view explode(audiences) f as key, value limit 3; But there is a bug in HiveContext: https://issues.apache.org/jira/browse/SPARK-5237 I am working on it now; hopefully I'll have a patch soon. Cheng H

Anyway to make RDD preserve input directories structures?

2015-01-15 Thread 逸君曹
Say there are some logs: s3://log-collections/sys1/20141212/nginx.gz s3://log-collections/sys1/20141213/nginx-part-1.gz s3://log-collections/sys1/20141213/nginx-part-2.gz I have a function that parses the logs for later analysis. I want to parse all the files. So I do this: logs = sc.textFile('s3:/

Re: saveAsTextFile

2015-01-15 Thread ankits
I have seen this happen when the RDD contains null values. Essentially, saveAsTextFile calls toString() on the elements of the RDD, so a call to null.toString will result in an NPE. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/saveAsTextFile-tp20951p21178
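A minimal sketch of guarding against that NPE before saving, assuming the sc from spark-shell and a placeholder path:

    val rdd = sc.parallelize(Seq("a", null, "b"))
    // Drop nulls so toString is never called on them during the save
    rdd.filter(_ != null).saveAsTextFile("/tmp/output")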

RE: Issue with Parquet on Spark 1.2 and Amazon EMR

2015-01-15 Thread Bozeman, Christopher
Thanks to Aniket’s work there are two new options in the EMR install script for Spark. See https://github.com/awslabs/emr-bootstrap-actions/blob/master/spark/README.md The “-a” option can be used to bump the spark-assembly to the front of the classpath. -Christopher From: Aniket Bhatnagar [

spark streaming python files not packaged in assembly jar

2015-01-15 Thread jamborta
Hi all, just discovered that the streaming folder in pyspark is not included in the assembly jar (spark-assembly-1.2.0-hadoop2.3.0.jar), but included in the python folder. Any reason why? Thanks, -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/spark-strea

Re: save spark streaming output to single file on hdfs

2015-01-15 Thread jamborta
thanks for the replies. very useful. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/save-spark-streaming-output-to-single-file-on-hdfs-tp21124p21176.html Sent from the Apache Spark User List mailing list archive at Nabble.com. -

Re: different akka versions and spark

2015-01-15 Thread Koert Kuipers
And to make things even more interesting: the CDH *5.3* version of Spark 1.2 differs from the Apache Spark 1.2 release in using Akka version 2.2.3, the version used by Spark 1.1 and CDH 5.2. Apache Spark 1.2 uses Akka version 2.3.4. So I just compiled a program that uses Akka against Apache Spark

RE: How to force parallel processing of RDD using multiple thread

2015-01-15 Thread Wang, Ningjun (LNG-NPV)
Spark standalone cluster. My program is running very slowly, and I suspect it is not doing parallel processing of the RDD. How can I force it to run in parallel? Is there any way to check whether it is processed in parallel? Regards, Ningjun Wang Consulting Software Engineer LexisNexis 121 Chanlon Road New

Re: Testing if an RDD is empty?

2015-01-15 Thread freedafeng
I think Sampo's thought is to get a function that only tests if an RDD is empty. He does not want to know the size of the RDD, and getting the size of an RDD is expensive for large data sets. I myself saw many times that my app threw out exceptions because an empty RDD cannot be saved. This is not

Re: How to force parallel processing of RDD using multiple thread

2015-01-15 Thread Sean Owen
What is your cluster manager? For example on YARN you would specify --executor-cores. Read: http://spark.apache.org/docs/latest/running-on-yarn.html On Thu, Jan 15, 2015 at 8:54 PM, Wang, Ningjun (LNG-NPV) wrote: > I have a standalone spark cluster with only one node with 4 CPU cores. How > can I

RE: Executor parameter doesn't work for Spark-shell on EMR Yarn

2015-01-15 Thread Shuai Zheng
I figured out the second question: if I don't pass in the number of partitions for the test data, it will by default assume the max number of executors (although I don't know what this default max is). val lines = sc.parallelize(List("-240990|161327,9051480,0,2,30.48,75", "-240990|161324,9051480,0

Distributing Computation across slaves

2015-01-15 Thread Steve Lewis
I have a job involving two sets of data indexed with the same type of key. I have an expensive operation that I want to run on pairs sharing the same key. The following code works BUT all of the work is being done on 3 of 16 processors. How do I go about diagnosing and fixing this behavior? A

RE: Executor parameter doesn't work for Spark-shell on EMR Yarn

2015-01-15 Thread Shuai Zheng
Forgot to mention: I use EMR AMI 3.3.1, Spark 1.2.0, Yarn 2.4. Spark is set up by the standard script: s3://support.elasticmapreduce/spark/install-spark From: Shuai Zheng [mailto:szheng.c...@gmail.com] Sent: Thursday, January 15, 2015 3:52 PM To: user@spark.apache.org Subject: Executor p

How to force parallel processing of RDD using multiple thread

2015-01-15 Thread Wang, Ningjun (LNG-NPV)
I have a standalone spark cluster with only one node with 4 CPU cores. How can I force spark to do parallel processing of my RDD using multiple threads? For example I can do the following: Spark-submit --master local[4] However I really want to use the cluster as follows: Spark-submit --master

Executor parameter doesn't work for Spark-shell on EMR Yarn

2015-01-15 Thread Shuai Zheng
Hi All, I am testing Spark on an EMR cluster. The env is a one-node cluster, r3.8xlarge, with 32 vCores and 244G of memory. But the command line I use to start up spark-shell doesn't work. For example: ~/spark/bin/spark-shell --jars /home/hadoop/vrisc-lib/aws-java-sdk-1.9.14/lib/*.jar --num-execut

Executor vs Mapper in Hadoop

2015-01-15 Thread Shuai Zheng
Hi All, I am trying to clarify some executor behavior in Spark. Because I come from a Hadoop background, I try to compare it to the Mapper (or Reducer) in Hadoop. 1. Each node can have multiple executors, each running in its own process? This is the same as a mapper process. 2. I thought the sp

Re: SQL JSON array operations

2015-01-15 Thread jvuillermet
Yeah, that's where I ended up. Thanks! I'll give it a try. On Thu, Jan 15, 2015 at 8:46 PM, Ayoub [via Apache Spark User List] < ml-node+s1001560n21172...@n3.nabble.com> wrote: > You could try to use hive context which bring HiveQL, it would allow you > to query nested structures using "LATERAL V

Re: SQL JSON array operations

2015-01-15 Thread Ayoub
You could try to use the hive context, which brings HiveQL; it would allow you to query nested structures using "LATERAL VIEW explode..." see doc here -- View this message in context: http://apache-spark-user-list.100

Re: Error when running SparkPi on Secure HA Hadoop cluster

2015-01-15 Thread Marcelo Vanzin
You're specifying the queue in the spark-submit command line: --queue thequeue Are you sure that queue exists? On Thu, Jan 15, 2015 at 11:23 AM, Manoj Samel wrote: > Hi, > > Setup is as follows > > Hadoop Cluster 2.3.0 (CDH5.0) > - Namenode HA > - Resource manager HA > - Secured with Kerbero

Error when running SparkPi on Secure HA Hadoop cluster

2015-01-15 Thread Manoj Samel
Hi, Setup is as follows Hadoop Cluster 2.3.0 (CDH5.0) - Namenode HA - Resource manager HA - Secured with Kerberos Spark 1.2 Run SparkPi as follows - conf/spark-defaults.conf has following entries spark.yarn.queue myqueue spark.yarn.access.namenodes hdfs://namespace (remember this is namenode HA

Re: Some tasks are taking long time

2015-01-15 Thread Nicos
Ajay, Unless we are dealing with some synchronization/conditional variable bug in Spark, try this per tuning guide: Cache Size Tuning One important configuration parameter for GC is the amount of memory that should be used for caching RDDs. By default, Spark uses 60% of the configured e
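A sketch of that cache-size knob as a Spark 1.x configuration setting (the app name is a placeholder):

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .setAppName("gc-tuning")
      .set("spark.storage.memoryFraction", "0.4")  // lower the 0.6 default to leave more heap for task execution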

Re: Is spark suitable for large scale pagerank, such as 200 million nodes, 2 billion edges?

2015-01-15 Thread Ted Yu
Have you seen http://search-hadoop.com/m/JW1q5pE3P12 ? Please also take a look at the end-to-end performance graph on http://spark.apache.org/graphx/ Cheers On Thu, Jan 15, 2015 at 9:29 AM, txw wrote: > Hi, > > > I am run PageRank on a large dataset, which include 200 million nodes and > 2 bil

Re: dockerized spark executor on mesos?

2015-01-15 Thread Tim Chen
Just throwing this out here: there is an existing PR to add docker support for the spark framework to launch executors with a docker image. https://github.com/apache/spark/pull/3074 Hopefully this will be merged sometime. Tim On Thu, Jan 15, 2015 at 9:18 AM, Nicholas Chammas < nicholas.cham...@gmail.com

Re: Running "beyond memory limits" in ConnectedComponents

2015-01-15 Thread Sean Owen
If you give the executor 22GB, it will run with "... -Xmx22g". If the JVM heap gets nearly full, it will almost certainly consume more than 22GB of physical memory, because the JVM needs memory for more than just heap. But in this scenario YARN was only asked for 22GB and it gets killed. This is ex

Re: spark crashes on second or third call first() on file

2015-01-15 Thread Davies Liu
What's the version of Spark you are using? On Wed, Jan 14, 2015 at 12:00 AM, Linda Terlouw wrote: > I'm new to Spark. When I use the Movie Lens dataset 100k > (http://grouplens.org/datasets/movielens/), Spark crashes when I run the > following code. The first call to movieData.first() gives the c

Re: Running "beyond memory limits" in ConnectedComponents

2015-01-15 Thread Nitin kak
Replying to all. Is this "Overhead memory" allocation used for any specific purpose? For example, will it be any different if I do *"--executor-memory 22G"* with overhead set to 0% (hypothetically) vs "*--executor-memory 20G*" and overhead memory at the default (9%), which eventually brings the total

Re: Testing if an RDD is empty?

2015-01-15 Thread Al M
You can also check rdd.partitions.size. This will be 0 for empty RDDs and > 0 for RDDs with data. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Testing-if-an-RDD-is-empty-tp1678p21170.html Sent from the Apache Spark User List mailing list archive at Nabbl

Is spark suitable for large scale pagerank, such as 200 million nodes, 2 billion edges?

2015-01-15 Thread txw
Hi, I am running PageRank on a large dataset, which includes 200 million nodes and 2 billion edges. Is Spark suitable for large-scale PageRank? How many cores and how much memory do I need, and how long will it take? Thanks Xuewei Tang

Re: dockerized spark executor on mesos?

2015-01-15 Thread Nicholas Chammas
The AMPLab maintains a bunch of Docker files for Spark here: https://github.com/amplab/docker-scripts Hasn't been updated since 1.0.0, but might be a good starting point. On Wed Jan 14 2015 at 12:14:13 PM Josh J wrote: > We have dockerized Spark Master and worker(s) separately and are using it

Re: SQL JSON array operations

2015-01-15 Thread Ayoub Benali
You could try to use the hive context, which brings HiveQL; it would allow you to query nested structures using "LATERAL VIEW explode..." On Jan 15, 2015 4:03 PM, "jvuillermet" wrote: > let's say my json file lines looks like this > > {"user": "baz", "tags" : ["foo", "bar"] } > > > sqlContext.json
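A minimal sketch of that suggestion against the JSON shown in the original question, assuming the sc from spark-shell and a Spark 1.2 build with Hive support; the file path is a placeholder:

    import org.apache.spark.sql.hive.HiveContext

    val hiveCtx = new HiveContext(sc)
    hiveCtx.jsonFile("data.json").registerTempTable("users")
    // explode(tags) turns each array element into its own row, which we can then filter on
    hiveCtx.sql(
      "SELECT user FROM users LATERAL VIEW explode(tags) t AS tag WHERE tag = 'bar'"
    ).collect().foreach(println)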

Re: Running "beyond memory limits" in ConnectedComponents

2015-01-15 Thread Sean Owen
This is a YARN setting. It just controls how much any container can reserve, including Spark executors. That is not the problem. You need Spark to ask for more memory from YARN, on top of the memory that is requested by --executor-memory. Your output indicates the default of 7% is too little. For
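A sketch of raising that overhead via spark-defaults.conf; the property name is the Spark 1.2 one, the value is in megabytes, and 2048 is just an example:

    spark.yarn.executor.memoryOverhead   2048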

Re: small error in the docs?

2015-01-15 Thread Sean Owen
Yes that's a typo. The API docs and source code are correct though. http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.rdd.PairRDDFunctions That and your IDE should show the correct signature. You can open a PR to fix the typo in https://spark.apache.org/docs/latest/programm
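For illustration, the corrected return type in a small Scala snippet, assuming the sc from spark-shell:

    import org.apache.spark.SparkContext._   // pair RDD functions

    val a = sc.parallelize(Seq((1, "x"), (2, "y")))
    val b = sc.parallelize(Seq((1, 10), (3, 30)))
    // The two Iterables arrive as a pair: RDD[(K, (Iterable[V], Iterable[W]))]
    val grouped: org.apache.spark.rdd.RDD[(Int, (Iterable[String], Iterable[Int]))] = a.cogroup(b)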

using hiveContext to select a nested Map-data-type from an AVROmodel+parquet file

2015-01-15 Thread BB
Hi all, Any help on the following is very much appreciated. = Problem: On a schemaRDD read from a parquet file (data within the file uses the AVRO model) using the HiveContext: I can't figure out how to 'select' or use a 'where' clause to filter rows on a field that

small error in the docs?

2015-01-15 Thread kirillfish
The cogroup() function seems to return (K, (Iterable, Iterable)), rather than (K, Iterable, Iterable) as stated in the docs (at least for version 1.1.0): https://spark.apache.org/docs/1.1.0/programming-guide.html This simple

Re: save spark streaming output to single file on hdfs

2015-01-15 Thread Prannoy
Hi, You can use the FileUtil.copyMerge API and specify the path to the folder where saveAsTextFile saved the part text files. Suppose your directory is /a/b/c/; use FileUtil.copyMerge(FileSystem of source, a/b/c, FileSystem of destination, Path to the merged file say (a/b/c.txt), true (to delete the
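A minimal Scala sketch of that call (Hadoop 2.x API; the paths are placeholders):

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.{FileSystem, FileUtil, Path}

    val conf = new Configuration()
    val fs = FileSystem.get(conf)
    // Concatenate the part-xxxxx files under /a/b/c into the single file /a/b/c.txt
    FileUtil.copyMerge(fs, new Path("/a/b/c"), fs, new Path("/a/b/c.txt"),
      true /* delete the source part files */, conf, null)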

Re: Running "beyond memory limits" in ConnectedComponents

2015-01-15 Thread Nitin kak
I am sorry for the formatting error, the value for *yarn.scheduler.maximum-allocation-mb = 28G* On Thu, Jan 15, 2015 at 11:31 AM, Nitin kak wrote: > Thanks for sticking to this thread. > > I am guessing what memory my app requests and what Yarn requests on my > part should be same and is determi

Re: Running "beyond memory limits" in ConnectedComponents

2015-01-15 Thread Nitin kak
Thanks for sticking to this thread. I am guessing what memory my app requests and what Yarn requests on my part should be the same and is determined by the value of *--executor-memory*, which I had set to *20G*. Or can the two values be different? I checked the Yarn configurations (below), so I think th

Re: Inserting an element in RDD[String]

2015-01-15 Thread Hafiz Mujadid
thanks On Thu, Jan 15, 2015 at 7:35 PM, Prannoy [via Apache Spark User List] < ml-node+s1001560n21163...@n3.nabble.com> wrote: > Hi, > > You can take the schema line in another rdd and then do a union of the two > rdds. > > List schemaList = new ArrayList; > schemaList.add("xyz"); > > // where xy

Re: Running "beyond memory limits" in ConnectedComponents

2015-01-15 Thread Sean Owen
Those settings aren't relevant, I think. You're concerned with what your app requests, and what Spark requests of YARN on your behalf. (Of course, you can't request more than what your cluster allows for a YARN container for example, but that doesn't seem to be what is happening here.) You do not

SQL JSON array operations

2015-01-15 Thread jvuillermet
Let's say my json file lines look like this {"user": "baz", "tags" : ["foo", "bar"] } sqlContext.jsonFile("data.json") ... How could I query for users with the "bar" tag using SQL? sqlContext.sql("select user from users where tags ?contains? 'bar' ") I could simplify the request and use the re

Re: Some tasks are taking long time

2015-01-15 Thread Ajay Srivastava
Thanks RK. I can turn on speculative execution, but I am trying to find out the actual reason for the delay, as it happens on any node. Any idea about the stack trace in my previous mail? Regards, Ajay On Thursday, January 15, 2015 8:02 PM, RK wrote: If you don't want a few slow tasks to slow

Re: Inserting an element in RDD[String]

2015-01-15 Thread Prannoy
Hi, You can take the schema line in another rdd and then do a union of the two rdds. List schemaList = new ArrayList; schemaList.add("xyz"); // where xyz is your schema line JavaRDD schemaRDD = sc.parallelize(schemaList); // where sc is your sparkcontext JavaRDD newRDD = schemaRDD.union(yourRD

Re: Some tasks are taking long time

2015-01-15 Thread RK
If you don't want a few slow tasks to slow down the entire job, you can turn on speculation. Here are the speculation settings from the Spark Configuration page of the Spark 1.2.0 documentation (Spark Properties).
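A minimal sketch of those settings with their Spark 1.2 property names (speculation itself defaults to false):

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .set("spark.speculation", "true")
      .set("spark.speculation.interval", "100")    // ms between checks for slow tasks
      .set("spark.speculation.quantile", "0.75")   // fraction of tasks that must finish before checking
      .set("spark.speculation.multiplier", "1.5")  // how many times slower than the median counts as slow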

Re: Inserting an element in RDD[String]

2015-01-15 Thread Aniket Bhatnagar
Sure there is. Create a new RDD just containing the schema line (hint: use sc.parallelize) and then union both the RDDs (the header RDD and data RDD) to get a final desired RDD. On Thu Jan 15 2015 at 19:48:52 Hafiz Mujadid wrote: > hi experts! > > I hav an RDD[String] and i want to add schema li
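A minimal sketch of that hint, assuming the sc from spark-shell; the schema line and path are made up:

    val data = sc.textFile("/tmp/data.csv")
    val header = sc.parallelize(Seq("id,name,age"))   // RDD holding just the schema line
    val withHeader = header.union(data)               // header partitions come first, so the line lands in part-00000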

Re: How to define SparkContext with Cassandra connection for spark-jobserver?

2015-01-15 Thread abhishek
In the spark job server *bin* folder, you will find the *application.conf* file; put context-settings { spark.cassandra.connection.host = } Hope this should work -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/How-to-define-SparkContext-with-

Inserting an element in RDD[String]

2015-01-15 Thread Hafiz Mujadid
hi experts! I have an RDD[String] and I want to add a schema line at the beginning of this rdd. I know RDDs are immutable. So is there any way to have a new rdd with one schema line and the contents of the previous rdd? Thanks -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.co

Re: Failing jobs runs twice

2015-01-15 Thread Anders Arpteg
Found a setting that seems to fix this problem, but it does not seem to be available until Spark 1.3. See https://issues.apache.org/jira/browse/SPARK-2165 However, glad to see work is being done on the issue. On Tue, Jan 13, 2015 at 8:00 PM, Anders Arpteg wrote: > Yes Andrew, I am. Tried s

Re: saveAsTextFile

2015-01-15 Thread Prannoy
Hi, Before saving the rdd, do a collect on the rdd and print its content. Probably it's a null value. Thanks. On Sat, Jan 3, 2015 at 5:37 PM, Pankaj Narang [via Apache Spark User List] < ml-node+s1001560n20953...@n3.nabble.com> wrote: > If you can paste the code here I can certainly he

Some tasks are taking long time

2015-01-15 Thread Ajay Srivastava
Hi, My spark job is taking a long time. I see that some tasks are taking a longer time for the same amount of data and shuffle read/write. What could be the possible reasons for it? The thread dump sometimes shows that all the tasks in an executor are waiting with the following stack trace - "Executor task

How to query Spark master for cluster status ?

2015-01-15 Thread Shing Hing Man
Hi, I am using Spark 1.2. The Spark master UI has a status. Is there a web service on the Spark master that returns the status of the cluster in JSON? Alternatively, what is the best way to determine if a cluster is up? Thanks in advance for your assistance! Shing

Re: Using Spark SQL with multiple (avro) files

2015-01-15 Thread David Jones
I've tried this now. Spark can load multiple avro files from the same directory by passing a path to a directory. However, passing multiple paths separated with commas didn't work. Is there any way to load all avro files in multiple directories using sqlContext.avroFile? On Wed, Jan 14, 2015 at

Re: MissingRequirementError with spark

2015-01-15 Thread sarsol
Added fork := true in the Scala build. Command-line sbt is working fine but the Eclipse Scala IDE is still giving the same error. This was all working fine until Spark 1.1. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/MissingRequirementError-with-spark-tp21149p211

Using native blas with mllib

2015-01-15 Thread lev
Hi, I'm trying to use native BLAS, and I followed all the threads I saw here, and I still can't get rid of these warnings: WARN netlib.BLAS: Failed to load implementation from: com.github.fommil.netlib.NativeSystemBLAS WARN netlib.BLAS: Failed to load implementation from: com.github.fommil.netlib

Spark-Sql not using cluster slaves

2015-01-15 Thread ahmedabdelrahman
I have been using Spark SQL in cluster mode and I am noticing no distribution and parallelization of the query execution. The performance seems to be very slow compared to native Spark applications and does not offer any speedup when compared to Hive. I am using Spark 1.1.0 with a cluster of 5 node

Re: kinesis creating stream scala code exception

2015-01-15 Thread Aniket Bhatnagar
Are you using spark in standalone mode, yarn, or mesos? If it's yarn, please mention the hadoop distribution and version. What spark distribution are you using (it seems to be 1.2.0, but compiled with which hadoop version)? Thanks, Aniket On Thu, Jan 15, 2015, 4:59 PM Hafiz Mujadid wrote: > Hi, Exper

PySpark saveAsTextFile gzip

2015-01-15 Thread Tom Seddon
Hi, I've searched but can't seem to find a PySpark example. How do I write compressed text file output to S3 using PySpark saveAsTextFile? Thanks, Tom

kinesis creating stream scala code exception

2015-01-15 Thread Hafiz Mujadid
Hi experts, I want to consume data from a kinesis stream using spark streaming. I am trying to create the kinesis stream using scala code. Here is my code def main(args: Array[String]) { println("Stream creation started") if(create(2)) println("Stream is created successfully

Re: MissingRequirementError with spark

2015-01-15 Thread Pierre B
I found this, which might be useful: https://github.com/deanwampler/spark-workshop/blob/master/project/Build.scala It seems that forking is needed. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/MissingRequirementError-with-spark-tp21149p21153.html Sent fr

spark fault tolerance mechanism

2015-01-15 Thread YANG Fan
Hi, I'm quite interested in how Spark's fault tolerance works and I'd like to ask a question here. According to the paper, there are two kinds of dependencies--the wide dependency and the narrow dependency. My understanding is, if the operations I use are all "narrow", then when one machine crash
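A small sketch of the two dependency kinds, assuming the sc from spark-shell:

    import org.apache.spark.SparkContext._   // pair RDD functions (needed outside the shell in 1.2)

    val pairs = sc.parallelize(1 to 1000).map(i => (i % 10, i))   // map: narrow dependency, a lost partition is recomputed from one parent partition
    val grouped = pairs.groupByKey()                              // groupByKey: wide dependency, recovery may need the shuffle from many parents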

Re: MissingRequirementError with spark 1.2

2015-01-15 Thread sarsol
I am also getting the same error after the 1.2 upgrade. The application is crashing on this line: rdd.registerTempTable("temp") -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/MissingRequirementError-with-spark-tp21149p21152.html Sent from the Apache Spark User Lis

Re: Serializability: for vs. while loops

2015-01-15 Thread Tobias Pfeiffer
Aaron, thanks for your mail! On Thu, Jan 15, 2015 at 5:05 PM, Aaron Davidson wrote: > Scala for-loops are implemented as closures using anonymous inner classes > [...] > While loops, on the other hand, involve none of this trickery, and > everyone is happy. > Ah, I was suspecting something lik

Re: *ByKey aggregations: performance + order

2015-01-15 Thread Sean Owen
I'm interested too and don't know for sure, but I do not think this case is optimized this way. However, if you know your keys aren't split across partitions and you have small enough partitions, you can implement the same grouping with mapPartitions and Scala. On Jan 15, 2015 1:27 AM, "Tobias Pfeiffe

Re: ScalaReflectionException when using saveAsParquetFile in sbt

2015-01-15 Thread Pierre B
Same problem here... Did you find a solution for this? P. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/ScalaReflectionException-when-using-saveAsParquetFile-in-sbt-tp21020p21150.html Sent from the Apache Spark User List mailing list archive at Nabble.com.

MissingRequirementError with spark

2015-01-15 Thread Pierre B
After upgrading our project to Spark 1.2.0, we get this error when doing a "sbt test": scala.reflect.internal.MissingRequirementError: class org.apache.spark.sql.catalyst.ScalaReflection The strange thing is that when running our test suites from IntelliJ, everything runs smoothly... Any idea w

Re: Fast HashSets & HashMaps - Spark Collection Utils

2015-01-15 Thread Sean Owen
A recent discussion says these won't be public. However there are many optimized collection libs in Java. My favorite is Koloboke: https://github.com/OpenHFT/Koloboke/wiki/Koloboke:-roll-the-collection-implementation-with-features-you-need Carrot HPPC is good too. The only catch is that the librari

Error connecting to localhost:8060: java.net.ConnectException: Connection refused

2015-01-15 Thread Gautam Bajaj
Hi, I'm new to Apache Storm. I'm receiving data at my UDP port 8060, I want to capture it and perform some operations in the real time, for which I'm using Spark Streaming. While the code seems to be correct, I get the following output: https://gist.github.com/d34th4ck3r/0e88896eac864d6d7193 I'm

Visualize Spark Job

2015-01-15 Thread Kuromatsu, Nobuyuki
Hi I want to visualize tasks and stages in order to analyze spark jobs. I know the necessary metrics are written in spark.eventLog.dir. Does anyone know of a tool like swimlanes in Tez? Regards, Nobuyuki Kuromatsu - To unsubscribe,

Re: RowMatrix multiplication

2015-01-15 Thread Toni Verbeiren
You can always define an RDD transpose function yourself. This is what I use in PySpark to transpose an RDD of numpy vectors. It’s not optimal and the vectors need to fit in memory on the worker nodes. def rddTranspose(rdd): # add an index to the rows and the columns, result in triplet da

Re: Serializability: for vs. while loops

2015-01-15 Thread Aaron Davidson
Scala for-loops are implemented as closures using anonymous inner classes which are instantiated once and invoked many times. This means, though, that the code inside the loop is actually sitting inside a class, which confuses Spark's Closure Cleaner, whose job is to remove unused references from c