Re: Idempotent count

2015-03-18 Thread Arush Kharbanda
Hi Binh, It stores the state as well as the unprocessed data. It is a subset of the records that you aggregated so far. This provides a good reference for checkpointing. http://spark.apache.org/docs/1.2.1/streaming-programming-guide.html#checkpointing On Wed, Mar 18, 2015 at 12:52 PM, Binh
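
For context, a minimal sketch of enabling streaming checkpointing (the application name and directory are placeholders):

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val conf = new SparkConf().setAppName("checkpoint-example")
    val ssc = new StreamingContext(conf, Seconds(10))
    // the checkpoint directory holds both metadata and the generated state RDDs
    ssc.checkpoint("hdfs:///tmp/spark-checkpoints")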

Re: updateStateByKey performance API

2015-03-18 Thread Akhil Das
You can always throw more machines at this and see if the performance is increasing, since you haven't mentioned anything regarding your # of cores etc. Thanks Best Regards On Wed, Mar 18, 2015 at 11:42 AM, nvrs nvior...@gmail.com wrote: Hi all, We are having a few issues with the performance

Re: updateStateByKey performance API

2015-03-18 Thread Nikos Viorres
Hi Akhil, Yes, that's what we are planning on doing at the end of the day. At the moment I am doing performance testing before the job hits production, testing on 4 cores to get baseline figures, and deduced that in order to grow to 10 - 15 million keys we'll need a batch interval of ~20 secs

Apache Spark ALS recommendations approach

2015-03-18 Thread Aram Mkrtchyan
Trying to build recommendation system using Spark MLLib's ALS. Currently, we're trying to pre-build recommendations for all users on daily basis. We're using simple implicit feedbacks and ALS. The problem is, we have 20M users and 30M products, and to call the main predict() method, we need to

Spark SQL weird exception after upgrading from 1.1.1 to 1.2.x

2015-03-18 Thread Roberto Coluccio
Hi everybody, When trying to upgrade from Spark 1.1.1 to Spark 1.2.x (tried both 1.2.0 and 1.2.1) I encounter a weird error never occurred before about which I'd kindly ask for any possible help. In particular, all my Spark SQL queries fail with the following exception:

Re: Spark Job History Server

2015-03-18 Thread Akhil Das
You can simply turn it on using: ./sbin/start-history-server.sh Read more here http://spark.apache.org/docs/1.3.0/monitoring.html. Thanks Best Regards On Wed, Mar 18, 2015 at 4:00 PM, patcharee patcharee.thong...@uni.no wrote: Hi, I am using spark 1.3. I would like to use Spark Job

Re: 1.3 release

2015-03-18 Thread Sean Owen
I don't think this is the problem, but I think you'd also want to set -Dhadoop.version= to match your deployment version, if you're building for a particular version, just to be safe-est. I don't recall seeing that particular error before. It indicates to me that the SparkContext is null. Is this

Re: Spark + Kafka

2015-03-18 Thread Jeffrey Jedele
Probably 1.3.0 - it has some improvements in the included Kafka receiver for streaming. https://spark.apache.org/releases/spark-release-1-3-0.html Regards, Jeff 2015-03-18 10:38 GMT+01:00 James King jakwebin...@gmail.com: Hi All, Which build of Spark is best when using Kafka? Regards jk

Re: [spark-streaming] can shuffle write to disk be disabled?

2015-03-18 Thread Darren Hoo
Thanks, Shao On Wed, Mar 18, 2015 at 3:34 PM, Shao, Saisai saisai.s...@intel.com wrote: Yeah, as I said your job processing time is much larger than the sliding window, and streaming job is executed one by one in sequence, so the next job will wait until the first job is finished, so the

Re: Spark Job History Server

2015-03-18 Thread patcharee
I turned it on. But it failed to start. In the log, Spark assembly has been built with Hive, including Datanucleus jars on classpath Spark Command: /usr/lib/jvm/java-1.7.0-openjdk.x86_64/bin/java -cp

Re: Using Spark with a SOCKS proxy

2015-03-18 Thread Akhil Das
Did you try ssh tunneling instead of SOCKS? Thanks Best Regards On Wed, Mar 18, 2015 at 5:45 AM, Kelly, Jonathan jonat...@amazon.com wrote: I'm trying to figure out how I might be able to use Spark with a SOCKS proxy. That is, my dream is to be able to write code in my IDE then run it

Re: Spark + Kafka

2015-03-18 Thread James King
Thanks Jeff, I'm planning to use it in standalone mode, OK, will use the hadoop 2.4 package. Ciao! On Wed, Mar 18, 2015 at 10:56 AM, Jeffrey Jedele jeffrey.jed...@gmail.com wrote: What you call sub-category are packages pre-built to run on certain Hadoop environments. It really depends on where

Re: HIVE SparkSQL

2015-03-18 Thread 宫勐
Hi: I need to count some game player events in the game, such as: how many players stay in game scene 1 (Save the Princess from a Dragon); money they have paid in the last 5 min; how many players pay money to go through this scene much more

Re: Spark SQL weird exception after upgrading from 1.1.1 to 1.2.x

2015-03-18 Thread Cheng Lian
Would you mind to provide the query? If it's confidential, could you please help constructing a query that reproduces this issue? Cheng On 3/18/15 6:03 PM, Roberto Coluccio wrote: Hi everybody, When trying to upgrade from Spark 1.1.1 to Spark 1.2.x (tried both 1.2.0 and 1.2.1) I encounter a

Re: GraphX: Get edges for a vertex

2015-03-18 Thread Jeffrey Jedele
Hi Mas, I never actually worked with GraphX, but one idea: As far as I know, you can directly access the vertex and edge RDDs of your Graph object. Why not simply run a .filter() on the edge RDD to get all edges that originate from or end at your vertex? Regards, Jeff 2015-03-18 10:52 GMT+01:00
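
A minimal sketch of that filter, assuming graph: Graph[VD, ED] and a target vertex id vid:

    // keep only edges incident to the vertex of interest
    val incident = graph.edges.filter(e => e.srcId == vid || e.dstId == vid)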

Re: [spark-streaming] can shuffle write to disk be disabled?

2015-03-18 Thread Akhil Das
I think you can disable it with spark.shuffle.spill=false Thanks Best Regards On Wed, Mar 18, 2015 at 3:39 PM, Darren Hoo darren@gmail.com wrote: Thanks, Shao On Wed, Mar 18, 2015 at 3:34 PM, Shao, Saisai saisai.s...@intel.com wrote: Yeah, as I said your job processing time is much

Re: Spark + Kafka

2015-03-18 Thread Jeffrey Jedele
What you call sub-category are packages pre-built to run on certain Hadoop environments. It really depends on where you want to run Spark. As far as I know, this is mainly about the included HDFS binding - so if you just want to play around with Spark, any of the packages should be fine. I

Re: [spark-streaming] can shuffle write to disk be disabled?

2015-03-18 Thread Darren Hoo
I've already done that: From the SparkUI Environment tab, Spark properties has: spark.shuffle.spill = false On Wed, Mar 18, 2015 at 6:34 PM, Akhil Das ak...@sigmoidanalytics.com wrote: I think you can disable it with spark.shuffle.spill=false Thanks Best Regards On Wed, Mar 18, 2015 at 3:39 PM,

Re: Spark Job History Server

2015-03-18 Thread Akhil Das
You don't have the yarn package in the classpath. You need to build your Spark with yarn. You can read these docs. http://spark.apache.org/docs/1.3.0/running-on-yarn.html Thanks Best Regards On Wed, Mar 18, 2015 at 4:07 PM, patcharee patcharee.thong...@uni.no wrote: I turned it on. But it

Spark Job History Server

2015-03-18 Thread patcharee
Hi, I am using spark 1.3. I would like to use Spark Job History Server. I added the following lines into conf/spark-defaults.conf: spark.yarn.services org.apache.spark.deploy.yarn.history.YarnHistoryService spark.history.provider org.apache.spark.deploy.yarn.history.YarnHistoryProvider

Re: Apache Spark ALS recommendations approach

2015-03-18 Thread gen tang
Hi, If you do a cartesian join to predict users' preference over all the products, I think that 8 nodes with 64GB RAM would not be enough for the data. Recently, I used ALS for a similar situation, with just 10M users and 0.1M products; the minimum requirement was 9 nodes with 10GB RAM. Moreover,

Re: Apache Spark ALS recommendations approach

2015-03-18 Thread Aram Mkrtchyan
Thanks much for your reply. By saying "on the fly", you mean caching the trained model, and querying it for each user joined with 30M products when needed? Our question is more about the general approach: what if we have 7M DAU? How do companies deal with that using Spark? On Wed, Mar 18, 2015

Integration of Spark1.2.0 cdh4 with Jetty 9.2.10

2015-03-18 Thread sayantini
Hi all, We are using spark-assembly-1.2.0-hadoop 2.0.0-mr1-cdh4.2.0.jar in our application. When we try to deploy the application on Jetty (jetty-distribution-9.2.10.v20150310) we get the below exception at the server startup. Initially we were getting the below exception, Caused by:

Re: Spark Job History Server

2015-03-18 Thread patcharee
Hi, My spark was compiled with the yarn profile, and I can run spark on yarn without problem. For the spark job history server problem, I checked spark-assembly-1.3.0-hadoop2.4.0.jar and found that the package org.apache.spark.deploy.yarn.history is missing. I don't know why. BR, Patcharee

sparksql native jdbc driver

2015-03-18 Thread sequoiadb
hey guys, In my understanding SparkSQL only supports JDBC connection through hive thrift server, is this correct? Thanks

RE: [spark-streaming] can shuffle write to disk be disabled?

2015-03-18 Thread Shao, Saisai
From the log you pasted I think this (-rw-r--r-- 1 root root 80K Mar 18 16:54 shuffle_47_519_0.data) is not shuffle spilled data, but the final shuffle result. As I said, do you think shuffle is the bottleneck that makes your job run slowly? Maybe you should identify the cause at first.

DataFrame operation on parquet: GC overhead limit exceeded

2015-03-18 Thread Yiannis Gkoufas
Hi there, I was trying the new DataFrame API with some basic operations on a parquet dataset. I have 7 nodes of 12 cores and 8GB RAM allocated to each worker in a standalone cluster mode. The code is the following: val people = sqlContext.parquetFile("/data.parquet"); val res =

Re: sparksql native jdbc driver

2015-03-18 Thread Arush Kharbanda
Yes, I have been using Spark SQL from the onset. I haven't found any other server for Spark SQL JDBC connectivity. On Wed, Mar 18, 2015 at 5:50 PM, sequoiadb mailing-list-r...@sequoiadb.com wrote: hey guys, In my understanding SparkSQL only supports JDBC connection through hive thrift

Re: Apache Spark ALS recommendations approach

2015-03-18 Thread Sean Owen
Not just the join, but this means you're trying to compute 600 trillion dot products. It will never finish fast. Basically: don't do this :) You don't in general compute all recommendations for all users, but recompute for a small subset of users that were or are likely to be active soon. (Or
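
A hedged sketch of scoring only a subset, assuming a trained model: MatrixFactorizationModel plus activeUserIds: RDD[Int] and productIds: RDD[Int] (all placeholder names):

    import org.apache.spark.mllib.recommendation.Rating

    // candidate (user, product) pairs for active users only,
    // instead of the full 20M x 30M cross product
    val candidates = activeUserIds.cartesian(productIds)
    val scores = model.predict(candidates) // RDD[Rating]
    val top10 = scores
      .map(r => (r.user, (r.product, r.rating)))
      .groupByKey()
      .mapValues(_.toSeq.sortBy(-_._2).take(10))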

Apache Spark ALS recommendations approach

2015-03-18 Thread Aram
Hi all, Trying to build recommendation system using Spark MLLib's ALS. Currently, we're trying to pre-build recommendations for all users on daily basis. We're using simple implicit feedbacks and ALS. The problem is, we have 20M users and 30M products, and to call the main predict() method, we

Re: [spark-streaming] can shuffle write to disk be disabled?

2015-03-18 Thread Darren Hoo
On Wed, Mar 18, 2015 at 8:31 PM, Shao, Saisai saisai.s...@intel.com wrote: From the log you pasted I think this (-rw-r--r-- 1 root root 80K Mar 18 16:54 shuffle_47_519_0.data) is not shuffle spilled data, but the final shuffle result. Why is the shuffle result written to disk? As I

srcAttr in graph.triplets don't update when the size of graph is huge

2015-03-18 Thread 张林(林岳)
When the size of the graph is huge (0.2 billion vertices, 6 billion edges), the srcAttr and dstAttr in graph.triplets don't update when using Graph.outerJoinVertices (when the data in the vertices is changed). The code and the log are as follows: g = graph.outerJoinVertices()... g.vertices.count()

Re: Spark Job History Server

2015-03-18 Thread Marcelo Vanzin
Those classes are not part of standard Spark. You may want to contact Hortonworks directly if they're suggesting you use those. On Wed, Mar 18, 2015 at 3:30 AM, patcharee patcharee.thong...@uni.no wrote: Hi, I am using spark 1.3. I would like to use Spark Job History Server. I added the

Re: Apache Spark ALS recommendations approach

2015-03-18 Thread Sean Owen
I don't think that you need to put the whole joined data set in memory. However, memory is unlikely to be the limiting factor; it's the massive shuffle. OK, you really do have a large recommendation problem if you're recommending for at least 7M users per day! My hunch is that it won't be

Re: Difference among batchDuration, windowDuration, slideDuration

2015-03-18 Thread jaredtims
I think hsy541 is still confused by what is still confusing to me. Namely, what is the value that the sentence "Each RDD in a DStream contains data from a certain interval" is speaking of? This is from the Discretized Streams
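
A small sketch that may help pin the three durations down (conf and the socket source are placeholders):

    import org.apache.spark.streaming.{Seconds, StreamingContext}

    // batchDuration: each RDD in the DStream covers 2 seconds of input
    val ssc = new StreamingContext(conf, Seconds(2))
    val lines = ssc.socketTextStream("localhost", 9999)
    // windowDuration = 10s, slideDuration = 4s: every 4 seconds a windowed RDD
    // is produced covering the last 10 seconds of data (both multiples of the batch)
    val windowed = lines.window(Seconds(10), Seconds(4))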

Re: DataFrame operation on parquet: GC overhead limit exceeded

2015-03-18 Thread Cheng Lian
You should probably increase executor memory by setting spark.executor.memory. Full list of available configurations can be found here http://spark.apache.org/docs/latest/configuration.html Cheng On 3/18/15 9:15 PM, Yiannis Gkoufas wrote: Hi there, I was trying the new DataFrame API with
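
For instance, a sketch of setting it programmatically, with an illustrative value:

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("parquet-df")
      .set("spark.executor.memory", "8g") // illustrative; must be set before the context starts
    val sc = new SparkContext(conf)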

Re: Spark SQL weird exception after upgrading from 1.1.1 to 1.2.x

2015-03-18 Thread Cheng Lian
I suspect that you hit this bug https://issues.apache.org/jira/browse/SPARK-6250; it depends on the actual contents of your query. Yin has opened a PR for this; although not merged yet, it should be a valid fix: https://github.com/apache/spark/pull/5078 This fix will be included in 1.3.1.

Re: Apache Spark ALS recommendations approach

2015-03-18 Thread Debasish Das
There is also a batch prediction API in PR https://github.com/apache/spark/pull/3098 The idea here is what Sean said... don't try to reconstruct the whole matrix, which will be dense, but pick a set of users and calculate topk recommendations for them using dense level-3 BLAS. We are going to merge

Re: sparksql native jdbc driver

2015-03-18 Thread Cheng Lian
Yes On 3/18/15 8:20 PM, sequoiadb wrote: hey guys, In my understanding SparkSQL only supports JDBC connection through hive thrift server, is this correct? Thanks

Re: Apache Spark ALS recommendations approach

2015-03-18 Thread Aram Mkrtchyan
Thanks gen for the helpful post. Thank you Sean, we're currently exploring this world of recommendations with Spark, and your posts are very helpful to us. We've noticed that you're a co-author of Advanced Analytics with Spark; just not to get too deep into offtopic, will it be finished soon? On Wed,

Column Similarity using DIMSUM

2015-03-18 Thread Manish Gupta 8
Hi, I am running Column Similarity (All Pairs Similarity using DIMSUM) in Spark on a dataset that looks like (Entity, Attribute, Value) after transforming the same to a row-oriented dense matrix format (one line per Attribute, one column per Entity, each cell with normalized value – between 0

Re: Spark SQL weird exception after upgrading from 1.1.1 to 1.2.x

2015-03-18 Thread Roberto Coluccio
You know, I actually have one of the columns called timestamp! This may really cause the problem reported in the bug you linked, I guess. On Wed, Mar 18, 2015 at 3:37 PM, Cheng Lian lian.cs@gmail.com wrote: I suspect that you hit this bug https://issues.apache.org/jira/browse/SPARK-6250,

Re: Spark SQL weird exception after upgrading from 1.1.1 to 1.2.x

2015-03-18 Thread Roberto Coluccio
Hi Cheng, thanks for your reply. The query is something like: SELECT * FROM ( SELECT m.column1, IF (d.columnA IS NOT null, d.columnA, m.column2), ..., m.columnN FROM tableD d RIGHT OUTER JOIN tableM m on m.column2 = d.columnA WHERE m.column2 != "None" AND d.columnA != "" UNION ALL SELECT

Did DataFrames break basic SQLContext?

2015-03-18 Thread Justin Pihony
I started to play with 1.3.0 and found that there are a lot of breaking changes. Previously, I could do the following: case class Foo(x: Int) val rdd = sc.parallelize(List(Foo(1))) import sqlContext._ rdd.registerTempTable("foo") Now, I am not able to directly use my RDD object and

Re: DataFrame operation on parquet: GC overhead limit exceeded

2015-03-18 Thread Yiannis Gkoufas
Hi there, I set the executor memory to 8g but it didn't help On 18 March 2015 at 13:59, Cheng Lian lian.cs@gmail.com wrote: You should probably increase executor memory by setting spark.executor.memory. Full list of available configurations can be found here

Re: Spark SQL weird exception after upgrading from 1.1.1 to 1.2.x

2015-03-18 Thread Roberto Coluccio
Hey Cheng, thank you so much for your suggestion, the problem was actually a column/field called timestamp in one of the case classes!! Once I changed its name everything worked out fine again. Let me say it was kinda frustrating ... Roberto On Wed, Mar 18, 2015 at 4:07 PM, Roberto Coluccio

Re: RDD to InputStream

2015-03-18 Thread Ayoub
In case it would interest other people, here is what I came up with and it seems to work fine: case class RDDAsInputStream(private val rdd: RDD[String]) extends java.io.InputStream { var bytes = rdd.flatMap(_.getBytes("UTF-8")).toLocalIterator def read(): Int = { if(bytes.hasNext)

Re: StorageLevel: OFF_HEAP

2015-03-18 Thread Ranga
Thanks for the information. Will rebuild with 0.6.0 till the patch is merged. On Tue, Mar 17, 2015 at 7:24 PM, Ted Yu yuzhih...@gmail.com wrote: Ranga: Take a look at https://github.com/apache/spark/pull/4867 Cheers On Tue, Mar 17, 2015 at 6:08 PM, fightf...@163.com fightf...@163.com

Re: Did DataFrames break basic SQLContext?

2015-03-18 Thread Justin Pihony
It appears that the metastore_db problem is related to https://issues.apache.org/jira/browse/SPARK-4758. I had another shell open that was stuck. This is probably a bug, though? import sqlContext.implicits._ case class Foo(x: Int) val rdd = sc.parallelize(List(Foo(1))) rdd.toDF

Re: How to get the cached RDD

2015-03-18 Thread praveenbalaji
sc.getPersistentRDDs(0).asInstanceOf[RDD[Array[Double]]]
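
A sketch of what that one-liner does, assuming the cached RDD has id 0 and element type Array[Double]:

    import org.apache.spark.rdd.RDD

    val cached = sc.getPersistentRDDs // Map[Int, RDD[_]], keyed by RDD id
    // the cast recovers the element type lost in RDD[_]
    val recovered = cached(0).asInstanceOf[RDD[Array[Double]]]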

RE: saveAsTable fails to save RDD in Spark SQL 1.3.0

2015-03-18 Thread Shahdad Moradi
/user/hive/warehouse is an HDFS location. I've changed the mode for this location but I'm still having the same issue. hduser@hadoop01-VirtualBox:/opt/spark/bin$ hdfs dfs -chmod -R 777 /user/hive hduser@hadoop01-VirtualBox:/opt/spark/bin$ hdfs dfs -ls /user/hive/warehouse Found 1 items

mapPartitions - How Does it Works

2015-03-18 Thread ashish.usoni
I am trying to understand mapPartitions but I am still not sure how it works. In the below example it creates three partitions: val parallel = sc.parallelize(1 to 10, 3) and when we do below parallel.mapPartitions( x => List(x.next).iterator).collect it prints value Array[Int] = Array(1, 4,

RE: mapPartitions - How Does it Works

2015-03-18 Thread java8964
Here is what I think: mapPartitions is for a specialized map that is called only once for each partition. The entire content of the respective partition is available as a sequential stream of values via the input argument (Iterator[T]). The combined result iterators are automatically

Re: mapPartitions - How Does it Works

2015-03-18 Thread Ganelin, Ilya
Map partitions works as follows: 1) For each partition of your RDD, it provides an iterator over the values within that partition 2) You then define a function that operates on that iterator Thus if you do the following: val parallel = sc.parallelize(1 to 10, 3) parallel.mapPartitions( x =>
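
Putting those two steps together for the thread's example (the partition split shown assumes the default range partitioning of 1 to 10 into 3 partitions):

    val parallel = sc.parallelize(1 to 10, 3) // partitions: (1,2,3) (4,5,6) (7,8,9,10)
    // the function runs once per partition and receives an Iterator over its values
    val firsts = parallel.mapPartitions(iter => Iterator(iter.next())).collect()
    // firsts == Array(1, 4, 7)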

RE: Column Similarity using DIMSUM

2015-03-18 Thread Manish Gupta 8
Hi Reza, I have only tried thresholds in the range of 0 to 1. I was not aware that the threshold can be set above 1. Will try and update. Thank You - Manish From: Reza Zadeh [mailto:r...@databricks.com] Sent: Wednesday, March 18, 2015 10:55 PM To: Manish Gupta 8 Cc: user@spark.apache.org

Re: Spark SQL weird exception after upgrading from 1.1.1 to 1.2.x

2015-03-18 Thread Yin Huai
Hi Roberto, For now, if the timestamp is a top level column (not a field in a struct), you can use backticks to quote the column name, like `timestamp`. Thanks, Yin On Wed, Mar 18, 2015 at 12:10 PM, Roberto Coluccio roberto.coluc...@gmail.com wrote: Hey Cheng, thank you so much for your
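
A sketch of the suggested quoting, reusing the table and column names from this thread:

    // backticks keep the column from being parsed as the TIMESTAMP keyword
    val result = sqlContext.sql("SELECT `timestamp`, column1 FROM tableM")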

Database operations on executor nodes

2015-03-18 Thread Praveen Balaji
I was wondering what people generally do about doing database operations from executor nodes. I’m (at least for now) avoiding doing database updates from executor nodes to avoid proliferation of database connections on the cluster. The general pattern I adopt is to collect queries (or tuples)
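
A commonly used middle ground is foreachPartition, which bounds connections at one per partition rather than one per record; a minimal sketch, with jdbcUrl and writeRow as hypothetical placeholders:

    rdd.foreachPartition { rows =>
      // one connection per partition, opened on the executor
      val conn = java.sql.DriverManager.getConnection(jdbcUrl)
      try rows.foreach(row => writeRow(conn, row))
      finally conn.close()
    }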

Null pointer exception reading Parquet

2015-03-18 Thread sprookie
Hi All, I am using Spark version 1.2 running locally. When I try to read a parquet file I get the below exception; what might be the issue? Any help will be appreciated. This is the simplest operation/action on a parquet file. //code snippet// val sparkConf = new SparkConf().setAppName(

Re: Did DataFrames break basic SQLContext?

2015-03-18 Thread Nick Pentreath
To answer your first question - yes, 1.3.0 did break backward compatibility for the change from SchemaRDD -> DataFrame. SparkSQL was an alpha component so API-breaking changes could happen. It is no longer an alpha component as of 1.3.0 so this will not be the case in future. Adding toDF
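
A sketch of the 1.3.0 equivalent of the old snippet, using the names from the question:

    val sqlContext = new org.apache.spark.sql.SQLContext(sc)
    import sqlContext.implicits._

    case class Foo(x: Int)
    // toDF() replaces the old implicit SchemaRDD conversion
    val df = sc.parallelize(List(Foo(1))).toDF()
    df.registerTempTable("foo")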

Re: Spark + Kafka

2015-03-18 Thread Khanderao Kand Gmail
I have used various versions of Spark (1.0, 1.2.1) without any issues. Though I have not significantly used Kafka with 1.3.0, preliminary testing revealed no issues. - khanderao On Mar 18, 2015, at 2:38 AM, James King jakwebin...@gmail.com wrote: Hi All, Which build of Spark is

Re: 1.3 release

2015-03-18 Thread Eric Friedman
Sean, you are exactly right, as I learned from parsing your earlier reply more carefully -- sorry I didn't do this the first time. Setting hadoop.version was indeed the solution ./make-distribution.sh --tgz -Pyarn -Phadoop-2.4 -Phive -Phive-thriftserver -Dhadoop.version=2.5.0-cdh5.3.2 Thanks

topic modeling using LDA in MLLib

2015-03-18 Thread heszak
I'm coming from a Hadoop background but I'm totally new to Apache Spark. I'd like to do topic modeling using the LDA algorithm on some txt files. The example on the Spark website assumes that the input to the LDA is a file containing the word counts. I wonder if someone could help me figure out the
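
A hedged sketch of going from raw text files to the (docId, countVector) input that MLlib 1.3's LDA expects (path and K are placeholders; suitable for small vocabularies only, since the vocabulary is collected to the driver):

    import org.apache.spark.mllib.clustering.LDA
    import org.apache.spark.mllib.linalg.Vectors

    val docs = sc.textFile("hdfs:///docs/*.txt")
      .map(_.toLowerCase.split("\\s+").filter(_.nonEmpty).toSeq)
    // build a vocabulary index, then turn each document into sparse term counts
    val vocab = docs.flatMap(identity).distinct().collect().zipWithIndex.toMap
    val corpus = docs.zipWithIndex().map { case (tokens, docId) =>
      val counts = tokens.groupBy(vocab).mapValues(_.size.toDouble).toSeq
      (docId, Vectors.sparse(vocab.size, counts))
    }
    val model = new LDA().setK(10).run(corpus)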

Re: DataFrame operation on parquet: GC overhead limit exceeded

2015-03-18 Thread Yiannis Gkoufas
Hi Yin, Thanks for your feedback. I have 1700 parquet files, sized 100MB each. The number of tasks launched is equal to the number of parquet files. Do you have any idea on how to deal with this situation? Thanks a lot On 18 Mar 2015 17:35, Yin Huai yh...@databricks.com wrote: Seems there are

Re: Spark + HBase + Kerberos

2015-03-18 Thread Eric Walk
Hi Ted, The spark executors and hbase regions/masters are all collocated. This is a 2 node test environment. Best, Eric Eric Walk, Sr. Technical Consultant p: 617.855.9255 | NASDAQ: PRFT | Perficient.comhttp://www.perficient.com/ From: Ted Yu yuzhih...@gmail.com Sent: Mar 18, 2015

Re: mapPartitions - How Does it Works

2015-03-18 Thread Alex Turner (TMS)
List(x.next).iterator is giving you the first element from each partition, which would be 1, 4 and 7 respectively. On 3/18/15, 10:19 AM, ashish.usoni ashish.us...@gmail.com wrote: I am trying to understand about mapPartitions but i am still not sure how it works in the below example it create

Re: [spark-streaming] can shuffle write to disk be disabled?

2015-03-18 Thread Darren Hoo
Hi, Saisai Here is the duration of one of the jobs, 22 seconds in total; it is longer than the sliding window.
Stage Id | Description | Submitted | Duration | Tasks: Succeeded/Total | Input | Output | Shuffle Read | Shuffle Write
342 | foreach at SimpleApp.scala:58 | 2015/03/18

Re: Idempotent count

2015-03-18 Thread Binh Nguyen Van
Hi Arush, Thank you for answering! When you say checkpoints hold metadata and data, what is the data? Is it the data that is pulled from the input source or is it the state? If it is state, then is it the same number of records that I aggregated since the beginning or only a subset of it? How can I limit

RE: [spark-streaming] can shuffle write to disk be disabled?

2015-03-18 Thread Shao, Saisai
Yeah, as I said your job processing time is much larger than the sliding window, and streaming job is executed one by one in sequence, so the next job will wait until the first job is finished, so the total latency will be accumulated. I think you need to identify the bottleneck of your job at

RE: [SQL] Elasticsearch-hadoop, exception creating temporary table

2015-03-18 Thread Cheng, Hao
Seems the elasticsearch-hadoop project was built with an old version of Spark, and then you upgraded the Spark version in the execution env; as far as I know, StructField's definition changed in Spark 1.2. Can you confirm the version problem first? From: Todd Nist [mailto:tsind...@gmail.com] Sent:

saving or visualizing PCA

2015-03-18 Thread roni
Hi, I am generating PCA using Spark, but I don't know how to save it to disk or visualize it. Can someone give me some pointers please. Thanks -Roni

RE: saveAsTable fails to save RDD in Spark SQL 1.3.0

2015-03-18 Thread Shahdad Moradi
Sun, Just want to confirm that it was in fact an authentication issue. The issue is resolved now and I can see my tables through Simba ODBC driver. Thanks a lot. Shahdad From: fightf...@163.com [mailto:fightf...@163.com] Sent: March-17-15 6:33 PM To: Shahdad Moradi; user Subject: Re:

Re: [SQL] Elasticsearch-hadoop, exception creating temporary table

2015-03-18 Thread Todd Nist
Thanks for the quick response. The spark server is spark-1.2.1-bin-hadoop2.4 from the Spark download. Here is the startup: radtech$ ./sbin/start-master.sh starting org.apache.spark.deploy.master.Master, logging to

Apache Spark User List: people's responses not showing in the browser view

2015-03-18 Thread dgoldenberg
Sorry if this is a total noob question but is there a reason why I'm only seeing folks' responses to my posts in emails but not in the browser view under apache-spark-user-list.1001560.n3.nabble.com? Is this a matter of setting your preferences such that your responses only go to email and never

RE: [spark-streaming] can shuffle write to disk be disabled?

2015-03-18 Thread Shao, Saisai
Please see the inline comments. Thanks Jerry From: Darren Hoo [mailto:darren@gmail.com] Sent: Wednesday, March 18, 2015 9:30 PM To: Shao, Saisai Cc: user@spark.apache.org; Akhil Das Subject: Re: [spark-streaming] can shuffle write to disk be disabled? On Wed, Mar 18, 2015 at 8:31 PM,

[SQL] Elasticsearch-hadoop, exception creating temporary table

2015-03-18 Thread Todd Nist
I am attempting to access ElasticSearch and expose its data through SparkSQL using the elasticsearch-hadoop project. I am encountering the following exception when trying to create a Temporary table from a resource in ElasticSearch: 15/03/18 07:54:46 INFO DAGScheduler: Job 2 finished: runJob

Re: mapPartitions - How Does it Works

2015-03-18 Thread Sabarish Sasidharan
Unlike a map() wherein your task is acting on a row at a time, with mapPartitions(), the task is passed the entire content of the partition in an iterator. You can then return back another iterator as the output. I don't do scala, but from what I understand from your code snippet... The iterator

iPython Notebook + Spark + Accumulo -- best practice?

2015-03-18 Thread davidh
hi all, I've been DDGing, Stack Overflowing, Twittering, RTFMing, and scanning through this archive with only moderate success. in other words -- my way of saying sorry if this is answered somewhere obvious and I missed it :-) i've been tasked with figuring out how to connect Notebook, Spark,

Re: iPython Notebook + Spark + Accumulo -- best practice?

2015-03-18 Thread Irfan Ahmad
Hi David, W00t indeed and great questions. On the notebook front, there are two options depending on what you are looking for. You can either go with iPython 3 with Spark-kernel as a backend or you can use spark-notebook. Both have interesting tradeoffs. If you are looking for a single notebook

SparkSQL 1.3.0 JDBC data source issues

2015-03-18 Thread Pei-Lun Lee
Hi, I am trying the jdbc data source in spark sql 1.3.0 and found some issues. First, the syntax where str_col='value' will give an error for both postgresql and mysql: psql create table foo(id int primary key,name text,age int); bash SPARK_CLASSPATH=postgresql-9.4-1201-jdbc41.jar

Re: MEMORY_ONLY vs MEMORY_AND_DISK

2015-03-18 Thread Prannoy
It depends. If the data size on which the calculation is to be done is very large, then caching it with MEMORY_AND_DISK is useful. Even in this case MEMORY_AND_DISK is useful if the computation on the RDD is expensive. If the computation is very small, then even for large data sets MEMORY_ONLY can be
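
A sketch of the trade-off in code, with input and costlyTransform as placeholders:

    import org.apache.spark.storage.StorageLevel

    // if recomputing the RDD costs more than re-reading evicted blocks from
    // local disk, MEMORY_AND_DISK avoids repeating the expensive lineage
    val cached = input.map(costlyTransform).persist(StorageLevel.MEMORY_AND_DISK)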

Re: HDP 2.2 AM abort : Unable to find ExecutorLauncher class

2015-03-18 Thread Bharath Ravi Kumar
Thanks for clarifying Todd. This may then be an issue specific to the HDP version we're using. Will continue to debug and post back if there's any resolution. On Thu, Mar 19, 2015 at 3:40 AM, Todd Nist tsind...@gmail.com wrote: Yes I believe you are correct. For the build you may need to

Re: Spark + HBase + Kerberos

2015-03-18 Thread Ted Yu
Are hbase config / keytab files deployed on executor machines ? Consider adding -Dsun.security.krb5.debug=true for debug purpose. Cheers On Wed, Mar 18, 2015 at 11:39 AM, Eric Walk eric.w...@perficient.com wrote: Having an issue connecting to HBase from a Spark container in a Secure

RDD pair to pair of RDDs

2015-03-18 Thread Alex Turner (TMS)
What's the best way to go from: RDD[(A, B)] to (RDD[A], RDD[B]) If I do: def separate[A, B](k: RDD[(A, B)]) = (k.map(_._1), k.map(_._2)) which is the obvious solution, it runs two maps in the cluster. Can I do some kind of a fold instead: def separate[A, B](l: List[(A, B)]) =
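
One common mitigation (still two jobs, but the parent is computed only once) is to cache before projecting; a sketch along the lines of the question's signature:

    import scala.reflect.ClassTag
    import org.apache.spark.rdd.RDD

    def separate[A: ClassTag, B: ClassTag](k: RDD[(A, B)]): (RDD[A], RDD[B]) = {
      val cached = k.cache() // parent is evaluated once, reused by both projections
      (cached.map(_._1), cached.map(_._2))
    }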

Re: Column Similarity using DIMSUM

2015-03-18 Thread Reza Zadeh
Hi Manish, Did you try calling columnSimilarities(threshold) with different threshold values? Try threshold values of 0.1, 0.5, 1, 20, and higher. Best, Reza On Wed, Mar 18, 2015 at 10:40 AM, Manish Gupta 8 mgupt...@sapient.com wrote: Hi, I am running Column Similarity (All Pairs
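
For reference, a sketch of both variants, assuming mat: RowMatrix:

    import org.apache.spark.mllib.linalg.distributed.RowMatrix

    val exact  = mat.columnSimilarities()    // brute force, all pairs
    // DIMSUM sampling: larger thresholds sample more aggressively and run
    // faster, trading accuracy on pairs whose similarity is below the threshold
    val approx = mat.columnSimilarities(0.5)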

[Spark SQL] Elasticsearch-hadoop - exception when creating Temporary table

2015-03-18 Thread Todd Nist
I am attempting to access ElasticSearch and expose its data through SparkSQL using the elasticsearch-hadoop project. I am encountering the following exception when trying to create a Temporary table from a resource in ElasticSearch: 15/03/18 07:54:46 INFO DAGScheduler: Job 2 finished: runJob

Using a different spark jars than the one on the cluster

2015-03-18 Thread jaykatukuri
Hi all, I am trying to run my job which needs spark-sql_2.11-1.3.0.jar. The cluster that I am running on is still on spark-1.2.0. I tried the following: spark-submit --class class-name --num-executors 100 --master yarn application_jar --jars hdfs:///path/spark-sql_2.11-1.3.0.jar

Spark Streaming S3 Performance Implications

2015-03-18 Thread Mike Trienis
Hi All, I am pushing data from a Kinesis stream to S3 using Spark Streaming and noticed that during testing (i.e. master=local[2]) the batches (1 second intervals) were falling behind the incoming data stream at about 5-10 events / second. It seems that the rdd.saveAsTextFile("s3n://...") is taking

Re: StorageLevel: OFF_HEAP

2015-03-18 Thread Ranga
I just tested with Spark-1.3.0 + Tachyon-0.6.0 and still see the same issue. Here are the logs: 15/03/18 11:44:11 ERROR : Invalid method name: 'getDataFolder' 15/03/18 11:44:11 ERROR : Invalid method name: 'user_getFileId' 15/03/18 11:44:11 ERROR storage.TachyonBlockManager: Failed 10 attempts to

Re: HDP 2.2 AM abort : Unable to find ExecutorLauncher class

2015-03-18 Thread Todd Nist
Yes I believe you are correct. For the build you may need to specify the specific HDP version of hadoop to use with the -Dhadoop.version=. I went with the default 2.6.0, but Horton may have a vendor specific version that needs to go here. I know I saw a similar post today where the solution

Re: Hanging tasks in spark 1.2.1 while working with 1.1.1

2015-03-18 Thread Eugen Cepoi
Hey Dimitriy, thanks for sharing your solution. I have some more updates. The problem comes out when shuffle is involved. Using coalesce with shuffle = true behaves like reduceByKey + a smaller number of partitions, except that the whole save stage hangs. I am not sure yet if it only happens with UnionRDD or

Spark and Morphlines, parallelization, multithreading

2015-03-18 Thread dgoldenberg
Still a Spark noob grappling with the concepts... I'm trying to grok the idea of integrating something like the Morphlines pipelining library with Spark (or SparkStreaming). The Kite/Morphlines doc states that runtime executes all commands of a given morphline in the same thread... there are no

Re: StorageLevel: OFF_HEAP

2015-03-18 Thread Haoyuan Li
Did you recompile it with Tachyon 0.6.0? Also, Tachyon 0.6.1 has been released this morning: http://tachyon-project.org/ ; https://github.com/amplab/tachyon/releases Best regards, Haoyuan On Wed, Mar 18, 2015 at 11:48 AM, Ranga sra...@gmail.com wrote: I just tested with Spark-1.3.0 +

Re: Using a different spark jars than the one on the cluster

2015-03-18 Thread Marcelo Vanzin
Since you're using YARN, you should be able to download a Spark 1.3.0 tarball from Spark's website and use spark-submit from that installation to launch your app against the YARN cluster. So effectively you would have 1.2.0 and 1.3.0 side-by-side in your cluster. On Wed, Mar 18, 2015 at 11:09

Re: StorageLevel: OFF_HEAP

2015-03-18 Thread Ranga
Thanks Ted. Will do. On Wed, Mar 18, 2015 at 2:27 PM, Ted Yu yuzhih...@gmail.com wrote: Ranga: Please apply the patch from: https://github.com/apache/spark/pull/4867 And rebuild Spark - the build would use Tachyon-0.6.1 Cheers On Wed, Mar 18, 2015 at 2:23 PM, Ranga sra...@gmail.com

RDD ordering after map

2015-03-18 Thread sergunok
Does map(...) preserve ordering of original RDD?

MEMORY_ONLY vs MEMORY_AND_DISK

2015-03-18 Thread sergunok
What persistence level is better if the RDD to be cached is heavy to recalculate? Am I right that it is MEMORY_AND_DISK?

Re: StorageLevel: OFF_HEAP

2015-03-18 Thread Ranga
Hi Haoyuan No. I assumed that Spark-1.3.0 was already built with Tachyon-0.6.0. If not, I can rebuild and try. Could you let me know how to rebuild with 0.6.0? Thanks for your help. - Ranga On Wed, Mar 18, 2015 at 12:59 PM, Haoyuan Li haoyuan...@gmail.com wrote: Did you recompile it with

Re: StorageLevel: OFF_HEAP

2015-03-18 Thread Ted Yu
Ranga: Please apply the patch from: https://github.com/apache/spark/pull/4867 And rebuild Spark - the build would use Tachyon-0.6.1 Cheers On Wed, Mar 18, 2015 at 2:23 PM, Ranga sra...@gmail.com wrote: Hi Haoyuan No. I assumed that Spark-1.3.0 was already built with Tachyon-0.6.0. If not,

Re: RDD ordering after map

2015-03-18 Thread Burak Yavuz
Hi, Yes, ordering is preserved with map. Shuffles break ordering. Burak On Wed, Mar 18, 2015 at 2:02 PM, sergunok ser...@gmail.com wrote: Does map(...) preserve ordering of original RDD? -- View this message in context:

Does newly-released LDA (Latent Dirichlet Allocation) algorithm supports ngrams?

2015-03-18 Thread heszak
I wonder whether the newly-released LDA (Latent Dirichlet Allocation) algorithm only supports uni-grams or also supports bi/tri-grams. If it does, can someone help me with how I can use them?
