RE: configure number of cached partition in memory on SparkSQL

2015-03-18 Thread Judy Nash
Thanks Cheng for replying. I meant to say: change the number of partitions of a cached table; it doesn't need to be re-adjusted after caching. To provide more context: what I am seeing on my dataset is that we have a large number of tasks. Since it appears each task is mapped to a partition, I want

Need some help on the Spark performance on Hadoop Yarn

2015-03-18 Thread Yi Ming Huang
Dear Spark experts, I would appreciate it if you could look into my problem and give me some help and suggestions here... Thank you! I have a simple Spark application to parse and analyze logs, and I can run it on my Hadoop YARN cluster. The problem is that it runs quite slowly on the cluster,

MLlib Spam example gets stuck in Stage X

2015-03-18 Thread Su She
Hello Everyone, I am trying to run this MLlib example from Learning Spark: https://github.com/databricks/learning-spark/blob/master/src/main/scala/com/oreilly/learningsparkexamples/scala/MLlib.scala#L48 Things I'm doing differently: 1) Using spark shell instead of an application 2) instead of t

Error while Insert data into hive table via spark

2015-03-18 Thread Dhimant
Hi, I have configured Apache Spark 1.3.0 with Hive 1.0.0 and Hadoop 2.6.0. I am able to create tables and retrieve data from Hive tables via the following commands, but not able to insert data into a table. scala> sqlContext.sql("CREATE TABLE IF NOT EXISTS newtable (key INT)"); scala> sqlContext.sql("select

Re: HDP 2.2 AM abort : Unable to find ExecutorLauncher class

2015-03-18 Thread Bharath Ravi Kumar
Thanks for clarifying Todd. This may then be an issue specific to the HDP version we're using. Will continue to debug and post back if there's any resolution. On Thu, Mar 19, 2015 at 3:40 AM, Todd Nist wrote: > Yes I believe you are correct. > > For the build you may need to specify the specific

SparkSQL 1.3.0 JDBC data source issues

2015-03-18 Thread Pei-Lun Lee
Hi, I am trying the jdbc data source in Spark SQL 1.3.0 and found some issues. First, the syntax "where str_col='value'" gives an error for both PostgreSQL and MySQL: psql> create table foo(id int primary key,name text,age int); bash> SPARK_CLASSPATH=postgresql-9.4-1201-jdbc41.jar spark/bin/spark-s

Re: MEMORY_ONLY vs MEMORY_AND_DISK

2015-03-18 Thread Prannoy
It depends. If the data size on which the calculation is to be done is very large, then caching it with MEMORY_AND_DISK is useful. Even in this case MEMORY_AND_DISK is useful if the computation on the RDD is expensive. If the computation is very small, then even for large data sets MEMORY_ONLY can be u
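A minimal sketch of the trade-off described above (input path and parsing are illustrative): with MEMORY_AND_DISK, partitions that don't fit in memory spill to local disk instead of being recomputed from the lineage.

```scala
import org.apache.spark.storage.StorageLevel

// Expensive-to-recompute RDD; the input path and parsing are placeholders.
val parsed = sc.textFile("hdfs:///logs/events")
  .map(_.split(","))

// Partitions that don't fit in memory are written to local disk rather than
// recomputed from the lineage on the next action.
parsed.persist(StorageLevel.MEMORY_AND_DISK)

parsed.count()  // first action materializes the cache
parsed.count()  // later actions reuse cached (or spilled) partitions
```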

Re: saving or visualizing PCA

2015-03-18 Thread Reza Zadeh
You can visualize PCA for example by val N = 2 val pc: Matrix = mat.computePrincipalComponents(N) // Principal components are stored in a local dense matrix. // Project the rows to the linear space spanned by the top N principal components. val projected: RowMatrix = mat.multiply(pc) Each row of
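A self-contained sketch of the projection Reza describes, with a tiny illustrative matrix and an output path added so the 2-D coordinates can be saved and plotted in an external tool:

```scala
import org.apache.spark.mllib.linalg.{Matrix, Vectors}
import org.apache.spark.mllib.linalg.distributed.RowMatrix

// Small illustrative input; in practice `rows` is your dataset as RDD[Vector].
val rows = sc.parallelize(Seq(
  Vectors.dense(1.0, 2.0, 3.0),
  Vectors.dense(4.0, 5.0, 6.0),
  Vectors.dense(7.0, 8.0, 9.0)))
val mat = new RowMatrix(rows)

val N = 2
val pc: Matrix = mat.computePrincipalComponents(N)  // principal components, local dense matrix
val projected: RowMatrix = mat.multiply(pc)          // rows projected onto the top N components

// Save the projected coordinates as CSV text for plotting externally.
projected.rows.map(_.toArray.mkString(",")).saveAsTextFile("pca_projection")
```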

Re: saving or visualizing PCA

2015-03-18 Thread Reza Zadeh
Also the guide on this is useful: http://spark.apache.org/docs/latest/mllib-dimensionality-reduction.html#principal-component-analysis-pca On Wed, Mar 18, 2015 at 11:46 PM, Reza Zadeh wrote: > You can visualize PCA for example by > > val N = 2 > val pc: Matrix = mat.computePrincipalComponents(N)

Re: iPython Notebook + Spark + Accumulo -- best practice?

2015-03-18 Thread Irfan Ahmad
I forgot to mention that there is also Zeppelin and jove-notebook but I haven't got any experience with those yet. *Irfan Ahmad* CTO | Co-Founder | *CloudPhysics* Best of VMworld Finalist Best Cloud Management Award NetworkWorld 10 Startups to Watch EMA Most Notable

Re: iPython Notebook + Spark + Accumulo -- best practice?

2015-03-18 Thread Irfan Ahmad
Hi David, W00t indeed and great questions. On the notebook front, there are two options depending on what you are looking for. You can either go with iPython 3 with Spark-kernel as a backend or you can use spark-notebook. Both have interesting tradeoffs. If you are looking for a single notebook

Re: mapPartitions - How Does it Works

2015-03-18 Thread Sabarish Sasidharan
Unlike a map() wherein your task is acting on a row at a time, with mapPartitions(), the task is passed the entire content of the partition in an iterator. You can then return back another iterator as the output. I don't do scala, but from what I understand from your code snippet... The iterator x
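A small sketch of the iterator-in / iterator-out contract described above, reusing the 3-partition example from this thread:

```scala
val parallel = sc.parallelize(1 to 10, 3)

// map: the function is applied to one element at a time.
val incremented = parallel.map(_ + 1)

// mapPartitions: the function receives an Iterator over the whole partition
// and must return another Iterator. Here each partition is reduced to its sum.
val partitionSums = parallel.mapPartitions(iter => Iterator(iter.sum))

partitionSums.collect()  // Array(6, 15, 34) for partitions (1,2,3), (4,5,6), (7,8,9,10)
```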

saving or visualizing PCA

2015-03-18 Thread roni
Hi, I am generating PCA using Spark, but I don't know how to save it to disk or visualize it. Can someone give me some pointers please? Thanks -Roni

Re: [SQL] Elasticsearch-hadoop, exception creating temporary table

2015-03-18 Thread Todd Nist
Thanks for the quick response. The spark server is spark-1.2.1-bin-hadoop2.4 from the Spark download. Here is the startup: radtech>$ ./sbin/start-master.sh starting org.apache.spark.deploy.master.Master, logging to /usr/local/spark-1.2.1-bin-hadoop2.4/sbin/../logs/spark-tnist-org.apache.spark.dep

RE: [spark-streaming] can shuffle write to disk be disabled?

2015-03-18 Thread Shao, Saisai
Please see the inline comments. Thanks Jerry From: Darren Hoo [mailto:darren@gmail.com] Sent: Wednesday, March 18, 2015 9:30 PM To: Shao, Saisai Cc: user@spark.apache.org; Akhil Das Subject: Re: [spark-streaming] can shuffle write to disk be disabled? On Wed, Mar 18, 2015 at 8:31 PM, Shao,

RE: saveAsTable fails to save RDD in Spark SQL 1.3.0

2015-03-18 Thread Shahdad Moradi
Sun, Just want to confirm that it was in fact an authentication issue. The issue is resolved now and I can see my tables through Simba ODBC driver. Thanks a lot. Shahdad From: fightf...@163.com [mailto:fightf...@163.com] Sent: March-17-15 6:33 PM To: Shahdad Moradi; user Subject: Re: saveAsTabl

iPython Notebook + Spark + Accumulo -- best practice?

2015-03-18 Thread davidh
hi all, I've been DDGing, Stack Overflowing, Twittering, RTFMing, and scanning through this archive with only moderate success. in other words -- my way of saying sorry if this is answered somewhere obvious and I missed it :-) i've been tasked with figuring out how to connect Notebook, Spark, and

RE: [SQL] Elasticsearch-hadoop, exception creating temporary table

2015-03-18 Thread Cheng, Hao
Seems the elasticsearch-hadoop project was built with an old version of Spark, and then you upgraded the Spark version in the execution env; as far as I know, the definition of StructField changed in Spark 1.2. Can you confirm the version problem first? From: Todd Nist [mailto:tsind...@gmail.com] Sent: Th

[SQL] Elasticsearch-hadoop, exception creating temporary table

2015-03-18 Thread Todd Nist
I am attempting to access ElasticSearch and expose its data through SparkSQL using the elasticsearch-hadoop project. I am encountering the following exception when trying to create a temporary table from a resource in ElasticSearch: 15/03/18 07:54:46 INFO DAGScheduler: Job 2 finished: runJob at

Re: Apache Spark User List: people's responses not showing in the browser view

2015-03-18 Thread Dmitry Goldenberg
Thanks, Ted. Well, so far even there I'm only seeing my post and not, for example, your response. On Wed, Mar 18, 2015 at 7:28 PM, Ted Yu wrote: > Was this one of the threads you participated ? > http://search-hadoop.com/m/JW1q5w0p8x1 > > You should be able to find your posts on search-hadoop.co

RE: topic modeling using LDA in MLLib

2015-03-18 Thread Daniel, Ronald (ELS-SDG)
Wordcount is a very common example so you can find that several places in Spark documentation and tutorials. Beware! They typically tokenize the text by splitting on whitespace. That will leave you with tokens that have trailing commas, periods, and other things. Also, you probably want to lowe
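A hedged sketch of the tokenization being recommended (the input path is illustrative): lower-case the text and split on non-letter characters so tokens don't keep trailing commas and periods, then count words as input for LDA.

```scala
// Illustrative input path; adjust to your corpus.
val docs = sc.textFile("docs/*.txt")

// Lower-case and split on runs of non-letters instead of bare whitespace,
// so "word," and "Word" both become the token "word".
val tokens = docs.flatMap(_.toLowerCase.split("[^a-z]+").filter(_.nonEmpty))

val wordCounts = tokens.map(w => (w, 1L)).reduceByKey(_ + _)
wordCounts.take(10).foreach(println)
```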

Re: Apache Spark User List: people's responses not showing in the browser view

2015-03-18 Thread Ted Yu
Was this one of the threads you participated in? http://search-hadoop.com/m/JW1q5w0p8x1 You should be able to find your posts on search-hadoop.co On Wed, Mar 18, 2015 at 3:21 PM, dgoldenberg wrote: > Sorry if this is a total noob question but is there a reason why I'm only > seeing folks' respon

Apache Spark User List: people's responses not showing in the browser view

2015-03-18 Thread dgoldenberg
Sorry if this is a total noob question but is there a reason why I'm only seeing folks' responses to my posts in emails but not in the browser view under apache-spark-user-list.1001560.n3.nabble.com? Is this a matter of setting your preferences such that your responses only go to email and never t

Spark and Morphlines, parallelization, multithreading

2015-03-18 Thread dgoldenberg
Still a Spark noob grappling with the concepts... I'm trying to grok the idea of integrating something like the Morphlines pipelining library with Spark (or SparkStreaming). The Kite/Morphlines doc states that "runtime executes all commands of a given morphline in the same thread... there are no

Re: Hanging tasks in spark 1.2.1 while working with 1.1.1

2015-03-18 Thread Eugen Cepoi
Hey Dimitriy, thanks for sharing your solution. I have some more updates. The problem shows up when shuffle is involved. Using coalesce with shuffle=true behaves like reduceByKey plus a smaller number of partitions, except that the whole save stage hangs. I am not sure yet if it only happens with UnionRDD or a

Re: HDP 2.2 AM abort : Unable to find ExecutorLauncher class

2015-03-18 Thread Todd Nist
Yes I believe you are correct. For the build you may need to specify the specific HDP version of hadoop to use with -Dhadoop.version=. I went with the default 2.6.0, but Hortonworks may have a vendor-specific version that needs to go here. I know I saw a similar post today where the solution

Does newly-released LDA (Latent Dirichlet Allocation) algorithm supports ngrams?

2015-03-18 Thread heszak
I wonder whether the newly-released LDA (Latent Dirichlet Allocation) algorithm only supports unigrams or whether it also supports bi-/tri-grams. If it does, can someone show me how to use them? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Does-n

Re: StorageLevel: OFF_HEAP

2015-03-18 Thread Ranga
Thanks Ted. Will do. On Wed, Mar 18, 2015 at 2:27 PM, Ted Yu wrote: > Ranga: > Please apply the patch from: > https://github.com/apache/spark/pull/4867 > > And rebuild Spark - the build would use Tachyon-0.6.1 > > Cheers > > On Wed, Mar 18, 2015 at 2:23 PM, Ranga wrote: > >> Hi Haoyuan >> >> No

Re: StorageLevel: OFF_HEAP

2015-03-18 Thread Ted Yu
Ranga: Please apply the patch from: https://github.com/apache/spark/pull/4867 And rebuild Spark - the build would use Tachyon-0.6.1 Cheers On Wed, Mar 18, 2015 at 2:23 PM, Ranga wrote: > Hi Haoyuan > > No. I assumed that Spark-1.3.0 was already built with Tachyon-0.6.0. If > not, I can rebuild

Re: RDD ordering after map

2015-03-18 Thread Burak Yavuz
Hi, Yes, ordering is preserved with map. Shuffles break ordering. Burak On Wed, Mar 18, 2015 at 2:02 PM, sergunok wrote: > Does map(...) preserve ordering of original RDD? > > > > -- > View this message in context: > http://apache-spark-user-list.1001560.n3.nabble.com/RDD-ordering-after-map-tp2

Re: StorageLevel: OFF_HEAP

2015-03-18 Thread Ranga
Hi Haoyuan No. I assumed that Spark-1.3.0 was already built with Tachyon-0.6.0. If not, I can rebuild and try. Could you let me know how to rebuild with 0.6.0? Thanks for your help. - Ranga On Wed, Mar 18, 2015 at 12:59 PM, Haoyuan Li wrote: > Did you recompile it with Tachyon 0.6.0? > > Also

MEMORY_ONLY vs MEMORY_AND_DISK

2015-03-18 Thread sergunok
Which persistence level is better if the RDD to be cached is expensive to recalculate? Am I right that it is MEMORY_AND_DISK? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/MEMORY-ONLY-vs-MEMORY-AND-DISK-tp22130.html Sent from the Apache Spark User List mailing lis

RDD ordering after map

2015-03-18 Thread sergunok
Does map(...) preserve ordering of original RDD? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/RDD-ordering-after-map-tp22129.html Sent from the Apache Spark User List mailing list archive at Nabble.com. ---

Re: DataFrame operation on parquet: GC overhead limit exceeded

2015-03-18 Thread Yiannis Gkoufas
Hi Yin, Thanks for your feedback. I have 1700 parquet files, sized 100MB each. The number of tasks launched is equal to the number of parquet files. Do you have any idea on how to deal with this situation? Thanks a lot On 18 Mar 2015 17:35, "Yin Huai" wrote: > Seems there are too many distinct

topic modeling using LDA in MLLib

2015-03-18 Thread heszak
I'm coming from a Hadoop background but I'm totally new to Apache Spark. I'd like to do topic modeling using the LDA algorithm on some txt files. The example on the Spark website assumes that the input to LDA is a file containing the word counts. I wonder if someone could help me figure out the

Re: 1.3 release

2015-03-18 Thread Eric Friedman
Sean, you are exactly right, as I learned from parsing your earlier reply more carefully -- sorry I didn't do this the first time. Setting hadoop.version was indeed the solution ./make-distribution.sh --tgz -Pyarn -Phadoop-2.4 -Phive -Phive-thriftserver -Dhadoop.version=2.5.0-cdh5.3.2 Thanks for

Re: Using a different spark jars than the one on the cluster

2015-03-18 Thread Marcelo Vanzin
Since you're using YARN, you should be able to download a Spark 1.3.0 tarball from Spark's website and use spark-submit from that installation to launch your app against the YARN cluster. So effectively you would have 1.2.0 and 1.3.0 side-by-side in your cluster. On Wed, Mar 18, 2015 at 11:09 AM,

Re: StorageLevel: OFF_HEAP

2015-03-18 Thread Haoyuan Li
Did you recompile it with Tachyon 0.6.0? Also, Tachyon 0.6.1 has been released this morning: http://tachyon-project.org/ ; https://github.com/amplab/tachyon/releases Best regards, Haoyuan On Wed, Mar 18, 2015 at 11:48 AM, Ranga wrote: > I just tested with Spark-1.3.0 + Tachyon-0.6.0 and still

Re: mapPartitions - How Does it Works

2015-03-18 Thread Alex Turner (TMS)
List(x.next).iterator is giving you the first element from each partition, which would be 1, 4 and 7 respectively. On 3/18/15, 10:19 AM, "ashish.usoni" wrote: >I am trying to understand about mapPartitions but i am still not sure how >it >works > >in the below example it create three partition >

Re: Spark + HBase + Kerberos

2015-03-18 Thread Eric Walk
Hi Ted, The spark executors and hbase regions/masters are all collocated. This is a 2 node test environment. Best, Eric Eric Walk, Sr. Technical Consultant p: 617.855.9255 | NASDAQ: PRFT | Perficient.com From: Ted Yu Sent: Mar 18, 2015 2:46 PM To: Eric Wa

RDD pair to pair of RDDs

2015-03-18 Thread Alex Turner (TMS)
What's the best way to go from: RDD[(A, B)] to (RDD[A], RDD[B]) If I do: def separate[A, B](k: RDD[(A, B)]) = (k.map(_._1), k.map(_._2)) Which is the obvious solution, this runs two maps in the cluster. Can I do some kind of a fold instead: def separate[A, B](l: List[(A, B)]) = l.foldLeft(Li
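One sketch of the two-map approach mentioned in the post, with the pair RDD cached first so the source is materialized once rather than recomputed for each projection (the helper name is made up here):

```scala
import scala.reflect.ClassTag
import org.apache.spark.rdd.RDD

// Hypothetical helper: splits a pair RDD into its two component RDDs.
def separate[A: ClassTag, B: ClassTag](pairs: RDD[(A, B)]): (RDD[A], RDD[B]) = {
  pairs.cache()                        // avoid recomputing the source for each map
  (pairs.map(_._1), pairs.map(_._2))   // still two jobs, but over cached partitions
}

// Usage: val (as, bs) = separate(sc.parallelize(Seq((1, "a"), (2, "b"))))
```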

Re: Spark + HBase + Kerberos

2015-03-18 Thread Ted Yu
Are hbase config / keytab files deployed on executor machines ? Consider adding -Dsun.security.krb5.debug=true for debug purpose. Cheers On Wed, Mar 18, 2015 at 11:39 AM, Eric Walk wrote: > Having an issue connecting to HBase from a Spark container in a Secure > Cluster. Haven’t been able to

Re: StorageLevel: OFF_HEAP

2015-03-18 Thread Ranga
I just tested with Spark-1.3.0 + Tachyon-0.6.0 and still see the same issue. Here are the logs: 15/03/18 11:44:11 ERROR : Invalid method name: 'getDataFolder' 15/03/18 11:44:11 ERROR : Invalid method name: 'user_getFileId' 15/03/18 11:44:11 ERROR storage.TachyonBlockManager: Failed 10 attempts to c

Spark Streaming S3 Performance Implications

2015-03-18 Thread Mike Trienis
Hi All, I am pushing data from Kinesis stream to S3 using Spark Streaming and noticed that during testing (i.e. master=local[2]) the batches (1 second intervals) were falling behind the incoming data stream at about 5-10 events / second. It seems that the rdd.saveAsTextFile(s3n://...) is taking at

Re: Spark + Kafka

2015-03-18 Thread Khanderao Kand Gmail
I have used various versions of Spark (1.0, 1.2.1) without any issues. Though I have not used Kafka with 1.3.0 significantly, preliminary testing revealed no issues. - khanderao > On Mar 18, 2015, at 2:38 AM, James King wrote: > > Hi All, > > Which build of Spark is best when using K

Using a different spark jars than the one on the cluster

2015-03-18 Thread jaykatukuri
Hi all, I am trying to run my job which needs spark-sql_2.11-1.3.0.jar. The cluster that I am running on is still on spark-1.2.0. I tried the following: spark-submit --class class-name --num-executors 100 --master yarn application_jar --jars hdfs:///path/spark-sql_2.11-1.3.0.jar hdfs:///input_d

RE: Column Similarity using DIMSUM

2015-03-18 Thread Manish Gupta 8
Hi Reza, I have tried threshold to be only in the range of 0 to 1. I was not aware that threshold can be set to above 1. Will try and update. Thank You - Manish From: Reza Zadeh [mailto:r...@databricks.com] Sent: Wednesday, March 18, 2015 10:55 PM To: Manish Gupta 8 Cc: user@spark.apache.org S

[Spark SQL] Elasticsearch-hadoop - exception when creating Temporary table

2015-03-18 Thread Todd Nist
I am attempting to access ElasticSearch and expose its data through SparkSQL using the elasticsearch-hadoop project. I am encountering the following exception when trying to create a temporary table from a resource in ElasticSearch: 15/03/18 07:54:46 INFO DAGScheduler: Job 2 finished: runJob at

Re: mapPartitions - How Does it Works

2015-03-18 Thread Ganelin, Ilya
Map partitions works as follows : 1) For each partition of your RDD, it provides an iterator over the values within that partition 2) You then define a function that operates on that iterator Thus if you do the following: val parallel = sc.parallelize(1 to 10, 3) parallel.mapPartitions( x => x.m

RE: mapPartitions - How Does it Works

2015-03-18 Thread java8964
Here is what I think: mapPartitions is for a specialized map that is called only once for each partition. The entire content of the respective partitions is available as a sequential stream of values via the input argument (Iterarator[T]). The combined result iterators are automatically converte

Re: DataFrame operation on parquet: GC overhead limit exceeded

2015-03-18 Thread Yin Huai
Seems there are too many distinct groups processed in a task, which trigger the problem. How many files do your dataset have and how large is a file? Seems your query will be executed with two stages, table scan and map-side aggregation in the first stage and the final round of reduce-side aggrega

Re: Spark SQL weird exception after upgrading from 1.1.1 to 1.2.x

2015-03-18 Thread Yin Huai
Hi Roberto, For now, if the "timestamp" is a top-level column (not a field in a struct), you can use backticks to quote the column name, like `timestamp`. Thanks, Yin On Wed, Mar 18, 2015 at 12:10 PM, Roberto Coluccio < roberto.coluc...@gmail.com> wrote: > Hey Cheng, thank you so much for
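A minimal sketch of the workaround, assuming a registered table with a top-level column named "timestamp" (table and column names are illustrative):

```scala
// Backticks keep the parser from treating "timestamp" as a keyword.
val result = sqlContext.sql(
  "SELECT `timestamp`, column1 FROM events WHERE `timestamp` IS NOT NULL")
result.collect().foreach(println)
```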

Re: Column Similarity using DIMSUM

2015-03-18 Thread Reza Zadeh
Hi Manish, Did you try calling columnSimilarities(threshold) with different threshold values? Try threshold values of 0.1, 0.5, 1, 20, and higher. Best, Reza On Wed, Mar 18, 2015 at 10:40 AM, Manish Gupta 8 wrote: > Hi, > > > > I am running Column Similarity (All Pairs Similarity using
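A small sketch of the comparison being suggested, assuming `mat` is an existing RowMatrix; a higher threshold prunes more low-similarity pairs, trading accuracy for speed:

```scala
import org.apache.spark.mllib.linalg.distributed.RowMatrix

// `mat: RowMatrix` is assumed to exist already (one column per Entity).
val exact  = mat.columnSimilarities()      // brute-force, threshold 0.0
val approx = mat.columnSimilarities(0.5)   // DIMSUM sampling prunes pairs below the threshold

// Both return a CoordinateMatrix of upper-triangular (i, j, similarity) entries.
approx.entries.take(5).foreach(println)
```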

Null pointer exception reading Parquet

2015-03-18 Thread sprookie
Hi All, I am using Spark version 1.2 running locally. When I try to read a parquet file I get the below exception; what might be the issue? Any help will be appreciated. This is the simplest operation/action on a parquet file. //code snippet// val sparkConf = new SparkConf().setAppName(" Test

Database operations on executor nodes

2015-03-18 Thread Praveen Balaji
I was wondering what people generally do about doing database operations from executor nodes. I’m (at least for now) avoiding doing database updates from executor nodes to avoid proliferation of database connections on the cluster. The general pattern I adopt is to collect queries (or tuples) on

Re: Did DataFrames break basic SQLContext?

2015-03-18 Thread Nick Pentreath
To answer your first question - yes 1.3.0 did break backward compatibility for the change from SchemaRDD -> DataFrame. SparkSQL was an alpha component so api breaking changes could happen. It is no longer an alpha component as of 1.3.0 so this will not be the case in future. Adding toDF shou
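A minimal sketch of the 1.3.0 pattern being described: the RDD of case classes must be converted explicitly with toDF before registering it as a table.

```scala
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.implicits._        // brings toDF into scope for RDDs of case classes

case class Foo(x: Int)
val rdd = sc.parallelize(List(Foo(1)))

val df = rdd.toDF()                  // explicit in 1.3.0; implicit conversion in 1.2.x
df.registerTempTable("foo")
sqlContext.sql("SELECT x FROM foo").collect().foreach(println)
```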

mapPartitions - How Does it Works

2015-03-18 Thread ashish.usoni
I am trying to understand mapPartitions but I am still not sure how it works. In the example below it creates three partitions: val parallel = sc.parallelize(1 to 10, 3) and when we do the below parallel.mapPartitions( x => List(x.next).iterator).collect it prints the value Array[Int] = Array(1, 4,

Re: StorageLevel: OFF_HEAP

2015-03-18 Thread Ranga
Thanks for the information. Will rebuild with 0.6.0 till the patch is merged. On Tue, Mar 17, 2015 at 7:24 PM, Ted Yu wrote: > Ranga: > Take a look at https://github.com/apache/spark/pull/4867 > > Cheers > > On Tue, Mar 17, 2015 at 6:08 PM, fightf...@163.com > wrote: > >> Hi, Ranga >> >> That's

Re: How to get the cached RDD

2015-03-18 Thread praveenbalaji
sc.getPersistentRDDs(0).asInstanceOf[RDD[Array[Double]]] -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/How-to-get-the-cached-RDD-tp22114p22122.html Sent from the Apache Spark User List mailing list archive at Nabble.com. --
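For context, a hedged sketch of what the one-liner relies on: getPersistentRDDs is keyed by RDD id, and the cast back to the concrete element type is the caller's responsibility.

```scala
import org.apache.spark.rdd.RDD

val cached = sc.getPersistentRDDs                 // Map[Int, RDD[_]] keyed by RDD id
val maybeRdd = cached.get(0)                      // Option[RDD[_]] for RDD id 0, if cached
val typed = maybeRdd.map(_.asInstanceOf[RDD[Array[Double]]])
```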

RE: saveAsTable fails to save RDD in Spark SQL 1.3.0

2015-03-18 Thread Shahdad Moradi
/user/hive/warehouse is a hdfs location. I’ve changed the mod for this location but I’m still having the same issue. hduser@hadoop01-VirtualBox:/opt/spark/bin$ hdfs dfs -chmod -R 777 /user/hive hduser@hadoop01-VirtualBox:/opt/spark/bin$ hdfs dfs -ls /user/hive/warehouse Found 1 items 15/03/1

Re: RDD to InputStream

2015-03-18 Thread Ayoub
In case it interests other people, here is what I came up with and it seems to work fine: case class RDDAsInputStream(private val rdd: RDD[String]) extends java.io.InputStream { var bytes = rdd.flatMap(_.getBytes("UTF-8")).toLocalIterator def read(): Int = { if(bytes.hasNext

Re: Spark SQL weird exception after upgrading from 1.1.1 to 1.2.x

2015-03-18 Thread Roberto Coluccio
Hey Cheng, thank you so much for your suggestion, the problem was actually a column/field called "timestamp" in one of the case classes!! Once I changed its name everything worked out fine again. Let me say it was kinda frustrating ... Roberto On Wed, Mar 18, 2015 at 4:07 PM, Roberto Coluccio < r

Re: Did DataFrames break basic SQLContext?

2015-03-18 Thread Justin Pihony
It appears that the metastore_db problem is related to https://issues.apache.org/jira/browse/SPARK-4758. I had another shell open that was stuck. This is probably a bug, though? import sqlContext.implicits case class Foo(x: Int) val rdd = sc.parallelize(List(Foo(1))) rdd.toDF resu

Re: DataFrame operation on parquet: GC overhead limit exceeded

2015-03-18 Thread Yiannis Gkoufas
Hi there, I set the executor memory to 8g but it didn't help On 18 March 2015 at 13:59, Cheng Lian wrote: > You should probably increase executor memory by setting > "spark.executor.memory". > > Full list of available configurations can be found here > http://spark.apache.org/docs/latest/configu

Did DataFrames break basic SQLContext?

2015-03-18 Thread Justin Pihony
I started to play with 1.3.0 and found that there are a lot of breaking changes. Previously, I could do the following: case class Foo(x: Int) val rdd = sc.parallelize(List(Foo(1))) import sqlContext._ rdd.registerTempTable("foo") Now, I am not able to directly use my RDD object an

Re: Spark SQL weird exception after upgrading from 1.1.1 to 1.2.x

2015-03-18 Thread Roberto Coluccio
You know, I actually have one of the columns called "timestamp" ! This may really cause the problem reported in the bug you linked, I guess. On Wed, Mar 18, 2015 at 3:37 PM, Cheng Lian wrote: > I suspect that you hit this bug > https://issues.apache.org/jira/browse/SPARK-6250, it depends on the

Re: Apache Spark ALS recommendations approach

2015-03-18 Thread Debasish Das
There is also a batch prediction API in PR https://github.com/apache/spark/pull/3098 The idea here is what Sean said... don't try to reconstruct the whole matrix, which will be dense, but pick a set of users and calculate top-k recommendations for them using dense level-3 BLAS. We are going to merge th

Column Similarity using DIMSUM

2015-03-18 Thread Manish Gupta 8
Hi, I am running Column Similarity (All Pairs Similarity using DIMSUM) in Spark on a dataset that looks like (Entity, Attribute, Value) after transforming the same to a row-oriented dense matrix format (one line per Attribute, one column per Entity, each cell with normalized value – between 0 a

Re: Spark SQL weird exception after upgrading from 1.1.1 to 1.2.x

2015-03-18 Thread Cheng Lian
I suspect that you hit this bug https://issues.apache.org/jira/browse/SPARK-6250, it depends on the actual contents of your query. Yin had opened a PR for this, although not merged yet, it should be a valid fix https://github.com/apache/spark/pull/5078 This fix will be included in 1.3.1. Ch

Re: Apache Spark ALS recommendations approach

2015-03-18 Thread Aram Mkrtchyan
Thanks gen for the helpful post. Thank you Sean, we're currently exploring this world of recommendations with Spark, and your posts are very helpful to us. We've noticed that you're a co-author of "Advanced Analytics with Spark"; just not to go too deep off topic, will it be finished soon? On Wed

Re: Difference among batchDuration, windowDuration, slideDuration

2015-03-18 Thread jaredtims
I think hsy541 is still confused by what is still confusing to me. Namely, what value is the sentence "Each RDD in a DStream contains data from a certain interval" referring to? This is from the Discretized Streams

Re: Spark SQL weird exception after upgrading from 1.1.1 to 1.2.x

2015-03-18 Thread Roberto Coluccio
Hi Cheng, thanks for your reply. The query is something like: SELECT * FROM ( > SELECT m.column1, IF (d.columnA IS NOT null, d.columnA, m.column2), ..., > m.columnN FROM tableD d RIGHT OUTER JOIN tableM m on m.column2 = d.columnA > WHERE m.column2!=\"None\" AND d.columnA!=\"\" > UNION ALL >

Re: DataFrame operation on parquet: GC overhead limit exceeded

2015-03-18 Thread Cheng Lian
You should probably increase executor memory by setting "spark.executor.memory". Full list of available configurations can be found here http://spark.apache.org/docs/latest/configuration.html Cheng On 3/18/15 9:15 PM, Yiannis Gkoufas wrote: Hi there, I was trying the new DataFrame API with
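A hedged sketch of setting that property programmatically (application name and value are illustrative); the same setting can equally be passed via --conf spark.executor.memory=8g or spark-defaults.conf.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Application name and memory value are illustrative.
val conf = new SparkConf()
  .setAppName("parquet-aggregation")
  .set("spark.executor.memory", "8g")   // per-executor heap size
val sc = new SparkContext(conf)
```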

Re: sparksql native jdbc driver

2015-03-18 Thread Cheng Lian
Yes On 3/18/15 8:20 PM, sequoiadb wrote: hey guys, In my understanding SparkSQL only supports JDBC connection through hive thrift server, is this correct? Thanks - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org Fo

Re: Apache Spark ALS recommendations approach

2015-03-18 Thread Sean Owen
I don't think that you need memory to put the whole joined data set in memory. However memory is unlikely to be the limiting factor, it's the massive shuffle. OK, you really do have a large recommendation problem if you're recommending for at least 7M users per day! My hunch is that it won't be f

Re: sparksql native jdbc driver

2015-03-18 Thread Arush Kharbanda
Yes, I have been using Spark SQL from the onset. Haven't found any other Server for Spark SQL for JDBC connectivity. On Wed, Mar 18, 2015 at 5:50 PM, sequoiadb wrote: > hey guys, > > In my understanding SparkSQL only supports JDBC connection through hive > thrift server, is this correct? > > Tha

Re: [spark-streaming] can shuffle write to disk be disabled?

2015-03-18 Thread Darren Hoo
On Wed, Mar 18, 2015 at 8:31 PM, Shao, Saisai wrote: > From the log you pasted I think this (-rw-r--r-- 1 root root 80K Mar > 18 16:54 shuffle_47_519_0.data) is not shuffle spilled data, but the > final shuffle result. > why the shuffle result is written to disk? > As I said, did you think

DataFrame operation on parquet: GC overhead limit exceeded

2015-03-18 Thread Yiannis Gkoufas
Hi there, I was trying the new DataFrame API with some basic operations on a parquet dataset. I have 7 nodes of 12 cores and 8GB RAM allocated to each worker in a standalone cluster mode. The code is the following: val people = sqlContext.parquetFile("/data.parquet"); val res = people.groupBy("na

Re: Apache Spark ALS recommendations approach

2015-03-18 Thread gen tang
Hi, If you do cartesian join to predict users' preference over all the products, I think that 8 nodes with 64GB ram would not be enough for the data. Recently, I used als for a similar situation, but just 10M users and 0.1M products, the minimum requirement is 9 nodes with 10GB RAM. Moreover, even

Re: Spark Job History Server

2015-03-18 Thread Marcelo Vanzin
Those classes are not part of standard Spark. You may want to contact Hortonworks directly if they're suggesting you use those. On Wed, Mar 18, 2015 at 3:30 AM, patcharee wrote: > Hi, > > I am using spark 1.3. I would like to use Spark Job History Server. I added > the following line into conf/sp

Re: Spark Job History Server

2015-03-18 Thread patcharee
Hi, My spark was compiled with yarn profile, I can run spark on yarn without problem. For the spark job history server problem, I checked spark-assembly-1.3.0-hadoop2.4.0.jar and found that the package org.apache.spark.deploy.yarn.history is missing. I don't know why BR, Patcharee On

srcAttr in graph.triplets don't update when the size of graph is huge

2015-03-18 Thread 张林(林岳)
When the size of the graph is huge (0.2 billion vertices, 6 billion edges), the srcAttr and dstAttr in graph.triplets don't update when using Graph.outerJoinVertices (when the vertex data is changed). The code and the log are as follows: g = graph.outerJoinVertices()... g.vertices.count() g.edg

RE: [spark-streaming] can shuffle write to disk be disabled?

2015-03-18 Thread Shao, Saisai
From the log you pasted I think this (-rw-r--r-- 1 root root 80K Mar 18 16:54 shuffle_47_519_0.data) is not shuffle spilled data, but the final shuffle result. As I said, did you think shuffle is the bottleneck which makes your job run slowly? Maybe you should identify the cause at fir

sparksql native jdbc driver

2015-03-18 Thread sequoiadb
hey guys, In my understanding SparkSQL only supports JDBC connection through hive thrift server, is this correct? Thanks - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spa

Re: Apache Spark ALS recommendations approach

2015-03-18 Thread Aram Mkrtchyan
Thanks much for your reply. By saying on the fly, you mean caching the trained model, and querying it for each user joined with 30M products when needed? Our question is more about the general approach, what if we have 7M DAU? How the companies deal with that using Spark? On Wed, Mar 18, 2015 a

Integration of Spark1.2.0 cdh4 with Jetty 9.2.10

2015-03-18 Thread sayantini
Hi all, We are using spark-assembly-1.2.0-hadoop 2.0.0-mr1-cdh4.2.0.jar in our application. When we try to deploy the application on Jetty (jetty-distribution-9.2.10.v20150310) we get the below exception at the server startup. Initially we were getting the below exception, Caused by: java.la

Re: Apache Spark ALS recommendations approach

2015-03-18 Thread Sean Owen
Not just the join, but this means you're trying to compute 600 trillion dot products. It will never finish fast. Basically: don't do this :) You don't in general compute all recommendations for all users, but recompute for a small subset of users that were or are likely to be active soon. (Or compu
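A rough sketch of the subset approach Sean suggests, assuming an already-trained model and RDDs of recently active users and candidate products (all names here are placeholders):

```scala
import org.apache.spark.mllib.recommendation.{MatrixFactorizationModel, Rating}
import org.apache.spark.rdd.RDD

// Score only (active user, candidate product) pairs instead of all 20M x 30M pairs.
def recommendForActive(model: MatrixFactorizationModel,
                       activeUsers: RDD[Int],
                       candidateProducts: RDD[Int]): RDD[Rating] = {
  val pairs = activeUsers.cartesian(candidateProducts)
  model.predict(pairs)
}
```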

Apache Spark ALS recommendations approach

2015-03-18 Thread Aram
Hi all, Trying to build recommendation system using Spark MLLib's ALS. Currently, we're trying to pre-build recommendations for all users on daily basis. We're using simple implicit feedbacks and ALS. The problem is, we have 20M users and 30M products, and to call the main predict() method, we n

Re: 1.3 release

2015-03-18 Thread Sean Owen
I don't think this is the problem, but I think you'd also want to set -Dhadoop.version= to match your deployment version, if you're building for a particular version, just to be safe-est. I don't recall seeing that particular error before. It indicates to me that the SparkContext is null. Is this

Apache Spark ALS recommendations approach

2015-03-18 Thread Aram Mkrtchyan
Trying to build recommendation system using Spark MLLib's ALS. Currently, we're trying to pre-build recommendations for all users on daily basis. We're using simple implicit feedbacks and ALS. The problem is, we have 20M users and 30M products, and to call the main predict() method, we need to ha

Re: Spark SQL weird exception after upgrading from 1.1.1 to 1.2.x

2015-03-18 Thread Cheng Lian
Would you mind to provide the query? If it's confidential, could you please help constructing a query that reproduces this issue? Cheng On 3/18/15 6:03 PM, Roberto Coluccio wrote: Hi everybody, When trying to upgrade from Spark 1.1.1 to Spark 1.2.x (tried both 1.2.0 and 1.2.1) I encounter a

Re: updateStateByKey performance & API

2015-03-18 Thread Nikos Viorres
Hi Akhil, Yes, that's what we are planning on doing at the end of the day. At the moment I am doing performance testing before the job hits production, testing on 4 cores to get baseline figures, and deduced that in order to grow to 10 - 15 million keys we'll need a batch interval of ~20 secs

Re: Using Spark with a SOCKS proxy

2015-03-18 Thread Akhil Das
Did you try ssh tunneling instead of SOCKS? Thanks Best Regards On Wed, Mar 18, 2015 at 5:45 AM, Kelly, Jonathan wrote: > I'm trying to figure out how I might be able to use Spark with a SOCKS > proxy. That is, my dream is to be able to write code in my IDE then run it > without much trouble

Re: Spark Job History Server

2015-03-18 Thread Akhil Das
You don't have the yarn package in the classpath. You need to build your Spark with yarn support. You can read these docs. Thanks Best Regards On Wed, Mar 18, 2015 at 4:07 PM, patcharee wrote: > I turned it on. But it failed to start. In the

Re: [spark-streaming] can shuffle write to disk be disabled?

2015-03-18 Thread Darren Hoo
I've already done that: From SparkUI Environment, Spark properties has: spark.shuffle.spill = false On Wed, Mar 18, 2015 at 6:34 PM, Akhil Das wrote: > I think you can disable it with spark.shuffle.spill=false > > Thanks > Best Regards > > On Wed, Mar 18, 2015 at 3:39 PM, Darren Hoo wrote: > >>

Re: updateStateByKey performance & API

2015-03-18 Thread Akhil Das
You can always throw more machines at this and see if the performance is increasing. Since you haven't mentioned anything regarding your # cores etc. Thanks Best Regards On Wed, Mar 18, 2015 at 11:42 AM, nvrs wrote: > Hi all, > > We are having a few issues with the performance of updateStateByK

Re: Spark Job History Server

2015-03-18 Thread patcharee
I turned it on. But it failed to start. In the log, Spark assembly has been built with Hive, including Datanucleus jars on classpath Spark Command: /usr/lib/jvm/java-1.7.0-openjdk.x86_64/bin/java -cp :/root/spark-1.3.0-bin-hadoop2.4/sbin/../conf:/root/spark-1.3.0-bin-hadoop2.4/lib/spark-assembl

Re: Spark Job History Server

2015-03-18 Thread Akhil Das
You can simply turn it on using: ./sbin/start-history-server.sh Read more here. Thanks Best Regards On Wed, Mar 18, 2015 at 4:00 PM, patcharee wrote: > Hi, > > I am using spark 1.3. I would like to use Spark Job History Server. I > adde

Re: [spark-streaming] can shuffle write to disk be disabled?

2015-03-18 Thread Akhil Das
I think you can disable it with spark.shuffle.spill=false Thanks Best Regards On Wed, Mar 18, 2015 at 3:39 PM, Darren Hoo wrote: > Thanks, Shao > > On Wed, Mar 18, 2015 at 3:34 PM, Shao, Saisai > wrote: > >> Yeah, as I said your job processing time is much larger than the >> sliding window, a

Spark Job History Server

2015-03-18 Thread patcharee
Hi, I am using spark 1.3. I would like to use Spark Job History Server. I added the following line into conf/spark-defaults.conf spark.yarn.services org.apache.spark.deploy.yarn.history.YarnHistoryService spark.history.provider org.apache.spark.deploy.yarn.history.YarnHistoryProvider spark.y
