Re: how to clean shuffle write each iteration

2015-03-03 Thread lisendong
In ALS, I guess each iteration's RDDs are referenced by the next iteration's RDD, so none of the shuffle data will be deleted until the ALS job finishes… I guess checkpointing could solve my problem; do you know about checkpointing? On Mar 3, 2015, at 4:18 PM, nitin [via Apache Spark User List]
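
A minimal sketch of the checkpointing idea (paths and parameters below are placeholders, not from the thread; a checkpoint directory lets Spark truncate the RDD lineage so shuffle files from earlier iterations become eligible for cleanup):

    import org.apache.spark.mllib.recommendation.{ALS, Rating}

    // Hypothetical checkpoint location; checkpointing cuts the lineage so the
    // cleaner can eventually drop shuffle files from earlier iterations.
    sc.setCheckpointDir("hdfs:///tmp/als-checkpoints")

    val ratings = sc.textFile("hdfs:///data/ratings.csv").map { line =>
      val Array(user, item, score) = line.split(',')
      Rating(user.toInt, item.toInt, score.toDouble)
    }
    val model = ALS.train(ratings, 10, 20, 0.01) // rank, iterations, lambda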

RE: Supporting Hive features in Spark SQL Thrift JDBC server

2015-03-03 Thread Cheng, Hao
Can you provide the detailed failure call stack? From: shahab [mailto:shahab.mok...@gmail.com] Sent: Tuesday, March 3, 2015 3:52 PM To: user@spark.apache.org Subject: Supporting Hive features in Spark SQL Thrift JDBC server Hi, According to Spark SQL documentation, Spark SQL supports the

Re: Exception while select into table.

2015-03-03 Thread LinQili
Hi Yi, Thanks for your reply. 1. The version of spark is 1.2.0 and the version of hive is 0.10.0-cdh4.2.1. 2. The full trace stack of the exception: 15/03/03 13:41:30 INFO Client: client token: DUrrav1rAADCnhQzX_Ic6CMnfqcW2NIxra5n8824CRFZQVJOX0NMSUVOVF9UT0tFTgA diagnostics: User

Re: how to clean shuffle write each iteration

2015-03-03 Thread nitin
Shuffle writes will be cleaned if they are not referenced by any object, directly or indirectly. There is a garbage collector inside Spark which periodically checks for weak references to RDDs, shuffle writes and broadcasts, and deletes them. -- View this message in context:

SparkSQL, executing an OR

2015-03-03 Thread Guillermo Ortiz
I'm trying to execute a query with Spark. (Example from the Spark documentation) val teenagers = people.where('age >= 10).where('age <= 19).select('name) Is it possible to execute an OR with this syntax? val teenagers = people.where('age >= 10 'or 'age <= 4).where('age <= 19).select('name) I have

Re: GraphX path traversal

2015-03-03 Thread Madabhattula Rajesh Kumar
Hi, Could you please let me know how to do this? (or) Any suggestion Regards, Rajesh On Mon, Mar 2, 2015 at 4:47 PM, Madabhattula Rajesh Kumar mrajaf...@gmail.com wrote: Hi, I have a below edge list. How to find the parents path for every vertex? Example : Vertex 1 path : 2, 3, 4, 5, 6

RE: SparkSQL, executing an OR

2015-03-03 Thread Cheng, Hao
Using where('age >= 10 || 'age <= 4) instead. -Original Message- From: Guillermo Ortiz [mailto:konstt2...@gmail.com] Sent: Tuesday, March 3, 2015 5:14 PM To: user Subject: SparkSQL, executing an OR I'm trying to execute a query with Spark. (Example from the Spark Documentation) val teenagers
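
For reference, a sketch of the OR in the 1.2-era Catalyst DSL, where && and || combine predicate expressions (column names follow the documentation example from the thread):

    // || builds a logical OR between the two comparison expressions.
    val matched = people.where('age >= 10 || 'age <= 4).select('name)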

RE: Supporting Hive features in Spark SQL Thrift JDBC server

2015-03-03 Thread Cheng, Hao
Hive UDFs are only applicable for HiveContext and its subclass instances; is the CassandraAwareSQLContext a direct subclass of HiveContext or SQLContext? From: shahab [mailto:shahab.mok...@gmail.com] Sent: Tuesday, March 3, 2015 5:10 PM To: Cheng, Hao Cc: user@spark.apache.org Subject: Re:

Re: Some questions after playing a little with the new ml.Pipeline.

2015-03-03 Thread Jaonary Rabarisoa
Here is my current implementation with the current master version of spark: class DeepCNNFeature extends Transformer with HasInputCol with HasOutputCol ... { override def transformSchema(...) { ... } override def transform(dataSet: DataFrame, paramMap: ParamMap): DataFrame = {

LATERAL VIEW explode requests the full schema

2015-03-03 Thread matthes
I use LATERAL VIEW explode(...) to read data from a parquet file, but the full schema is requested by parquet instead of just the columns I use. When I don't use LATERAL VIEW, the requested schema has just the two columns which I use. Is this correct, or is there room for an optimization, or do I

RE: insert Hive table with RDD

2015-03-03 Thread Cheng, Hao
Using the SchemaRDD / DataFrame API via HiveContext. Assume you're using the latest code; something probably like: val hc = new HiveContext(sc) import hc.implicits._ existedRdd.toDF().insertInto("hivetable") or existedRdd.toDF().registerTempTable("mydata") hc.sql("insert into hivetable as select xxx
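
A fuller, hedged sketch of the same approach (the table name and row type are hypothetical; toDF() assumes the 1.3-era API referenced above):

    import org.apache.spark.sql.hive.HiveContext

    case class Record(id: Int, name: String) // hypothetical row type

    val hc = new HiveContext(sc)
    import hc.implicits._ // enables rdd.toDF() for RDDs of case classes

    val existedRdd = sc.parallelize(Seq(Record(1, "a"), Record(2, "b")))
    existedRdd.toDF().insertInto("hivetable") // the table must already exist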

Re: Running Spark jobs via oozie

2015-03-03 Thread nitinkak001
I am also starting to work on this one. Did you get any solution to this issue? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Running-Spark-jobs-via-oozie-tp5187p21896.html Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: GraphX path traversal

2015-03-03 Thread Madabhattula Rajesh Kumar
Hi Robin, Thank you for your response. Please find my question below. I have the below edge file: Source Vertex, Destination Vertex: (1, 2), (2, 3), (3, 4), (4, 5), (5, 6), (6, 6). In this graph the 1st vertex is connected to the 2nd vertex, the 2nd vertex is connected to the 3rd vertex, ..., and the 6th vertex is connected to the 6th vertex (a self-loop).

Re: On app upgrade, restore sliding window data.

2015-03-03 Thread Matus Faro
Thank you Arush, I've implemented initial data for a windowed operation and opened a pull request here: https://github.com/apache/spark/pull/4875 On Tue, Feb 24, 2015 at 4:49 AM, Arush Kharbanda ar...@sigmoidanalytics.com wrote: I think this could be of some help to you.

insert Hive table with RDD

2015-03-03 Thread patcharee
Hi, How can I insert into an existing Hive table from an RDD containing my data? Any examples? Best, Patcharee - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail:

Re: gc time too long when using mllib als

2015-03-03 Thread Akhil Das
You need to increase the parallelism/repartition the data to a higher number to get rid of those. Thanks Best Regards On Tue, Mar 3, 2015 at 2:26 PM, lisendong lisend...@163.com wrote: why is the gc time so long? I'm using ALS in MLlib, and the garbage collection time is too long

Re: RDD partitions per executor in Cassandra Spark Connector

2015-03-03 Thread Pavel Velikhov
Hi, is there a paper or a document where one can read about how Spark reads Cassandra data in parallel? And how it writes data back from RDDs? It's a bit hard to have a clear picture in mind. Thank you, Pavel Velikhov On Mar 3, 2015, at 1:08 AM, Rumph, Frens Jan m...@frensjan.nl wrote: Hi all,

Re: One of the executor not getting StopExecutor message

2015-03-03 Thread twinkle sachdeva
Hi, Operations are not very extensive, as this scenario is not always reproducible. One of the executors starts behaving in this manner. For this particular application, we are using 8 cores in one executor, and practically, 4 executors are launched on one machine. This machine has a good config

delay between removing the block manager of an executor, and marking that as lost

2015-03-03 Thread twinkle sachdeva
Hi, Is there any relation between removing the block manager of an executor and marking that executor as lost? In my setup, even after removing the block manager (after failing to do some operation), it is taking more than 20 minutes to mark the executor as lost. Following are the logs: 15/03/03 10:26:49

Re: Supporting Hive features in Spark SQL Thrift JDBC server

2015-03-03 Thread Yin Huai
Regarding current_date, I think it is not in either Hive 0.12.0 or 0.13.1 (the versions that we support). Seems https://issues.apache.org/jira/browse/HIVE-5472 added it to Hive recently. On Tue, Mar 3, 2015 at 6:03 AM, Cheng, Hao hao.ch...@intel.com wrote: The temp table in metastore can not be

Re: org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Unresolved attributes: pyspark on yarn

2015-03-03 Thread Gustavo Enrique Salazar Torres
Hi Sam: Shouldn't you define the table schema? I had the same problem in Scala and then I solved it defining the schema. I did this: sqlContext.applySchema(dataRDD, tableSchema).registerTempTable(tableName) Hope it helps. On Mon, Jan 5, 2015 at 7:01 PM, Sam Flint sam.fl...@magnetic.com wrote:
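
A sketch of that schema-definition step, following the Spark 1.2 programming guide (the field names and the shape of dataRDD are hypothetical):

    import org.apache.spark.sql._

    val tableSchema = StructType(Seq(
      StructField("name", StringType, nullable = true),
      StructField("age", IntegerType, nullable = true)))

    // Assumes dataRDD is an RDD of (String, Int) pairs.
    val rowRDD = dataRDD.map { case (name, age) => Row(name, age) }
    sqlContext.applySchema(rowRDD, tableSchema).registerTempTable("tableName")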

Re: PRNG in Scala

2015-03-03 Thread Robin East
And this SO post goes into details on the PRNG in Java http://stackoverflow.com/questions/9907303/does-java-util-random-implementation-differ-between-jres-or-platforms On 3 Mar 2015, at 16:15, Robin East robin.e...@xense.co.uk wrote: This is more of a java/scala question than spark - it uses

Issue with yarn cluster - hangs in accepted state.

2015-03-03 Thread abhi
I am trying to run the below java class with yarn cluster, but it hangs in the ACCEPTED state. I don't see any error. Below is the class and command. Any help is appreciated. Thanks, Abhi bin/spark-submit --class com.mycompany.app.SimpleApp --master yarn-cluster /home/hduser/my-app-1.0.jar

PRNG in Scala

2015-03-03 Thread Vijayasarathy Kannan
Hi, What pseudo-random-number generator does scala.util.Random use?

Re: GraphX path traversal

2015-03-03 Thread Madabhattula Rajesh Kumar
Hi, I have tried the below program using the Pregel API, but I'm not able to get my required output. I'm getting exactly the reverse of the output I'm expecting. // Creating graph using the edge file mentioned in the mail above val graph: Graph[Int, Int] = GraphLoader.edgeListFile(sc,

Re: GraphX path traversal

2015-03-03 Thread Robin East
Have you tried EdgeDirection.In? On 3 Mar 2015, at 16:32, Robin East robin.e...@xense.co.uk wrote: What about the following, which can be run in the spark shell: import org.apache.spark._ import org.apache.spark.graphx._ import org.apache.spark.rdd.RDD val vertexlist = Array((1L,"One"),

Re: Supporting Hive features in Spark SQL Thrift JDBC server

2015-03-03 Thread Rohit Rai
Hello Shahab, I think CassandraAwareHiveContext https://github.com/tuplejump/calliope/blob/develop/sql/hive/src/main/scala/org/apache/spark/sql/hive/CassandraAwareHiveContext.scala in Calliope is what you are looking for. Create a CAHC instance and you should be able to run hive functions against

Resource manager UI for Spark applications

2015-03-03 Thread Rohini joshi
Hi, I have 2 questions: 1. I was trying to use the Resource Manager UI for my Spark application using yarn-cluster mode, as I observed that the Spark UI does not work for yarn-cluster. Is that correct or am I missing some setup?

Re: PRNG in Scala

2015-03-03 Thread Robin East
This is more of a java/scala question than a spark one: it uses java.util.Random: https://github.com/scala/scala/blob/2.11.x/src/library/scala/util/Random.scala On 3 Mar 2015, at 15:08, Vijayasarathy Kannan kvi...@vt.edu wrote: Hi, What pseudo-random-number generator does scala.util.Random

Re: GraphX path traversal

2015-03-03 Thread Robin East
What about the following, which can be run in the spark shell: import org.apache.spark._ import org.apache.spark.graphx._ import org.apache.spark.rdd.RDD val vertexlist = Array((1L,"One"), (2L,"Two"), (3L,"Three"), (4L,"Four"), (5L,"Five"), (6L,"Six")) val edgelist = Array(Edge(6,5,"6 to 5"), Edge(5,4,"5 to
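
A hedged sketch of the full parent-path computation with Pregel, using the edge list from this thread minus the 6-to-6 self-loop (a self-loop would keep regenerating messages with this sendMsg); each vertex accumulates its chain of ancestors:

    import org.apache.spark.graphx._
    import org.apache.spark.rdd.RDD

    val vertices: RDD[(VertexId, String)] = sc.parallelize(Array(
      (1L, "One"), (2L, "Two"), (3L, "Three"),
      (4L, "Four"), (5L, "Five"), (6L, "Six")))
    val edges: RDD[Edge[String]] = sc.parallelize(Array(
      Edge(1L, 2L, "1 to 2"), Edge(2L, 3L, "2 to 3"), Edge(3L, 4L, "3 to 4"),
      Edge(4L, 5L, "4 to 5"), Edge(5L, 6L, "5 to 6")))

    // Vertex state: the list of ancestors discovered so far.
    val init = Graph(vertices, edges).mapVertices((_, _) => List.empty[VertexId])

    val paths = init.pregel(List.empty[VertexId],
                            activeDirection = EdgeDirection.In)(
      // Keep the longer (more complete) ancestor path we are told about.
      (id, attr, msg) => if (msg.length > attr.length) msg else attr,
      // A child (src) learns its parent (dst) plus the parent's ancestors.
      triplet => {
        val msg = triplet.dstId :: triplet.dstAttr
        if (msg.length > triplet.srcAttr.length) Iterator((triplet.srcId, msg))
        else Iterator.empty
      },
      // One parent per vertex here, so just prefer the longer candidate.
      (a, b) => if (a.length >= b.length) a else b)

    paths.vertices.collect.sortBy(_._1).foreach { case (id, p) =>
      println(s"Vertex $id path: " + p.mkString(", ")) }

This yields "Vertex 1 path: 2, 3, 4, 5, 6" and so on, down to an empty path for the root vertex 6.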

Re: Supporting Hive features in Spark SQL Thrift JDBC server

2015-03-03 Thread shahab
@Cheng: My problem is that the connector I use to query Spark does not support the latest Hive (0.12, 0.13), but I need to perform Hive queries on data retrieved from Cassandra. I assumed that if I get data out of Cassandra in some way and register it as a temp table, I would be able to query it using

Re: Supporting Hive features in Spark SQL Thrift JDBC server

2015-03-03 Thread shahab
Thanks Rohit, I am already using Calliope and quite happy with it, well done! Except for the fact that: 1- It seems that it does not support Hive 0.12 or higher, am I right? For example you can not use the current_time() UDF, or those new UDFs added in Hive 0.12. Are they supported? Any plan for

Re: SparkSQL, executing an OR

2015-03-03 Thread Guillermo Ortiz
Thanks, it works. 2015-03-03 13:32 GMT+01:00 Cheng, Hao hao.ch...@intel.com: Using where('age >= 10 || 'age <= 4) instead. -Original Message- From: Guillermo Ortiz [mailto:konstt2...@gmail.com] Sent: Tuesday, March 3, 2015 5:14 PM To: user Subject: SparkSQL, executing an OR I'm trying

Re: Workaround for spark 1.2.X roaringbitmap kryo problem?

2015-03-03 Thread Imran Rashid
the scala syntax for arrays is Array[T], not T[], so you want to use something like: kryo.register(classOf[Array[org.roaringbitmap.RoaringArray$Element]]) kryo.register(classOf[Array[Short]]) Nonetheless, Spark should take care of this itself. I'll look into it later today. On Mon, Mar 2, 2015
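
Under those assumptions, a sketch of a complete workaround registrator (class names follow the thread; Class.forName is used because RoaringArray$Element is a nested Java class, and the "[L...;" form names its array class):

    import com.esotericsoftware.kryo.Kryo
    import org.apache.spark.serializer.KryoRegistrator

    class RoaringWorkaroundRegistrator extends KryoRegistrator {
      override def registerClasses(kryo: Kryo) {
        kryo.register(classOf[org.roaringbitmap.RoaringBitmap])
        kryo.register(classOf[org.roaringbitmap.RoaringArray])
        kryo.register(Class.forName("org.roaringbitmap.RoaringArray$Element"))
        kryo.register(Class.forName("[Lorg.roaringbitmap.RoaringArray$Element;"))
        kryo.register(classOf[Array[Short]])
      }
    }
    // Enabled via spark.kryo.registrator=RoaringWorkaroundRegistrator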

RE: java.lang.IncompatibleClassChangeError when using PrunedFilteredScan

2015-03-03 Thread Cheng, Hao
As the call stack shows, the MongoDB connector is not compatible with the Spark SQL data source interface. The latest Data Source API changed in 1.2; you probably need to confirm which Spark version the MongoDB connector was built against. By the way, a well-formatted call stack would be more

Re: Supporting Hive features in Spark SQL Thrift JDBC server

2015-03-03 Thread shahab
You are right, CassandraAwareSQLContext is a subclass of SQLContext. But I did another experiment: I queried Cassandra using CassandraAwareSQLContext, then I registered the RDD as a temp table; next I tried to query it using HiveContext, but it seems that the Hive context can not see the registered

spark.local.dir leads to Job cancelled because SparkContext was shut down

2015-03-03 Thread lisendong
As long as I set spark.local.dir to multiple disks, the job fails with the errors below (if I set spark.local.dir to only 1 dir, the job succeeds...): Exception in thread "main" org.apache.spark.SparkException: Job cancelled because SparkContext was shut down at

[no subject]

2015-03-03 Thread shahab
I did an experiment with Hive and SQL context: I queried Cassandra using CassandraAwareSQLContext (a custom SQL context from Calliope), then I registered the RDD as a temp table; next I tried to query it using HiveContext, but it seems that the Hive context can not see the registered table using

Re: Resource manager UI for Spark applications

2015-03-03 Thread Rohini joshi
Sorry for the half email; here it is again in full. Hi, I have 2 questions: 1. I was trying to use the Resource Manager UI for my Spark application using yarn-cluster mode, as I observed that the Spark UI does not work for yarn-cluster. Is that correct or am I missing some setup? 2. When I click on

Solve least square problem of the form min norm(A x - b)^2 + lambda * n * norm(x)^2 ?

2015-03-03 Thread Jaonary Rabarisoa
Dear all, Is there a least squares solver based on DistributedMatrix that we can use out of the box in the current (or the master) version of spark? It seems that the only least squares solver available in spark is private to the recommender package. Cheers, Jao

Re: Supporting Hive features in Spark SQL Thrift JDBC server

2015-03-03 Thread Rohit Rai
The Hive dependency comes from spark-hive. It does work with Spark 1.1; we will have the 1.2 release later this month. On Mar 3, 2015 8:49 AM, shahab shahab.mok...@gmail.com wrote: Thanks Rohit, I am already using Calliope and quite happy with it, well done! Except for the fact that: 1- It

Re: Spark Error: Cause was: akka.remote.InvalidAssociation: Invalid address: akka.tcp://sparkMaster@localhost:7077

2015-03-03 Thread Krishnanand Khambadkone
Hello Ted, some progress: now it seems that the spark job does get submitted; in the spark web UI, I do see this under finished drivers. However, it seems to not go past this step: JavaPairReceiverInputDStream<String, String> messages = KafkaUtils.createStream(jsc, "localhost:2181", "aa",

Re: Resource manager UI for Spark applications

2015-03-03 Thread Ted Yu
bq. spark UI does not work for Yarn-cluster. Can you be a bit more specific on the error(s) you saw? What Spark release are you using? Cheers On Tue, Mar 3, 2015 at 8:53 AM, Rohini joshi roni.epi...@gmail.com wrote: Sorry for the half email; here it is again in full. Hi, I have 2

[no subject]

2015-03-03 Thread Jianshi Huang
Hi, I got this error message: 15/03/03 10:22:41 ERROR OneForOneBlockFetcher: Failed while starting block fetches java.lang.RuntimeException: java.io.FileNotFoundException:

Re: Spark Monitoring UI for Hadoop Yarn Cluster

2015-03-03 Thread Todd Nist
Hi Srini, If you start the $SPARK_HOME/sbin/start-history-server, you should be able to see the basic spark ui. You will not see the master, but you will be able to see the rest as I recall. You also need to add an entry into the spark-defaults.conf, something like this: ## Make sure the host
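
A sketch of the corresponding spark-defaults.conf entries (host, port, and directory are placeholders):

    spark.eventLog.enabled           true
    spark.eventLog.dir               hdfs:///spark-history
    spark.history.fs.logDirectory    hdfs:///spark-history
    spark.yarn.historyServer.address historyhost:18080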

Re: Resource manager UI for Spark applications

2015-03-03 Thread Ted Yu
bq. changing the address with internal to the external one, but still does not work. Not sure what happened. For the time being, you can use the yarn command line to pull the container log (put in your appId and container Id): yarn logs -applicationId application_1386639398517_0007 -containerId

Why different numbers of partitions give different results for the same computation on the same dataset?

2015-03-03 Thread Saiph Kappa
Hi, I have a spark streaming application, running on a single node, consisting mainly of map operations. I perform repartitioning to control the number of CPU cores that I want to use. The code goes like this: val ssc = new StreamingContext(sparkConf, Seconds(5)) val distFile =

Re: Resource manager UI for Spark applications

2015-03-03 Thread roni
Hi Ted, I used s3://support.elasticmapreduce/spark/install-spark to install spark on my EMR cluster. It is 1.2.0. When I click on the link for history or logs it takes me to http://ip-172-31-43-116.us-west-2.compute.internal:9035/node/containerlogs/container_1424105590052_0070_01_01/hadoop

Re: Having lots of FetchFailedException in join

2015-03-03 Thread Aaron Davidson
Failed to connect implies that the executor at that host died; please check its logs as well. On Tue, Mar 3, 2015 at 11:03 AM, Jianshi Huang jianshi.hu...@gmail.com wrote: Sorry that I forgot the subject. And in the driver, I got many FetchFailedExceptions. The error messages are 15/03/03

Re: Spark Monitoring UI for Hadoop Yarn Cluster

2015-03-03 Thread Marcelo Vanzin
Spark applications shown in the RM's UI should have an Application Master link when they're running. That takes you to the Spark UI for that application where you can see all the information you're looking for. If you're running a history server and add spark.yarn.historyServer.address to your

Re: Issue using S3 bucket from Spark 1.2.1 with hadoop 2.4

2015-03-03 Thread Ted Yu
If you can use the hadoop 2.6.0 binary, you can use s3a. s3a is being polished in the upcoming 2.7.0 release: https://issues.apache.org/jira/browse/HADOOP-11571 Cheers On Tue, Mar 3, 2015 at 9:44 AM, Ankur Srivastava ankur.srivast...@gmail.com wrote: Hi, We recently upgraded to Spark 1.2.1 -
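
A sketch of reading through s3a on Hadoop 2.6.0 binaries (bucket and credential values are placeholders; the keys are the standard s3a ones):

    sc.hadoopConfiguration.set("fs.s3a.access.key", "YOUR_ACCESS_KEY")
    sc.hadoopConfiguration.set("fs.s3a.secret.key", "YOUR_SECRET_KEY")
    val lines = sc.textFile("s3a://my-bucket/path/to/data")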

Re: Issue using S3 bucket from Spark 1.2.1 with hadoop 2.4

2015-03-03 Thread Ankur Srivastava
Thanks a lot Ted!! On Tue, Mar 3, 2015 at 9:53 AM, Ted Yu yuzhih...@gmail.com wrote: If you can use hadoop 2.6.0 binary, you can use s3a s3a is being polished in the upcoming 2.7.0 release: https://issues.apache.org/jira/browse/HADOOP-11571 Cheers On Tue, Mar 3, 2015 at 9:44 AM, Ankur

Re: Solve least square problem of the form min norm(A x - b)^2 + lambda * n * norm(x)^2 ?

2015-03-03 Thread Shivaram Venkataraman
There are a couple of solvers that I've written that are part of the AMPLab ml-matrix repo [1,2]. These aren't part of MLlib yet, though, and if you are interested in porting them I'd be happy to review it. Thanks Shivaram [1]

Re: Having lots of FetchFailedException in join

2015-03-03 Thread Jianshi Huang
Sorry that I forgot the subject. And in the driver, I got many FetchFailedException. The error messages are 15/03/03 10:34:32 WARN TaskSetManager: Lost task 31.0 in stage 2.2 (TID 7943, ): FetchFailed(BlockManagerId(86, , 43070), shuffleId=0, mapId=24, reduceId=1220, message=

Re: Supporting Hive features in Spark SQL Thrift JDBC server

2015-03-03 Thread shahab
@Yin: sorry for my mistake, you are right, it was added in 1.2, not 0.12.0, my bad! On Tue, Mar 3, 2015 at 6:47 PM, shahab shahab.mok...@gmail.com wrote: Thanks Rohit, yes my mistake, it does work with 1.1 (I am actually running it on spark 1.1). But do you mean that even HiveConext of

UnsatisfiedLinkError related to libgfortran when running MLLIB code on RHEL 5.8

2015-03-03 Thread Prashant Sharma
Hi Folks, We are trying to run the following code from the spark shell in a CDH 5.3 cluster running on RHEL 5.8: spark-shell --master yarn --deploy-mode client --num-executors 15 --executor-cores 6 --executor-memory 12G import org.apache.spark.mllib.recommendation.ALS import

Re: org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Unresolved attributes: pyspark on yarn

2015-03-03 Thread Michael Armbrust
In Spark 1.2 you'll have to create a partitioned hive table https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-AddPartitions in order to read parquet data in this format. In Spark 1.3 the parquet data source will auto discover partitions when they are laid out

Re: LBGFS optimizer performace

2015-03-03 Thread Joseph Bradley
Is that error actually occurring in LBFGS? It looks like it might be happening before the data even gets to LBFGS. (Perhaps the outer join you're trying to do is making the dataset size explode a bit.) Are you able to call count() (or any RDD action) on the data before you pass it to LBFGS? On

Re: LATERAL VIEW explode requests the full schema

2015-03-03 Thread Michael Armbrust
I believe that this has been optimized https://github.com/apache/spark/commit/2a36292534a1e9f7a501e88f69bfc3a09fb62cb3 in Spark 1.3. On Tue, Mar 3, 2015 at 4:36 AM, matthes matthias.diekst...@web.de wrote: I use LATERAL VIEW explode(...) to read data from a parquet-file but the full schema is

Re: throughput in the web console?

2015-03-03 Thread Saiph Kappa
Sorry I made a mistake. Please ignore my question. On Tue, Mar 3, 2015 at 2:47 AM, Saiph Kappa saiph.ka...@gmail.com wrote: I performed repartitioning and everything went fine with respect to the number of CPU cores being used (and respective times). However, I noticed something very strange:

Re: Resource manager UI for Spark applications

2015-03-03 Thread roni
Ted, If the application is running then the logs are not available. Plus, what I want to view is the details about the running app, as in the Spark UI. Do I have to open some ports or make some other setting changes? On Tue, Mar 3, 2015 at 10:08 AM, Ted Yu yuzhih...@gmail.com wrote: bq. changing the

Re: Solve least square problem of the form min norm(A x - b)^2 + lambda * n * norm(x)^2 ?

2015-03-03 Thread Joseph Bradley
The minimization problem you're describing in the email title also looks like it could be solved using the RidgeRegression solver in MLlib, once you transform your DistributedMatrix into an RDD[LabeledPoint]. On Tue, Mar 3, 2015 at 11:02 AM, Shivaram Venkataraman shiva...@eecs.berkeley.edu
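
A hedged sketch of that transformation (the row layout is hypothetical, and MLlib's ridge penalty convention may differ from the lambda * n scaling in the title, so regParam would need adjusting):

    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.mllib.regression.{LabeledPoint, RidgeRegressionWithSGD}

    // Assumes each element of `rows` is (rowOfA: Array[Double], b_i: Double).
    val labeled = rows.map { case (a, b) => LabeledPoint(b, Vectors.dense(a)) }
    val model = RidgeRegressionWithSGD.train(labeled, 100, 1.0, 0.01)
    // arguments: numIterations, stepSize, regParam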

Re: Why different numbers of partitions give different results for the same computation on the same dataset?

2015-03-03 Thread Tathagata Das
You can use DStream.transform() to do any arbitrary RDD transformations on the RDDs generated by a DStream. val coalescedDStream = myDStream.transform { _.coalesce(...) } On Tue, Mar 3, 2015 at 1:47 PM, Saiph Kappa saiph.ka...@gmail.com wrote: Sorry I made a mistake in my code. Please ignore
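
A short sketch of that pattern (stream name and partition count are placeholders):

    // Rewrites each micro-batch RDD down to a fixed number of partitions.
    val coalescedDStream = myDStream.transform(rdd => rdd.coalesce(4))

Use rdd.repartition(n) inside the same transform if you need to increase parallelism instead, at the cost of a shuffle.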

Re: insert Hive table with RDD

2015-03-03 Thread Jagat Singh
Will this recognize Hive partitions as well, for example inserting into a specific partition of a Hive table? On Tue, Mar 3, 2015 at 11:42 PM, Cheng, Hao hao.ch...@intel.com wrote: Using the SchemaRDD / DataFrame API via HiveContext. Assume you're using the latest code; something probably like: val hc

Re: Can not query TempTable registered by SQL Context using HiveContext

2015-03-03 Thread Michael Armbrust
As it says in the API docs https://spark.apache.org/docs/1.2.0/api/scala/index.html#org.apache.spark.sql.SchemaRDD, tables created with registerTempTable are local to the context that creates them: ... The lifetime of this temporary table is tied to the SQLContext
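
The practical consequence, as a hedged sketch (names are hypothetical): build the SchemaRDD from, and register it on, the same context you later query with.

    val hc = new org.apache.spark.sql.hive.HiveContext(sc)
    // Built via hc, so the temp table lives in hc's catalog:
    val schemaRdd = hc.applySchema(rowRDD, schema)
    schemaRdd.registerTempTable("mydata")
    hc.sql("SELECT COUNT(*) FROM mydata").collect().foreach(println)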

dynamically change receiver for a spark stream

2015-03-03 Thread Islem
Hi all, I have been trying to set up a stream using a custom receiver that would pick up data from Twitter, using the follow function to listen to just some users. I'd like to keep that streaming context running and dynamically change the custom receiver by adding ids of users that I'd listen to.

Re: Some questions after playing a little with the new ml.Pipeline.

2015-03-03 Thread Joseph Bradley
I see. I think your best bet is to create the cnnModel on the master and then serialize it to send to the workers. If it's big (1M or so), then you can broadcast it and use the broadcast variable in the UDF. There is not a great way to do something equivalent to mapPartitions with UDFs right
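
A sketch of the broadcast approach (cnnModel and its transform method are hypothetical stand-ins for the model discussed in the thread):

    // Ship one read-only copy of the model per executor instead of per task.
    val bcModel = sc.broadcast(cnnModel)
    val features = dataset.map { row =>
      bcModel.value.transform(row) // hypothetical scoring call
    }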

Re: Issue with yarn cluster - hangs in accepted state.

2015-03-03 Thread Zhan Zhang
Do you have enough resources in your cluster? You can check your resource manager to see the usage. Thanks. Zhan Zhang On Mar 3, 2015, at 8:51 AM, abhi abhishek...@gmail.com wrote: I am trying to run the below java class with yarn cluster, but it hangs in accepted

Does sc.newAPIHadoopFile support multiple directories (or nested directories)?

2015-03-03 Thread S. Zhou
I did some experiments and it seems not. But I'd like to get confirmation (or perhaps I missed something). If it does support them, could you let me know how to specify multiple folders? Thanks. Senqiang

Re: gc time too long when using mllib als

2015-03-03 Thread Xiangrui Meng
Also try 1.3.0-RC1 or the current master. ALS should perform much better in 1.3. -Xiangrui On Tue, Mar 3, 2015 at 1:00 AM, Akhil Das ak...@sigmoidanalytics.com wrote: You need to increase the parallelism/repartition the data to a higher number to get rid of those. Thanks Best Regards

Re: Resource manager UI for Spark applications

2015-03-03 Thread Zhan Zhang
In Yarn (cluster or client), you can access the spark ui when the app is running. After the app is done, you can still access it, but it needs some extra setup for the history server. Thanks. Zhan Zhang On Mar 3, 2015, at 10:08 AM, Ted Yu yuzhih...@gmail.com wrote: bq.

Re: Why different numbers of partitions give different results for the same computation on the same dataset?

2015-03-03 Thread Saiph Kappa
Sorry I made a mistake in my code. Please ignore my question number 2. Different numbers of partitions give *the same* results! On Tue, Mar 3, 2015 at 7:32 PM, Saiph Kappa saiph.ka...@gmail.com wrote: Hi, I have a spark streaming application, running on a single node, consisting mainly of

Re: Resource manager UI for Spark applications

2015-03-03 Thread roni
ah!! I think I know what you mean. My job was just in the accepted stage for a long time as it was running a huge file. But now that it is in the running stage, I can see it. I can see it at port 9046 though, instead of 4040. But I can see it. Thanks -roni On Tue, Mar 3, 2015 at 1:19 PM, Zhan Zhang

how to save Word2VecModel

2015-03-03 Thread anupamme
Hello, I started using spark. I am working with Word2VecModel. However I am not able to save the trained model. Here is what I am doing: inp = sc.textFile("/Users/mediratta/code/word2vec/trunk-d/sub-5").map(lambda row: row.split(" ")) word2vec = Word2Vec() model = word2vec.fit(inp) out =

Re: Spark Streaming Switchover Time

2015-03-03 Thread Tathagata Das
Can you elaborate on what is this switchover time? TD On Tue, Mar 3, 2015 at 9:57 PM, Nastooh Avessta (navesta) nave...@cisco.com wrote: Hi On a standalone, Spark 1.0.0, with 1 master and 2 workers, operating in client mode, running a udp streaming application, I am noting around 2

Connecting a PHP/Java applications to Spark SQL Thrift Server

2015-03-03 Thread fanooos
We have installed a hadoop cluster with hive and spark, and the spark sql thrift server is up and running without any problem. Now we have a set of applications that need to use the spark sql thrift server to query some data. Some of these applications are java applications and the others are PHP

Re: UnsatisfiedLinkError related to libgfortran when running MLLIB code on RHEL 5.8

2015-03-03 Thread Xiangrui Meng
libgfortran.x86_64 4.1.2-52.el5_8.1 comes with libgfortran.so.1 but not libgfortran.so.3. JBLAS requires the latter. If you have root access, you can try to install a newer version of libgfortran. Otherwise, maybe you can try Spark 1.3, which doesn't use JBLAS in ALS. -Xiangrui On Tue, Mar 3,

Re: Can not query TempTable registered by SQL Context using HiveContext

2015-03-03 Thread shahab
Thanks Michael. I understand now. best, /Shahab On Tue, Mar 3, 2015 at 9:38 PM, Michael Armbrust mich...@databricks.com wrote: As it says in the API docs https://spark.apache.org/docs/1.2.0/api/scala/index.html#org.apache.spark.sql.SchemaRDD, tables created with registerTempTable are local

Re: Having lots of FetchFailedException in join

2015-03-03 Thread Jianshi Huang
Hmm... ok, previous errors are still block fetch errors. 15/03/03 10:22:40 ERROR RetryingBlockFetcher: Exception while beginning fetch of 11 outstanding blocks java.io.IOException: Failed to connect to host-/:55597 at

java.lang.NoClassDefFoundError: org/apache/spark/streaming/kafka/KafkaUtils

2015-03-03 Thread Krishnanand Khambadkone
Hi, When I submit my spark job, I see the following runtime exception in the log: Exception in thread "Thread-1" java.lang.NoClassDefFoundError: org/apache/spark/streaming/kafka/KafkaUtils at SparkHdfs.run(SparkHdfs.java:56) Caused by: java.lang.ClassNotFoundException:

Spark sql results can't be printed out to system console from spark streaming application

2015-03-03 Thread Cui Lin
Dear all, I found that the below sample code prints its results only in the spark shell; when I moved it into my spark streaming application, nothing is printed to the system console. Can you explain why this happens? Is it anything related to the new spark context? Thanks a lot! val anotherPeopleRDD
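
For comparison, a minimal sketch of issuing SQL per micro-batch and printing on the driver (1.2-era API; the DStream of JSON strings is hypothetical):

    jsonStream.foreachRDD { rdd =>
      if (rdd.count() > 0) {                 // skip empty batches
        val people = sqlContext.jsonRDD(rdd) // infer the schema per batch
        people.registerTempTable("people")
        sqlContext.sql("SELECT name FROM people LIMIT 10")
          .collect().foreach(println)        // collect() so output reaches the driver console
      }
    }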

Re: Spark sql results can't be printed out to system console from spark streaming application

2015-03-03 Thread Tobias Pfeiffer
Hi, can you explain how you copied that into your *streaming* application? Like, how do you issue the SQL, what data do you operate on, how do you view the logs etc.? Tobias On Wed, Mar 4, 2015 at 8:55 AM, Cui Lin cui@hds.com wrote: Dear all, I found the below sample code can be

Re: Spark Streaming Switchover Time

2015-03-03 Thread Tathagata Das
I am confused. Are you killing the 1st worker node to see whether the system restarts the receiver on the second worker? TD On Tue, Mar 3, 2015 at 10:49 PM, Nastooh Avessta (navesta) nave...@cisco.com wrote: This is the time that it takes for the driver to start receiving data once again,

Re: Some questions after playing a little with the new ml.Pipeline.

2015-03-03 Thread Jaonary
I think it would be interesting to have an equivalent of mapPartitions for DataFrames. There are many use cases where data are processed in batch. Another example is a simple linear classifier Ax = b, where A is the matrix of feature vectors, x the model and b the output. Here again the product

RE: Connecting a PHP/Java applications to Spark SQL Thrift Server

2015-03-03 Thread nate
SparkSQL supports JDBC/ODBC connectivity, so if that's the route you need/want to connect through, you can do so via java/php apps. I haven't used either so I can't speak to the developer experience; I assume it's pretty good, as it would be the preferred method for lots of third party enterprise
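
A sketch of the JDBC route from the JVM side (host, port, and credentials are placeholders; the Thrift server speaks the HiveServer2 protocol, so the Hive JDBC driver applies):

    import java.sql.DriverManager

    Class.forName("org.apache.hive.jdbc.HiveDriver")
    val conn = DriverManager.getConnection(
      "jdbc:hive2://thrift-host:10000/default", "user", "")
    val rs = conn.createStatement().executeQuery("SELECT 1")
    while (rs.next()) println(rs.getInt(1))
    conn.close()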

Re: One of the executor not getting StopExecutor message

2015-03-03 Thread Akhil Das
Not quite sure, but you can try increasing spark.akka.threads; most likely it is a yarn related issue. Thanks Best Regards On Tue, Mar 3, 2015 at 3:38 PM, twinkle sachdeva twinkle.sachd...@gmail.com wrote: Hi, Operations are not very extensive, as this scenario is not always

RE: Spark Streaming Switchover Time

2015-03-03 Thread Nastooh Avessta (navesta)
This is the time that it takes for the driver to start receiving data once again, from the 2nd worker, when the 1st worker, where the streaming thread was initially running, is shut down. Cheers, Nastooh Avessta ENGINEER.SOFTWARE

Re: Does sc.newAPIHadoopFile support multiple directories (or nested directories)?

2015-03-03 Thread S. Zhou
Thanks Ted. Actually a follow-up question: I need to read multiple HDFS files into RDDs. What I am doing now is: for each file I read it into an RDD, then later on I union all these RDDs into one RDD. I am not sure if it is the best way to do it. Thanks, Senqiang On Tuesday, March 3, 2015
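
Two hedged alternatives to the per-file union loop (paths are placeholders): textFile accepts comma-separated paths and globs, and SparkContext.union combines a whole collection of RDDs in one call.

    // Comma-separated paths; globs like part-* also work.
    val all = sc.textFile("hdfs:///dir1/part-*,hdfs:///dir2/part-*")

    // Or union a collection of RDDs at once (paths is a hypothetical Seq[String]).
    val combined = sc.union(paths.map(p => sc.textFile(p)))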

Re: Does sc.newAPIHadoopFile support multiple directories (or nested directories)?

2015-03-03 Thread Ted Yu
Looking at FileInputFormat#listStatus(): // Whether we need to recursive look into the directory structure boolean recursive = job.getBoolean(INPUT_DIR_RECURSIVE, false); where: public static final String INPUT_DIR_RECURSIVE = "mapreduce.input.fileinputformat.input.dir.recursive";
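
A sketch of turning that flag on for a SparkContext (the path is a placeholder):

    sc.hadoopConfiguration.set(
      "mapreduce.input.fileinputformat.input.dir.recursive", "true")
    val nested = sc.textFile("hdfs:///data/root").count()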

Re: Does sc.newAPIHadoopFile support multiple directories (or nested directories)?

2015-03-03 Thread Ted Yu
Thanks for the confirmation, Stephen. On Tue, Mar 3, 2015 at 3:53 PM, Stephen Boesch java...@gmail.com wrote: Thanks, I was looking at an old version of FileInputFormat.. BEFORE setting the recursive config ( mapreduce.input.fileinputformat.input.dir.recursive) scala

Re: ImportError: No module named iter ... (on CDH5 v1.2.0+cdh5.3.2+369-1.cdh5.3.2.p0.17.el6.noarch) ...

2015-03-03 Thread Marcelo Vanzin
Weird python errors like this generally mean you have different versions of python on the nodes of your cluster. Can you check that? On Tue, Mar 3, 2015 at 4:21 PM, subscripti...@prismalytics.io subscripti...@prismalytics.io wrote: Hi Friends: We noticed the following in 'pyspark' happens when

Re: Issue with yarn cluster - hangs in accepted state.

2015-03-03 Thread Tobias Pfeiffer
Hi, On Wed, Mar 4, 2015 at 6:20 AM, Zhan Zhang zzh...@hortonworks.com wrote: Do you have enough resource in your cluster? You can check your resource manager to see the usage. Yep, I can confirm that this is a very annoying issue. If there is not enough memory or VCPUs available, your app

Re: Does sc.newAPIHadoopFile support multiple directories (or nested directories)?

2015-03-03 Thread Stephen Boesch
Thanks, I was looking at an old version of FileInputFormat. BEFORE setting the recursive config (mapreduce.input.fileinputformat.input.dir.recursive): scala> sc.textFile("dev/*").count java.io.IOException: Not a file: file:/shared/sparkup/dev/audit-release/blank_maven_build The default is

TreeNodeException: Unresolved attributes

2015-03-03 Thread Anusha Shamanur
Hi, I am trying to run a simple select query on a table: val restaurants = hiveCtx.hql("select * from TableName where column like '%SomeString%'") This gives an error as below: org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Unresolved attributes: *, tree: How do I solve this?

Re: Does sc.newAPIHadoopFile support multiple directories (or nested directories)?

2015-03-03 Thread Stephen Boesch
The sc.textFile() invokes the Hadoop FileInputFormat via the (subclass) TextInputFormat. The logic to do the recursive directory reading does exist inside, i.e. first detecting whether an entry is a directory and if so then descending: for (FileStatus

Re: LBGFS optimizer performace

2015-03-03 Thread Joseph Bradley
I would recommend caching; if you can't persist, iterative algorithms will not work well. I don't think calling count on the dataset is problematic; every iteration in LBFGS iterates over the whole dataset and does a lot more computation than count(). It would be helpful to see some error
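
A minimal sketch of the caching recommendation (the storage level is a judgment call when the data may not fit in memory):

    import org.apache.spark.storage.StorageLevel

    val training = data.persist(StorageLevel.MEMORY_AND_DISK)
    training.count() // materialize once before the LBFGS iterations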

ImportError: No module named iter ... (on CDH5 v1.2.0+cdh5.3.2+369-1.cdh5.3.2.p0.17.el6.noarch) ...

2015-03-03 Thread subscripti...@prismalytics.io
Hi Friends: We noticed that the following happens in 'pyspark' when running in distributed Standalone Mode (MASTER=spark://vps00:7077), but not in Local Mode (MASTER=local[n]). See the following, particularly what is highlighted in *Red* (again, the problem only happens in Standalone Mode). Any

RE: Spark SQL Thrift Server start exception : java.lang.ClassNotFoundException: org.datanucleus.api.jdo.JDOPersistenceManagerFactory

2015-03-03 Thread Cheng, Hao
Which version / distribution are you using? Please reference this blog that Felix C posted if you're running on CDH: http://eradiating.wordpress.com/2015/02/22/getting-hivecontext-to-work-in-cdh/ Or you may also need to download the datanucleus*.jar files and try to add them with the “--jars” option

Re: LBGFS optimizer performace

2015-03-03 Thread Gustavo Enrique Salazar Torres
Yeah, I can call count before that and it works. Also I was over-caching tables, but I removed those. Now there is no caching, but it gets really slow since it calculates my table RDD many times. I also hacked the LBFGS code to pass in the number of examples, which I calculated outside in Spark SQL

Re: Does sc.newAPIHadoopFile support multiple directories (or nested directories)?

2015-03-03 Thread Ted Yu
Looking at the scaladoc: /** Get an RDD for a Hadoop file with an arbitrary new API InputFormat. */ def newAPIHadoopFile[K, V, F <: NewInputFormat[K, V]] Your conclusion is confirmed. On Tue, Mar 3, 2015 at 1:59 PM, S. Zhou myx...@yahoo.com.invalid wrote: I did some experiments and it seems
