Re: Akka Connection refused - standalone cluster using spark-0.9.0

2014-05-28 Thread jaranda
Same here, got stuck at this point. Any hints on what might be going on? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Akka-Connection-refused-standalone-cluster-using-spark-0-9-0-tp1297p6463.html Sent from the Apache Spark User List mailing list archive

Problem using Spark with Hbase

2014-05-28 Thread Vibhor Banga
Hi all, I am facing issues while using Spark with HBase. I am getting a NullPointerException at org.apache.hadoop.hbase.TableName.valueOf (TableName.java:288). Can someone please help me resolve this issue? What am I missing? I am using the following snippet of code - Configuration config =
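
Vibhor's snippet is cut off right at the Configuration line. For reference, here is a minimal Scala sketch of reading an HBase table from Spark via TableInputFormat; the table name and setup below are assumptions, not his code. One frequent culprit for an NPE inside TableName.valueOf (only a guess here) is that TableInputFormat.INPUT_TABLE was never set on the configuration.

    import org.apache.hadoop.hbase.HBaseConfiguration
    import org.apache.hadoop.hbase.client.Result
    import org.apache.hadoop.hbase.io.ImmutableBytesWritable
    import org.apache.hadoop.hbase.mapreduce.TableInputFormat
    import org.apache.spark.SparkContext

    object HBaseReadSketch {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext("local", "HBaseReadSketch")

        // Build an HBase configuration and name the table to scan; leaving
        // TableInputFormat.INPUT_TABLE unset is a common source of NPEs.
        val conf = HBaseConfiguration.create()
        conf.set(TableInputFormat.INPUT_TABLE, "my_table") // hypothetical table name

        // Each record is (row key, Result) straight from the HBase scanner.
        val rdd = sc.newAPIHadoopRDD(
          conf,
          classOf[TableInputFormat],
          classOf[ImmutableBytesWritable],
          classOf[Result])

        println(s"row count: ${rdd.count()}")
        sc.stop()
      }
    }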

Inter and Inra Cluster Density in KMeans

2014-05-28 Thread Stuti Awasthi
Hi, I wanted to calculate the inter-cluster density and intra-cluster density of the clusters generated by KMeans. How can I achieve that? Is there any existing code/API I can use for this purpose? Thanks, Stuti Awasthi ::DISCLAIMER::
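
MLlib in this timeframe has no built-in density metrics, so below is a rough Scala sketch of computing them by hand: intra-cluster density as the mean distance from each point to its assigned centroid, and inter-cluster density as the mean pairwise distance between centroids. It assumes Spark 1.0's Vector-based MLlib API and a hypothetical input file of space-separated features.

    import org.apache.spark.SparkContext
    import org.apache.spark.SparkContext._ // pair-RDD implicits (reduceByKey, mapValues)
    import org.apache.spark.mllib.clustering.KMeans
    import org.apache.spark.mllib.linalg.{Vector, Vectors}

    object ClusterDensitySketch {
      // Plain Euclidean distance between two MLlib vectors.
      def dist(a: Vector, b: Vector): Double =
        math.sqrt(a.toArray.zip(b.toArray).map { case (x, y) => (x - y) * (x - y) }.sum)

      def main(args: Array[String]): Unit = {
        val sc = new SparkContext("local", "ClusterDensitySketch")

        // Hypothetical input: one space-separated feature vector per line.
        val data = sc.textFile("features.txt")
          .map(line => Vectors.dense(line.split(' ').map(_.toDouble)))
          .cache()

        val model = KMeans.train(data, 5, 20) // k = 5, 20 iterations
        val centers = model.clusterCenters

        // Intra-cluster density: average distance of points to their own centroid.
        val intra = data.map { p =>
          val c = model.predict(p)
          (c, (dist(p, centers(c)), 1L))
        }.reduceByKey { case ((d1, n1), (d2, n2)) => (d1 + d2, n1 + n2) }
         .mapValues { case (d, n) => d / n }

        intra.collect().foreach { case (c, avg) => println(s"cluster $c intra-cluster density: $avg") }

        // Inter-cluster density: average pairwise distance between centroids.
        val pairwise = for (i <- centers.indices; j <- centers.indices if i < j)
          yield dist(centers(i), centers(j))
        println(s"inter-cluster density: ${pairwise.sum / pairwise.size}")

        sc.stop()
      }
    }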

Re: Akka Connection refused - standalone cluster using spark-0.9.0

2014-05-28 Thread Gino Bustelo
I've been playing with the amplab docker scripts and I needed to set spark.driver.host to the driver host's IP, one that all Spark processes can reach. On May 28, 2014, at 4:35 AM, jaranda jordi.ara...@bsc.es wrote: Same here, got stuck at this point. Any hints on what might be going on?
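
For reference, spark.driver.host is just a Spark property set on the driver before the SparkContext is created. A minimal Scala sketch (the master URL and IP address are placeholders):

    import org.apache.spark.{SparkConf, SparkContext}

    object DriverHostExample {
      def main(args: Array[String]): Unit = {
        // Bind the driver to an address that the master, workers and executors
        // can all route back to (e.g. the container's external IP, not 127.0.0.1).
        val conf = new SparkConf()
          .setMaster("spark://master-host:7077")  // placeholder master URL
          .setAppName("driver-host-example")
          .set("spark.driver.host", "10.0.0.5")   // placeholder: the driver's routable IP

        val sc = new SparkContext(conf)
        // ... job code ...
        sc.stop()
      }
    }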

Re: Writing RDDs from Python Spark progrma (pyspark) to HBase

2014-05-28 Thread Nick Pentreath
It's not currently possible to write anything other than text (or pickle files, I think in 1.0.0, or if not then in 1.0.1) from PySpark. I have an outstanding pull request to add READING any InputFormat from PySpark, and after that is in I will look into OutputFormat too. What does your data look

Reading bz2 files that do not end with .bz2

2014-05-28 Thread Laurent T
Hi, I have a bunch of files that are bz2 compressed but do not have the extension .bz2. Is there any way to force Spark to read them as bz2 files using sc.textFile? FYI, if I add the .bz2 extension to the file it works fine, but the process that creates those files can't do that and I'd like to

Re: Reading bz2 files that do not end with .bz2

2014-05-28 Thread Mayur Rustagi
You can use the Hadoop API: provide an input/output reader and a Hadoop configuration file to read the data. Regards Mayur Mayur Rustagi Ph: +1 (760) 203 3257 http://www.sigmoidanalytics.com @mayur_rustagi https://twitter.com/mayur_rustagi On Wed, May 28, 2014 at 7:22 PM, Laurent T
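
Mayur's reply is cut off. As one concrete workaround (a sketch only, not necessarily what he had in mind), the extension-based codec detection can be sidestepped entirely by distributing the file paths and decompressing each file explicitly with commons-compress. Each file is read whole by a single task, so this only suits files of modest size, and it needs commons-compress on the classpath:

    import org.apache.commons.compress.compressors.bzip2.BZip2CompressorInputStream
    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.Path
    import org.apache.spark.SparkContext
    import scala.io.Source

    object ForceBzip2Sketch {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext("local", "ForceBzip2Sketch")

        // Hypothetical bz2-compressed files that lack the .bz2 extension.
        val paths = Seq("hdfs:///data/part-00000", "hdfs:///data/part-00001")

        // One partition per file: open it through the Hadoop FileSystem API and
        // decompress with commons-compress, ignoring the file name entirely.
        val lines = sc.parallelize(paths, paths.size).flatMap { p =>
          val path = new Path(p)
          val fs = path.getFileSystem(new Configuration())
          val in = new BZip2CompressorInputStream(fs.open(path))
          try Source.fromInputStream(in, "UTF-8").getLines().toVector
          finally in.close()
        }

        println(s"line count: ${lines.count()}")
        sc.stop()
      }
    }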

Re: Problem using Spark with Hbase

2014-05-28 Thread Vibhor Banga
Anyone who has used Spark this way or has faced a similar issue, please help. Thanks, -Vibhor On Wed, May 28, 2014 at 6:03 PM, Vibhor Banga vibhorba...@gmail.com wrote: Hi all, I am facing issues while using spark with HBase. I am getting NullPointerException at

RE: GraphX partition problem

2014-05-28 Thread Zhicharevich, Alex
Hi Ankur, We built it from the git link you sent, and we don't get the exception anymore. However, we've been seeing strange nondeterministic behavior from GraphX. We compute connected components on a graph of ~900K edges. We ran the Spark job several times on the same input graph and got

Re: Comprehensive Port Configuration reference?

2014-05-28 Thread Jacob Eisinger
Howdy Andrew, Here is what I ran before an application context was created (other services have been deleted): # netstat -l -t tcp -p --numeric-ports Active Internet connections (only servers) Proto Recv-Q Send-Q Local Address Foreign Address State PID/Program name
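
As a side note, the handful of ports that are plain Spark properties can at least be pinned from the driver; whether that covers every port this thread is asking about is another question. A small Scala sketch with placeholder values:

    import org.apache.spark.{SparkConf, SparkContext}

    object FixedPortsExample {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf()
          .setMaster("spark://master-host:7077") // placeholder master URL
          .setAppName("fixed-ports-example")
          .set("spark.driver.port", "51000")     // placeholder: port the driver listens on
          .set("spark.ui.port", "4040")          // web UI port (4040 is also the default)

        val sc = new SparkContext(conf)
        // ... job code ...
        sc.stop()
      }
    }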

Re: Java RDD structure for Matrix predict?

2014-05-28 Thread Sandeep Parikh
Wisely, is mapToPair in Spark 0.9.1 or 1.0? I'm running the former and didn't see that method available. I think the issue is that predict() is expecting an RDD containing a tuple of ints and not Integers. So if I use JavaPairRDD<Object,Object> with my original code snippet, things seem to at least

Integration issue between Apache Shark-0.9.1 (with in-house hive-0.11) and pre-existing CDH4.6 HIVE-0.10 server

2014-05-28 Thread bijoy deb
Hi all, I have installed Apache Shark 0.9.1 on my machine, which comes bundled with the hive-0.11 version of the Hive jars. I am trying to integrate this with my pre-existing CDH-4.6 version of the Hive server, which is version 0.10. On pointing HIVE_HOME in spark-env.sh to the Cloudera version of the

Re: rdd ordering gets scrambled

2014-05-28 Thread Michael Malak
Mohit Jaggi: A workaround is to use zipWithIndex (to appear in Spark 1.0, but if you're still on 0.9.x you can swipe the code from https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/ZippedWithIndexRDD.scala ), map it to (x => (x._2, x._1)) and then sortByKey.
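
Spelled out, the workaround Michael describes looks roughly like this (a sketch assuming Spark 1.0, where zipWithIndex is built in):

    import org.apache.spark.SparkContext
    import org.apache.spark.SparkContext._ // pair-RDD implicits (sortByKey, values)

    object PreserveOrderExample {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext("local", "preserve-order-example")
        val rdd = sc.parallelize(Seq("a", "b", "c", "d"), 2)

        // Tag every element with its original position, key by that index,
        // and sort so downstream operations see the original order again.
        val reordered = rdd.zipWithIndex()             // RDD[(String, Long)]
          .map { case (value, idx) => (idx, value) }   // swap to (index, value)
          .sortByKey()
          .values

        println(reordered.collect().mkString(", "))
        sc.stop()
      }
    }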

K-NN by efficient sparse matrix product

2014-05-28 Thread Christian Jauvin
Hi, I'm new to Spark and Hadoop, and I'd like to know if the following problem is solvable in terms of Spark's primitives. To compute the K-nearest neighbours of an N-dimensional dataset, I can multiply my very large normalized sparse matrix by its transpose. As this yields all pairwise distance

Re: Spark Streaming RDD to Shark table

2014-05-28 Thread Chang Lim
OK...I needed to set the JVM classpath (java.class.path) for the worker to find the fb303 class: env.put("SPARK_JAVA_OPTS", "-Djava.class.path=/home/myInc/hive-0.9.0-bin/lib/libfb303.jar"); Now I am seeing the following spark.httpBroadcast.uri error. What am I missing? java.util.NoSuchElementException:

Re: Re: spark table to hive table

2014-05-28 Thread Michael Armbrust
On Tue, May 27, 2014 at 6:08 PM, JaeBoo Jung itsjb.j...@samsung.com wrote: I already tried HiveContext as well as SQLContext. But it seems that Spark's HiveContext is not completely the same as Apache Hive. For example, SQL like 'SELECT RANK() OVER(ORDER BY VAL1 ASC) FROM TEST LIMIT 10' works

A Standalone App in Scala: Standalone mode issues

2014-05-28 Thread jaranda
During the last few days I've been trying to deploy a Scala job to a standalone cluster (master + 4 workers) without much success, although it worked perfectly when launching it from the Spark shell, that is, using the Scala REPL (pretty strange, as this would mean my cluster config was actually

Re: K-NN by efficient sparse matrix product

2014-05-28 Thread Christian Jauvin
Thank you for your answer. Would you have by any chance some example code (even fragmentary) that I could study? On 28 May 2014 14:04, Tom Vacek minnesota...@gmail.com wrote: Maybe I should add: if you can hold the entire matrix in memory, then this is embarrassingly parallel. If not, then the
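
Tom's reply is truncated here; in the spirit of the "fits in memory, embarrassingly parallel" case he mentions, a rough Scala sketch (the sparse-row representation and all names are assumptions) is to broadcast the normalized rows and have each task score its rows against the whole matrix:

    import org.apache.spark.SparkContext

    object SparseKnnSketch {
      // A sparse row as (columnIndex -> value); dot product of two sparse rows.
      type SparseRow = Map[Int, Double]
      def dot(a: SparseRow, b: SparseRow): Double =
        a.iterator.map { case (i, v) => v * b.getOrElse(i, 0.0) }.sum

      def main(args: Array[String]): Unit = {
        val sc = new SparkContext("local", "SparseKnnSketch")
        val k = 5

        // Hypothetical L2-normalized sparse rows, keyed by row id.
        val rows: Seq[(Long, SparseRow)] = Seq(
          0L -> Map(0 -> 0.6, 3 -> 0.8),
          1L -> Map(0 -> 1.0),
          2L -> Map(3 -> 1.0))

        // Ship the whole matrix to every executor once.
        val all = sc.broadcast(rows.toArray)

        // Score each row against every other row (cosine similarity, since the
        // rows are normalized) and keep only the top-k neighbours.
        val neighbours = sc.parallelize(rows).map { case (id, row) =>
          val scored = all.value
            .filter { case (otherId, _) => otherId != id }
            .map { case (otherId, other) => (otherId, dot(row, other)) }
            .sortBy { case (_, score) => -score }
            .take(k)
          (id, scored.toSeq)
        }

        neighbours.collect().foreach(println)
        sc.stop()
      }
    }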

Re: Spark 1.0: slf4j version conflicts with pig

2014-05-28 Thread Ryan Compton
Remark: just including the jar built by sbt will produce the same error, i.e. this Pig script will fail: REGISTER /usr/share/osi1/spark-1.0.0/assembly/target/scala-2.10/spark-assembly-1.0.0-SNAPSHOT-hadoop0.20.2-cdh3u4.jar; edgeList0 = LOAD

Re: Invalid Class Exception

2014-05-28 Thread Suman Somasundar
On 5/27/2014 1:28 PM, Marcelo Vanzin wrote: On Tue, May 27, 2014 at 1:05 PM, Suman Somasundar suman.somasun...@oracle.com wrote: I am running this on a Solaris machine with logical partitions. All the partitions (workers) access the same Spark folder. Can you check whether you have multiple

Re: Spark 1.0: slf4j version conflicts with pig

2014-05-28 Thread Ryan Compton
posted a JIRA https://issues.apache.org/jira/browse/SPARK-1952 On Wed, May 28, 2014 at 1:14 PM, Ryan Compton compton.r...@gmail.com wrote: Remark: just including the jar built by sbt will produce the same error, i.e. this Pig script will fail: REGISTER

Re: Spark Memory Bounds

2014-05-28 Thread Keith Simmons
Thanks! Sounds like my rough understanding was roughly right :) I definitely understand that cached RDDs can add to the memory requirements. Luckily, as you mentioned, you can configure Spark to flush that to disk and bound its total size in memory via spark.storage.memoryFraction, so I have a
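
For completeness, a small Scala sketch of the two knobs being discussed (values and paths are placeholders): spark.storage.memoryFraction caps how much of the executor heap the block manager may use for cached blocks, and a MEMORY_AND_DISK storage level lets partitions that don't fit in that budget spill to local disk instead of being dropped and recomputed.

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.storage.StorageLevel

    object MemoryBoundsExample {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf()
          .setAppName("memory-bounds-example")
          // Use at most 50% of the heap for cached RDD blocks (the default was 0.6).
          .set("spark.storage.memoryFraction", "0.5")

        val sc = new SparkContext(conf)

        // Partitions that do not fit in the cache budget spill to local disk
        // instead of being dropped and recomputed.
        val data = sc.textFile("hdfs:///some/large/input") // placeholder path
          .persist(StorageLevel.MEMORY_AND_DISK)

        println(data.count())
        sc.stop()
      }
    }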

Re: GraphX partition problem

2014-05-28 Thread Ankur Dave
I've been trying to reproduce this but I haven't succeeded so far. For example, on the web-Google graph (https://snap.stanford.edu/data/web-Google.html), I get the expected results both on v0.9.1-handle-empty-partitions and on master: // Load web-Google and run connected components import
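
Ankur's code sample is cut off right after the import. A reconstruction in the same spirit (not necessarily his exact code), assuming the web-Google edge list has been downloaded to web-Google.txt:

    import org.apache.spark.SparkContext
    import org.apache.spark.SparkContext._ // pair-RDD implicits (reduceByKey)
    import org.apache.spark.graphx.GraphLoader

    object WebGoogleCC {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext("local", "WebGoogleCC")

        // Load web-Google (one "srcId dstId" pair per line) and run connected components.
        val graph = GraphLoader.edgeListFile(sc, "web-Google.txt")
        val cc = graph.connectedComponents().vertices // (vertexId, componentId)

        // Size of each component; print the five largest for a quick sanity check.
        val componentSizes = cc.map { case (_, componentId) => (componentId, 1L) }
          .reduceByKey(_ + _)
          .collect()
          .sortBy { case (_, size) => -size }
          .take(5)

        componentSizes.foreach(println)
        sc.stop()
      }
    }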

Re: Python, Spark and HBase

2014-05-28 Thread twizansk
Hi Nick, I finally got around to downloading and building the patch. I pulled the code from https://github.com/MLnick/spark-1/tree/pyspark-inputformats I am running on a CDH5 node. While the code in the CDH branch is different from spark master, I do believe that I have resolved any

Re: Python, Spark and HBase

2014-05-28 Thread Matei Zaharia
It sounds like you made a typo in the code — perhaps you’re trying to call self._jvm.PythonRDDnewAPIHadoopFile instead of self._jvm.PythonRDD.newAPIHadoopFile? There should be a dot before the new. Matei On May 28, 2014, at 5:25 PM, twizansk twiza...@gmail.com wrote: Hi Nick, I finally

Re: Checking spark cache percentage programatically. And how to clear cache.

2014-05-28 Thread Matei Zaharia
You can remove cached RDDs by calling unpersist() on them. You can also use SparkContext.getRDDStorageInfo to get info on cache usage, though this is a developer API so it may change in future versions. We will add a standard API eventually but this is just very closely tied to framework
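
A quick Scala sketch of both calls Matei mentions; the field names on the returned RDDInfo objects are from the 1.0-era developer API and may change in later releases:

    import org.apache.spark.SparkContext
    import org.apache.spark.storage.StorageLevel

    object CacheInspectionExample {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext("local", "cache-inspection-example")

        val data = sc.parallelize(1 to 1000000).persist(StorageLevel.MEMORY_ONLY)
        data.count() // materialize the cache

        // Developer API: one entry per cached RDD, with cached vs. total partition counts.
        sc.getRDDStorageInfo.foreach { info =>
          val pct = 100.0 * info.numCachedPartitions / info.numPartitions
          println(s"RDD ${info.id}: $pct% cached, ${info.memSize} bytes in memory")
        }

        // Drop the cached blocks once they are no longer needed.
        data.unpersist()
        sc.stop()
      }
    }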

Re: Python, Spark and HBase

2014-05-28 Thread twizansk
In my code I am not referencing PythonRDD or PythonRDDnewAPIHadoopFile at all. I am calling SparkContext.newAPIHadoopFile with: inputformat_class='org.apache.hadoop.hbase.mapreduce.TableInputFormat' key_class='org.apache.hadoop.hbase.io.ImmutableBytesWritable',

Re: Python, Spark and HBase

2014-05-28 Thread twizansk
The code which causes the error is: sc = SparkContext('local', 'My App') rdd = sc.newAPIHadoopFile( name, 'org.apache.hadoop.hbase.mapreduce.TableInputFormat', 'org.apache.hadoop.hbase.io.ImmutableBytesWritable',

Spark Stand-alone mode job not starting (akka Connection refused)

2014-05-28 Thread T.J. Alumbaugh
I've been trying for several days now to get a Spark application running in stand-alone mode, as described here: http://spark.apache.org/docs/latest/spark-standalone.html I'm using pyspark, so I've been following the example here:

Re: Spark on an HPC setup

2014-05-28 Thread Jeremy Freeman
Hi Sid, We are successfully running Spark on an HPC, it works great. Here's info on our setup / approach. We have a cluster with 256 nodes running Scientific Linux 6.3 and scheduled by Univa Grid Engine. The environment also has a DDN GridScalar running GPFS and several EMC Isilon clusters

Re: Integration issue between Apache Shark-0.9.1 (with in-house hive-0.11) and pre-existing CDH4.6 HIVE-0.10 server

2014-05-28 Thread bijoy deb
Hi, My shark-env.sh is already pointing to the hadoop2 cluster: export HADOOP_HOME=/opt/cloudera/parcels/CDH-4.6.0-1.cdh4.6.0.p0.26/lib/hadoop Both the Hadoop cluster and the embedded Hadoop jars within Shark are version 2.0.0. Any more suggestions please? Thanks On Wed, May 28,