Re: Matrix computation in Spark
There's been some work at the AMPLab on a distributed matrix library on top of Spark; see [1]. In particular, the repo contains a couple of factorization algorithms.

[1] https://github.com/amplab/ml-matrix

Zongheng

On Mon, Nov 17, 2014 at 7:34:17 PM, liaoyuxi liaoy...@huawei.com wrote:
> Hi,
>
> Matrix computation is critical for the efficiency of algorithms like least squares, the Kalman filter, and so on. For now, the MLlib module offers limited linear algebra on matrices, especially distributed matrices.
>
> We have been working on establishing distributed matrix computation APIs based on the data structures in MLlib. The main idea is to partition the matrix into sub-blocks, following the strategy in the paper below. In our experiments it is communication-optimal, but operations like factorization may not be appropriate to carry out block-wise.
>
> http://www.cs.berkeley.edu/~odedsc/papers/bfsdfs-mm-ipdps13.pdf
>
> Any suggestions and guidance are welcome.
>
> Thanks,
> Yuxi
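To make the block-partitioning idea concrete, here is a minimal sketch of how a block-partitioned matrix could be represented and multiplied on Spark. It is not the ml-matrix API; the block size is assumed compatible between the two operands, and the names (Block, blockMultiply) are made up for illustration.

    // Hedged sketch only: a block-partitioned distributed matrix as an RDD of
    // ((blockRow, blockCol), localBlock) pairs, with multiplication done by joining
    // on the shared inner block index and summing the partial products.
    import breeze.linalg.DenseMatrix
    import org.apache.spark.SparkContext._   // pair-RDD implicits (needed on pre-1.3 Spark)
    import org.apache.spark.rdd.RDD

    object BlockMatrixSketch {
      type Block = ((Int, Int), DenseMatrix[Double])   // ((blockRow, blockCol), data)

      /** C = A * B, assuming both inputs use the same square block size. */
      def blockMultiply(a: RDD[Block], b: RDD[Block]): RDD[Block] = {
        val aByInner = a.map { case ((i, k), m) => (k, (i, m)) }  // key A blocks by inner index k
        val bByInner = b.map { case ((k, j), m) => (k, (j, m)) }  // key B blocks by inner index k
        aByInner.join(bByInner)                                   // co-locate blocks sharing k
          .map { case (_, ((i, ma), (j, mb))) => ((i, j), ma * mb) } // partial products
          .reduceByKey(_ + _)                                        // sum partials into C(i, j)
      }
    }

This only captures the block-wise arithmetic; the communication-optimal scheme in the cited paper additionally controls how blocks are replicated and which dimensions are partitioned, which the sketch does not attempt.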
Re: Matrix operations in Scala / Spark
We recently released a research prototype of a lightweight matrix library for Spark here: https://github.com/amplab/ml-matrix. It does support norm and subtraction. Feel free to base your implementation on top of it.

Zongheng

On Sat, Oct 25, 2014 at 07:12, Xuefeng Wu ben...@gmail.com wrote:
> How about non/spire or twitter/scalding?
>
> Yours respectfully,
> Xuefeng Wu (吴雪峰)
>
> On Oct 25, 2014, at 9:03 PM, salexln sale...@gmail.com wrote:
>> Hi guys,
>>
>> I'm working on the implementation of the FuzzyCMeans algorithm (JIRA: https://issues.apache.org/jira/browse/SPARK-2344) and I need to use some operations on matrices (norm and subtraction). I could not find any Scala / Spark matrix class that supports these operations.
>>
>> Should I implement the matrix as a two-dimensional array and write my own code for the norm and subtraction?
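For local matrices, here is a minimal sketch of the two operations mentioned (subtraction and a Frobenius norm) using Breeze, which Spark/MLlib already depends on, rather than a hand-rolled two-dimensional array; the values and variable names are purely illustrative.

    // Hedged sketch: element-wise subtraction and a Frobenius norm on local matrices
    // using Breeze, instead of a hand-rolled 2-D array of doubles.
    import breeze.linalg.{DenseMatrix, sum}

    val centerOld = DenseMatrix((1.0, 2.0), (3.0, 4.0))   // illustrative values
    val centerNew = DenseMatrix((0.5, 1.5), (2.5, 3.5))

    val diff      = centerNew - centerOld                  // element-wise subtraction
    val frobenius = math.sqrt(sum(diff :* diff))           // Frobenius norm of the difference
    println(s"Frobenius norm of the update: $frobenius")

A norm of the difference between old and new cluster centers is typically what a convergence check in an algorithm like fuzzy c-means needs, so something along these lines avoids maintaining a bespoke matrix class.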
Re: Getting the execution times of a Spark job
For your second question: hql() (as well as sql()) does not launch a Spark job immediately; instead, it first runs the Spark SQL parser/optimizer/planner pipeline, and a Spark job is started only after a physical execution plan has been selected. Therefore, your hand-rolled end-to-end measurement includes the time spent in the Spark SQL code path, while the times reported in the UI are the execution times of the Spark job(s) only.

On Mon, Sep 1, 2014 at 11:45 PM, Niranda Perera nira...@wso2.com wrote:
> Hi,
>
> I have been playing around with Spark for a couple of days. I am using spark-1.0.1-bin-hadoop1 and the Java API. The main idea of the implementation is to run Hive queries on Spark. I used JavaHiveContext to achieve this (as per the examples). I have two questions.
>
> 1. How can I get the execution times of a Spark job? Does Spark provide monitoring facilities in the form of an API?
>
> 2. I used a layman's way to get the execution times by enclosing a JavaHiveContext.hql call with System.nanoTime(), as follows:
>
>     long start, end;
>     JavaHiveContext hiveCtx;
>     JavaSchemaRDD hiveResult;
>
>     start = System.nanoTime();
>     hiveResult = hiveCtx.hql(query);
>     end = System.nanoTime();
>     System.out.println(end - start);   // elapsed time in nanoseconds
>
> But the result I got is drastically different from the execution times recorded in the Spark UI. Can you please explain this disparity?
>
> Look forward to hearing from you.
>
> rgds
>
> --
> Niranda Perera
> Software Engineer, WSO2 Inc.
> Mobile: +94-71-554-8430
> Twitter: @n1r44 https://twitter.com/N1R44
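For the first question, one programmatic option is to register a SparkListener and read per-stage timings from StageInfo, which is roughly the data the UI displays. This is a hedged sketch rather than an official monitoring endpoint: SparkListener is a developer-facing API, and the listener class name here is made up.

    // Hedged sketch for question 1: register a SparkListener (a developer API) and
    // print per-stage wall-clock times taken from StageInfo.
    import org.apache.spark.scheduler.{SparkListener, SparkListenerStageCompleted}

    class StageTimingListener extends SparkListener {
      override def onStageCompleted(stageCompleted: SparkListenerStageCompleted): Unit = {
        val info = stageCompleted.stageInfo
        for (start <- info.submissionTime; end <- info.completionTime) {
          println(s"Stage ${info.stageId} '${info.name}' took ${end - start} ms")
        }
      }
    }

    // Register it on the SparkContext backing the (Java)HiveContext, e.g.:
    //   sc.addSparkListener(new StageTimingListener())

Note that these timings are in milliseconds, whereas System.nanoTime() measures nanoseconds, so convert before comparing the two.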
Re: compilation error in Catalyst module
Hi Ted,

By refreshing, do you mean you have done 'mvn clean'?

On Wed, Aug 6, 2014 at 1:17 PM, Ted Yu yuzhih...@gmail.com wrote:
> I refreshed my workspace. I got the following error with this command:
>
>     mvn -Pyarn -Phive -Phadoop-2.4 -DskipTests install
>
> [ERROR] bad symbolic reference. A signature in package.class refers to term scalalogging in package com.typesafe which is not available. It may be completely missing from the current classpath, or the version on the classpath might be incompatible with the version used when compiling package.class.
> [ERROR] /homes/hortonzy/spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/trees/package.scala:36: bad symbolic reference. A signature in package.class refers to term slf4j in value com.typesafe.scalalogging which is not available. It may be completely missing from the current classpath, or the version on the classpath might be incompatible with the version used when compiling package.class.
> [ERROR] package object trees extends Logging {
> [ERROR] ^
> [ERROR] two errors found
>
> Has anyone else seen the above?
>
> Thanks
Re: preferred Hive/Hadoop environment for generating golden test outputs
Hi Will,

These three environment variables are needed [1]. I have had success with Hive 0.12 and Hadoop 1.0.4. For Hive, getting the source distribution seems to be required. Docs contribution will be much appreciated!

[1] https://github.com/apache/spark/tree/master/sql#other-dependencies-for-developers

Zongheng

On Thu, Jul 17, 2014 at 7:51 PM, Will Benton wi...@redhat.com wrote:
> Hi all,
>
> What's the preferred environment for generating golden test outputs for new Hive tests? In particular:
>
> * what Hadoop version and Hive version should I be using,
> * are there particular distributions people have run successfully, and
> * are there any system properties or environment variables (beyond HADOOP_HOME, HIVE_HOME, and HIVE_DEV_HOME) I need to set before running the suite?
>
> I ask because I'm getting some errors while trying to add new tests and would like to eliminate any possible problems caused by differences between what my environment offers and what Spark expects. (I'm currently running with the Fedora packages for Hadoop 2.2.0 and a locally-built Hive 0.12.0.)
>
> Since I'll only be using this for generating test outputs, something as simple to set up as possible would be great. (Once I get something working, I'll be happy to write it up and contribute it as developer docs.)
>
> thanks,
> wb