Re: matrix computation in Spark

2014-11-17 Thread Zongheng Yang
There's been some work at the AMPLab on a distributed matrix library on top
of Spark; see here [1]. In particular, the repo contains a couple of
factorization algorithms.

[1] https://github.com/amplab/ml-matrix

Zongheng

On Mon Nov 17 2014 at 7:34:17 PM liaoyuxi liaoy...@huawei.com wrote:

 Hi,
 Matrix computation is critical to the efficiency of algorithms such as least
 squares and Kalman filtering.
 For now, the mllib module offers only limited linear algebra on matrices,
 especially for distributed matrices.

 We have been working on establishing distributed matrix computation APIs
 based on data structures in MLlib.
 The main idea is to partition the matrix into sub-blocks, following the
 strategy in the paper below.
 http://www.cs.berkeley.edu/~odedsc/papers/bfsdfs-mm-ipdps13.pdf
 In our experiments, this blocking scheme is communication-optimal.
 However, operations such as factorization may not be well suited to
 block-wise computation.
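 As a rough illustration of the block-partitioning idea only (a minimal sketch:
 the RDD layout, Breeze blocks, and the naive join-based multiply below are
 assumptions, not the communication-avoiding scheme from the paper or our
 actual API):

   import org.apache.spark.SparkContext._
   import org.apache.spark.rdd.RDD
   import breeze.linalg.DenseMatrix

   // Each distributed matrix is an RDD of ((blockRow, blockCol), denseBlock) pairs.
   type BlockMatrix = RDD[((Int, Int), DenseMatrix[Double])]

   // C = A * B: pair every A-block (i, k) with every B-block (k, j) sharing the
   // inner index k, multiply the local blocks, and sum partial products per (i, j).
   def multiply(a: BlockMatrix, b: BlockMatrix): BlockMatrix = {
     val aByK = a.map { case ((i, k), blk) => (k, (i, blk)) }
     val bByK = b.map { case ((k, j), blk) => (k, (j, blk)) }
     aByK.join(bByK)
       .map { case (_, ((i, aBlk), (j, bBlk))) => ((i, j), aBlk * bBlk) }
       .reduceByKey(_ + _)
   }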

 Any suggestions and guidance are welcome.

 Thanks,
 Yuxi




Re: Matrix operations in Scala / Spark

2014-10-25 Thread Zongheng Yang
We recently released a research prototype of a lightweight matrix library
for Spark here: https://github.com/amplab/ml-matrix which does support norm
and subtraction. Feel free to base your implementation on top of it.
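For purely local matrices, a minimal Breeze sketch (Breeze is the linear
algebra library MLlib already builds on; the choice of the Frobenius norm
below is an assumption about which norm is needed):

  import breeze.linalg.{DenseMatrix, sum}

  val a = DenseMatrix((1.0, 2.0), (3.0, 4.0))
  val b = DenseMatrix((0.5, 1.5), (2.5, 3.5))

  val diff = a - b                              // element-wise subtraction
  val frobenius = math.sqrt(sum(diff :* diff))  // Frobenius norm of the difference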

Zongheng
On Sat, Oct 25, 2014 at 07:12 Xuefeng Wu ben...@gmail.com wrote:

 How about non/spire or twitter/scalding?


 Respectfully yours, Xuefeng Wu 吴雪峰

  On October 25, 2014, at 9:03 PM, salexln sale...@gmail.com wrote:
 
  Hi guys,
 
  I'm working on the implementation of the FuzzyCMeans algorithm (JIRA:
  https://issues.apache.org/jira/browse/SPARK-2344)
  and I need to use some operations on matrices (norm & subtraction).
 
  I could not find any Scala/Spark matrix class that supports these
  operations.
 
  Should I implement the matrix as a two-dimensional array and write my own
  code for the norm & subtraction?




Re: Getting the execution times of spark job

2014-09-02 Thread Zongheng Yang
For your second question: hql() (as well as sql()) does not launch a
Spark job immediately; instead, it first goes through the Spark SQL
parser/optimizer/planner pipeline, and a Spark job is started only after
a physical execution plan has been selected. Therefore, your hand-rolled
end-to-end measurement includes the time spent in the Spark SQL code path,
whereas the times reported in the UI are the execution times of the Spark
job(s) only.
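One way to see the two components separately (a minimal sketch in the Scala
API; hiveCtx and query are the question's names, and how much work happens
eagerly inside hql() depends on the query):

  val t0 = System.nanoTime()
  val result = hiveCtx.hql(query)   // Spark SQL parse/optimize/plan code path
  val t1 = System.nanoTime()
  result.collect()                  // forces the Spark job(s) the UI reports on
  val t2 = System.nanoTime()

  println(s"SQL pipeline: ${(t1 - t0) / 1e6} ms; job execution: ${(t2 - t1) / 1e6} ms")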

On Mon, Sep 1, 2014 at 11:45 PM, Niranda Perera nira...@wso2.com wrote:
 Hi,

 I have been playing around with Spark for a couple of days. I am
 using spark-1.0.1-bin-hadoop1 and the Java API. The main idea of the
 implementation is to run Hive queries on Spark. I used JavaHiveContext to
 achieve this (as per the examples).

 I have 2 questions.
 1. I am wondering how I can get the execution times of a Spark job. Does
 Spark provide monitoring facilities in the form of an API?

 2. I used a naive way to get the execution times by wrapping a
 JavaHiveContext.hql() call with System.nanoTime(), as follows:

 long start, end;
 JavaHiveContext hiveCtx;    // initialized elsewhere
 JavaSchemaRDD hiveResult;

 start = System.nanoTime();
 hiveResult = hiveCtx.hql(query);
 end = System.nanoTime();
 System.out.println(end - start);   // elapsed time in nanoseconds

 But the result I got is drastically different from the execution times
 recorded in the Spark UI. Can you please explain this disparity?

 Look forward to hearing from you.

 rgds

 --
 *Niranda Perera*
 Software Engineer, WSO2 Inc.
 Mobile: +94-71-554-8430
 Twitter: @n1r44 https://twitter.com/N1R44

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: compilation error in Catalyst module

2014-08-06 Thread Zongheng Yang
Hi Ted,

By refreshing, do you mean that you have run 'mvn clean'?

On Wed, Aug 6, 2014 at 1:17 PM, Ted Yu yuzhih...@gmail.com wrote:
 I refreshed my workspace.
 I got the following error with this command:

 mvn -Pyarn -Phive -Phadoop-2.4 -DskipTests install

 [ERROR] bad symbolic reference. A signature in package.class refers to term
 scalalogging
 in package com.typesafe which is not available.
 It may be completely missing from the current classpath, or the version on
 the classpath might be incompatible with the version used when compiling
 package.class.
 [ERROR]
 /homes/hortonzy/spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/trees/package.scala:36:
 bad symbolic reference. A signature in package.class refers to term slf4j
 in value com.typesafe.scalalogging which is not available.
 It may be completely missing from the current classpath, or the version on
 the classpath might be incompatible with the version used when compiling
 package.class.
 [ERROR] package object trees extends Logging {
 [ERROR]  ^
 [ERROR] two errors found

 Has anyone else seen the above?

 Thanks

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: preferred Hive/Hadoop environment for generating golden test outputs

2014-07-17 Thread Zongheng Yang
Hi Will,

These three environment variables are needed [1].

I have had success with Hive 0.12 and Hadoop 1.0.4. For Hive, getting
the source distribution seems to be required. A docs contribution would
be much appreciated!

[1] https://github.com/apache/spark/tree/master/sql#other-dependencies-for-developers

Zongheng

On Thu, Jul 17, 2014 at 7:51 PM, Will Benton wi...@redhat.com wrote:
 Hi all,

 What's the preferred environment for generating golden test outputs for new 
 Hive tests?  In particular:

 * what Hadoop version and Hive version should I be using,
 * are there particular distributions people have run successfully, and
 * are there any system properties or environment variables (beyond 
 HADOOP_HOME, HIVE_HOME, and HIVE_DEV_HOME) I need to set before running the 
 suite?

 I ask because I'm getting some errors while trying to add new tests and would 
 like to eliminate any possible problems caused by differences between what my 
 environment offers and what Spark expects.  (I'm currently running with the 
 Fedora packages for Hadoop 2.2.0 and a locally-built Hive 0.12.0.)  Since 
 I'll only be using this for generating test outputs, something as simple to 
 set up as possible would be great.

 (Once I get something working, I'll be happy to write it up and contribute it 
 as developer docs.)


 thanks,
 wb