Re: Cosine Similarity between documents - Rows

2017-11-27 Thread Ge, Yao (Y.)
You are essentially doing document clustering. K-means will do it, but you do have to
specify the number of clusters up front.





From: "Donni Khan" 
>
Date: Monday, November 27, 2017 at 7:27:33 AM
To: "user@spark.apache.org" 
>
Subject: Cosine Similarity between documents - Rows


I have a Spark job that computes the similarity between text documents:

RowMatrix rowMatrix = new RowMatrix(vectorsRDD.rdd());
CoordinateMatrix rowsimilarity = rowMatrix.columnSimilarities(0.5);
JavaRDD<MatrixEntry> entries = rowsimilarity.entries().toJavaRDD();

List<MatrixEntry> list = entries.collect();

for (MatrixEntry s : list) System.out.println(s);

The MatrixEntry(i, j, value) represents the similarity between columns (that is,
between the features of the documents).
But how can I show the similarity between rows?
Suppose I have five documents, Doc1, …, Doc5; I would like to show the
similarity between all of those documents.
How do I get that? Any help?

Thank you
Donni
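
[columnSimilarities() only compares columns, so one common way to compare documents (rows) is to
transpose the matrix and then reuse columnSimilarities(). The following is only a rough sketch in
Scala, not taken from this thread, and it assumes vectorsRDD holds one MLlib Vector per document:

import org.apache.spark.mllib.linalg.distributed.{IndexedRow, IndexedRowMatrix}

// Attach a row index to each document vector so a distributed matrix can be rebuilt.
val indexedRows = vectorsRDD.rdd.zipWithIndex.map { case (v, i) => IndexedRow(i, v) }

// Transpose so documents become columns; column similarities are then
// similarities between the original documents.
val docSims = new IndexedRowMatrix(indexedRows)
  .toCoordinateMatrix()
  .transpose()
  .toRowMatrix()
  .columnSimilarities(0.5)

docSims.entries.collect().foreach(println)   // one MatrixEntry(i, j, value) per document pair

For N documents this can produce up to N*(N-1)/2 entries, so collect() is only sensible for small
document collections.]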


RE: Spark scala REPL - Unable to create sqlContext

2015-10-25 Thread Ge, Yao (Y.)
Thanks. I wonder why this is not widely reported in the user forum. The REPL
shell is basically broken in 1.5.0 and 1.5.1.
-Yao

From: Ted Yu [mailto:yuzhih...@gmail.com]
Sent: Sunday, October 25, 2015 12:01 PM
To: Ge, Yao (Y.)
Cc: user
Subject: Re: Spark scala REPL - Unable to create sqlContext

Have you taken a look at the fix for SPARK-11000 which is in the upcoming 1.6.0 
release ?

Cheers

On Sun, Oct 25, 2015 at 8:42 AM, Yao <y...@ford.com> wrote:
I have not been able to start the Spark Scala shell since 1.5, as it is not able
to create the sqlContext during startup. It complains that the metastore_db
is already locked: "Another instance of Derby may have already booted the
database". The Derby log is attached.

I only have this problem when starting the shell in yarn-client mode. I am
working with HDP 2.2.6, which runs Hadoop 2.6.

-Yao derby.log
<http://apache-spark-user-list.1001560.n3.nabble.com/file/n25195/derby.log>
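
[A workaround commonly used at the time, offered here only as a sketch: if Hive support is not
required, create a plain SQLContext by hand once the shell is up, which bypasses the embedded Derby
metastore entirely.

// inside spark-shell, after the automatic sqlContext creation has failed
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.implicits._
]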







[SPARK-9776]Another instance of Derby may have already booted the database #8947

2015-10-23 Thread Ge, Yao (Y.)
I have not been able to run spark-shell in yarn-cluster mode since 1.5.0 due to
the same issue described in [SPARK-9776]. Did this pull request fix the issue?
https://github.com/apache/spark/pull/8947
I still have the same problem with 1.5.1 (I am running on HDP 2.2.6 with Hadoop 
2.6)
Thanks.

-Yao



Decision Tree with libsvmtools datasets

2014-12-10 Thread Ge, Yao (Y.)
I am testing the decision tree using the iris.scale data set
(http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/multiclass.html#iris).
In the data set there are three class labels: 1, 2, and 3. However, in the
following code I have to set numClasses = 4; I get an
ArrayIndexOutOfBoundsException if I set numClasses = 3. Why?

var conf = new SparkConf().setAppName("DecisionTree")
var sc = new SparkContext(conf)

val data = MLUtils.loadLibSVMFile(sc, "data/iris.scale.txt")
val numClasses = 4
val categoricalFeaturesInfo = Map[Int, Int]()
val impurity = "gini"
val maxDepth = 5
val maxBins = 100

val model = DecisionTree.trainClassifier(data, numClasses,
  categoricalFeaturesInfo, impurity, maxDepth, maxBins)

val labelAndPreds = data.map { point =>
  val prediction = model.predict(point.features)
  (point.label, prediction)
}

val trainErr = labelAndPreds.filter(r => r._1 != r._2).count.toDouble / data.count
println("Training Error = " + trainErr)
println("Learned classification tree model:\n" + model)

-Yao
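
[A likely explanation, given here as an editorial sketch rather than an answer from the thread:
DecisionTree.trainClassifier expects class labels in the range 0 to numClasses - 1, so the iris
labels 1, 2, 3 fall outside that range when numClasses = 3. Shifting the labels down by one lets
numClasses = 3 work:

import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.tree.DecisionTree
import org.apache.spark.mllib.util.MLUtils

// Remap labels 1, 2, 3 to 0, 1, 2 before training.
val raw = MLUtils.loadLibSVMFile(sc, "data/iris.scale.txt")
val relabeled = raw.map(p => LabeledPoint(p.label - 1, p.features))

val model = DecisionTree.trainClassifier(relabeled, 3, Map[Int, Int](), "gini", 5, 100)
]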


Decision Tree with Categorical Features

2014-12-10 Thread Ge, Yao (Y.)
Can anyone provide an example code of using Categorical Features in Decision 
Tree?
Thanks!

-Yao
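
[A minimal sketch, not from the thread: categorical features are declared through the
categoricalFeaturesInfo map passed to trainClassifier, which maps a feature index to its number of
categories; the categorical feature values themselves must be encoded as 0, 1, ..., k-1. The feature
indices and counts below are made up for illustration, and trainingData is assumed to be an
RDD[LabeledPoint]:

import org.apache.spark.mllib.tree.DecisionTree

// Feature 0 takes values {0, 1, 2}; feature 4 takes values {0, 1};
// every other feature is treated as continuous.
val categoricalFeaturesInfo = Map(0 -> 3, 4 -> 2)

// maxBins must be at least as large as the largest category count (3 here).
val model = DecisionTree.trainClassifier(trainingData, 2, categoricalFeaturesInfo,
  "gini", 5, 32)
]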


scala.MatchError: class java.sql.Timestamp

2014-10-19 Thread Ge, Yao (Y.)
I am working with Spark 1.1.0, and I believe Timestamp is a supported data type
for Spark SQL. However, I keep getting this MatchError for java.sql.Timestamp
when I try to use reflection to register a Java bean with a Timestamp field.
Is there anything wrong with my code below?

public static class Event implements Serializable {
    private String name;
    private Timestamp time;

    public String getName() {
        return name;
    }
    public void setName(String name) {
        this.name = name;
    }
    public Timestamp getTime() {
        return time;
    }
    public void setTime(Timestamp time) {
        this.time = time;
    }
}

@Test
public void testTimeStamp() {
    JavaSparkContext sc = new JavaSparkContext("local", "timestamp");
    String[] data = {"1,2014-01-01", "2,2014-02-01"};
    JavaRDD<String> input = sc.parallelize(Arrays.asList(data));
    JavaRDD<Event> events = input.map(new Function<String, Event>() {
        public Event call(String arg0) throws Exception {
            String[] c = arg0.split(",");
            Event e = new Event();
            e.setName(c[0]);
            DateFormat fmt = new SimpleDateFormat("yyyy-MM-dd");
            e.setTime(new Timestamp(fmt.parse(c[1]).getTime()));
            return e;
        }
    });

    JavaSQLContext sqlCtx = new JavaSQLContext(sc);
    JavaSchemaRDD schemaEvent = sqlCtx.applySchema(events, Event.class);
    schemaEvent.registerTempTable("event");

    sc.stop();
}


RE: scala.MatchError: class java.sql.Timestamp

2014-10-19 Thread Ge, Yao (Y.)
scala.MatchError: class java.sql.Timestamp (of class java.lang.Class)
at org.apache.spark.sql.api.java.JavaSQLContext$$anonfun$getSchema$1.apply(JavaSQLContext.scala:189)
at org.apache.spark.sql.api.java.JavaSQLContext$$anonfun$getSchema$1.apply(JavaSQLContext.scala:188)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:108)
at org.apache.spark.sql.api.java.JavaSQLContext.getSchema(JavaSQLContext.scala:188)
at org.apache.spark.sql.api.java.JavaSQLContext.applySchema(JavaSQLContext.scala:90)
at com.ford.dtc.ff.SessionStats.testTimeStamp(SessionStats.java:111)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
at java.lang.reflect.Method.invoke(Unknown Source)
at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:45)
at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:15)
at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:42)
at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:20)
at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:263)
at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:68)
at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:47)
at org.junit.runners.ParentRunner$3.run(ParentRunner.java:231)
at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:60)
at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:229)
at org.junit.runners.ParentRunner.access$000(ParentRunner.java:50)
at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:222)
at org.junit.runners.ParentRunner.run(ParentRunner.java:300)
at org.eclipse.jdt.internal.junit4.runner.JUnit4TestReference.run(JUnit4TestReference.java:50)
at org.eclipse.jdt.internal.junit.runner.TestExecution.run(TestExecution.java:38)
at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:467)
at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:683)
at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.run(RemoteTestRunner.java:390)
at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.main(RemoteTestRunner.java:197)


From: Wang, Daoyuan [mailto:daoyuan.w...@intel.com]
Sent: Sunday, October 19, 2014 10:31 AM
To: Ge, Yao (Y.); user@spark.apache.org
Subject: RE: scala.MatchError: class java.sql.Timestamp

Can you provide the exception stack?

Thanks,
Daoyuan

From: Ge, Yao (Y.) [mailto:y...@ford.com]
Sent: Sunday, October 19, 2014 10:17 PM
To: user@spark.apache.org
Subject: scala.MatchError: class java.sql.Timestamp


Exception Logging

2014-10-16 Thread Ge, Yao (Y.)
I need help to better trap exceptions in map functions. What is the best way
to catch an exception and provide some helpful diagnostic information about the
source of the input, such as the file name (and ideally the line number if I am
processing a text file)?

-Yao
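
[One possible pattern, sketched here under the assumption that the input is a plain text file and
that parse() stands in for whatever the map actually does: wrap the per-record work in Try so a bad
record is logged with its position and content instead of failing the whole job. Capturing the file
name per record generally means processing files individually (for example with wholeTextFiles),
since an RDD built by textFile does not carry it.

import scala.util.{Failure, Success, Try}

def parse(line: String): Array[String] = line.split(",")   // hypothetical per-record work

val results = sc.textFile("data/input.txt").zipWithIndex().flatMap { case (line, idx) =>
  Try(parse(line)) match {
    case Success(value) => Some(value)
    case Failure(e) =>
      // Log enough context to find the offending record later.
      System.err.println(s"Record ${idx + 1} failed: '$line' (${e.getMessage})")
      None
  }
}
]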


RE: Dedup

2014-10-09 Thread Ge, Yao (Y.)
Yes. I was using a String array as the arguments in reduceByKey. I think the String
array is effectively immutable in this case, so simply returning the first argument
without cloning it should work. I will look into mapPartitions, as we can have up to
40% duplicates. Will follow up on this if necessary. Thanks very much, Sean!

-Yao  

-Original Message-
From: Sean Owen [mailto:so...@cloudera.com] 
Sent: Thursday, October 09, 2014 3:04 AM
To: Ge, Yao (Y.)
Cc: user@spark.apache.org
Subject: Re: Dedup

I think the question is about copying the argument. If it's an immutable value 
like String, yes just return the first argument and ignore the second. If 
you're dealing with a notoriously mutable value like a Hadoop Writable, you 
need to copy the value you return.

This works fine although you will spend a fair bit of time marshaling all of 
those duplicates together just to discard all but one.

If there are lots of duplicates, it would take a bit more work, but would be
faster, to do something like this: mapPartitions and retain one input value for
each unique dedup key within the partition, output those (key, value) pairs, and
then reduceByKey the result.

On Wed, Oct 8, 2014 at 8:37 PM, Ge, Yao (Y.) y...@ford.com wrote:
 I need to do deduplication processing in Spark. The current plan is to
 generate a tuple where the key is the dedup criteria and the value is the
 original input. I am thinking of using reduceByKey to discard duplicate
 values. If I do that, can I simply return the first argument or should
 I return a copy of the first argument? Is there a better way to do dedup in
 Spark?



 -Yao
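
[A rough sketch of the two-stage approach described above, with records assumed to be an
RDD[String] and dedupKey() standing in for whatever builds the dedup criteria: each partition keeps
only the first record per key before anything is shuffled, and reduceByKey then resolves the
duplicates that span partitions.

import org.apache.spark.SparkContext._   // brings in reduceByKey on pair RDDs (pre-1.3 Spark)

def dedupKey(record: String): String = record.trim.toLowerCase   // hypothetical criteria

val deduped = records
  .mapPartitions { it =>
    val seen = scala.collection.mutable.LinkedHashMap[String, String]()
    it.foreach(r => seen.getOrElseUpdate(dedupKey(r), r))   // first record wins within the partition
    seen.iterator                                           // emit (key, record) pairs
  }
  .reduceByKey((first, _) => first)   // first record wins across partitions; a String needs no copy
  .values
]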


Dedup

2014-10-08 Thread Ge, Yao (Y.)
I need to do deduplication processing in Spark. The current plan is to generate
a tuple where the key is the dedup criteria and the value is the original input. I am
thinking of using reduceByKey to discard duplicate values. If I do that, can I
simply return the first argument, or should I return a copy of the first
argument? Is there a better way to do dedup in Spark?

-Yao


RE: KMeans - java.lang.IllegalArgumentException: requirement failed

2014-08-12 Thread Ge, Yao (Y.)
I figured it out. My indices parameters for the sparse vector were messed up. It
was a good lesson for me:
When using Vectors.sparse(int size, int[] indices, double[] values) to
generate a vector, size is the size of the whole vector, not just the number of
elements with values. The indices array also needs to be in ascending order.
In many cases it is probably easier to use the other two forms of the Vectors.sparse
function if the indices and values are not naturally sorted.

-Yao
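
[For reference, a small sketch of the point above (the vector values are made up): size is the full
dimensionality, the indices are 0-based and must be strictly increasing, and the (index, value) pair
form of Vectors.sparse is convenient when the entries are not already sorted.

import org.apache.spark.mllib.linalg.Vectors

// A 10-dimensional vector with non-zeros at positions 0, 3 and 7.
val v1 = Vectors.sparse(10, Array(0, 3, 7), Array(1.0, 2.5, 0.3))

// The same vector built from unsorted (index, value) pairs; sorting is handled internally.
val v2 = Vectors.sparse(10, Seq((7, 0.3), (0, 1.0), (3, 2.5)))
]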


From: Ge, Yao (Y.)
Sent: Monday, August 11, 2014 11:44 PM
To: 'u...@spark.incubator.apache.org'
Subject: KMeans - java.lang.IllegalArgumentException: requirement failed



KMeans - java.lang.IllegalArgumentException: requirement failed

2014-08-11 Thread Ge, Yao (Y.)
I am trying to train a KMeans model with sparse vectors using Spark 1.0.1.
When I run the training I get the following exception:
java.lang.IllegalArgumentException: requirement failed
at scala.Predef$.require(Predef.scala:221)
at org.apache.spark.mllib.util.MLUtils$.fastSquaredDistance(MLUtils.scala:271)
at org.apache.spark.mllib.clustering.KMeans$.fastSquaredDistance(KMeans.scala:398)
at org.apache.spark.mllib.clustering.KMeans$$anonfun$findClosest$1.apply(KMeans.scala:372)
at org.apache.spark.mllib.clustering.KMeans$$anonfun$findClosest$1.apply(KMeans.scala:366)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at org.apache.spark.mllib.clustering.KMeans$.findClosest(KMeans.scala:366)
at org.apache.spark.mllib.clustering.KMeans$.pointCost(KMeans.scala:389)
at org.apache.spark.mllib.clustering.KMeans$$anonfun$17$$anonfun$apply$7.apply(KMeans.scala:269)
at org.apache.spark.mllib.clustering.KMeans$$anonfun$17$$anonfun$apply$7.apply(KMeans.scala:268)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at scala.collection.immutable.Range.foreach(Range.scala:141)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
at scala.collection.AbstractTraversable.map(Traversable.scala:105)
at org.apache.spark.mllib.clustering.KMeans$$anonfun$17.apply(KMeans.scala:268)
at org.apache.spark.mllib.clustering.KMeans$$anonfun$17.apply(KMeans.scala:267)

What does this mean? How do I troubleshoot this problem?
Thanks.

-Yao