Nearest neighbors in Spark with Annoy

2016-02-03 Thread apu mishra . rr
As MLlib doesn't have nearest-neighbors functionality, I'm trying to use
Annoy for approximate nearest neighbors. I try to broadcast the Annoy
index and pass it to workers; however, it does not behave as expected.

Below is complete code for reproducibility. The problem shows up in the
difference between running Annoy with and without Spark.

from annoy import AnnoyIndex
import random
random.seed(42)

f = 40
t = AnnoyIndex(f)  # length of the item vectors that will be indexed
allvectors = []
for i in xrange(20):
    v = [random.gauss(0, 1) for z in xrange(f)]
    t.add_item(i, v)
    allvectors.append((i, v))
t.build(10)  # 10 trees

# Use Annoy with Spark
sparkvectors = sc.parallelize(allvectors)
bct = sc.broadcast(t)
x = sparkvectors.map(lambda x: bct.value.get_nns_by_vector(vector=x[1], n=5))
print "Five closest neighbors for first vector with Spark:",
print x.first()

# Use Annoy without Spark
print "Five closest neighbors for first vector without Spark:",
print(t.get_nns_by_vector(vector=allvectors[0][1], n=5))


Output seen:

Five closest neighbors for first vector with Spark: None

Five closest neighbors for first vector without Spark: [0, 13, 12, 6, 4]
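A sketch of the usual workaround, assuming the cause is pickling: Annoy's index lives in C++ memory, so `sc.broadcast` (which pickles the object) ships a hollow copy to the workers, and queries against it return nothing useful. The common pattern is to save the index to a file on the driver, ship the file with `sc.addFile`, and have each task load it via `SparkFiles` (Annoy mmaps the file, so per-task loads are cheap). The file name `index.ann` is illustrative.

```python
from annoy import AnnoyIndex
from pyspark import SparkFiles

# Assumes `sc`, `t` (the built AnnoyIndex), and `sparkvectors` from the code above.
t.save("index.ann")      # persist the index to a local file on the driver
sc.addFile("index.ann")  # distribute the file to every worker

def nns(row):
    # Re-open the shipped index inside the task instead of relying on pickling.
    idx = AnnoyIndex(40)
    idx.load(SparkFiles.get("index.ann"))  # mmap, cheap to repeat per task
    return idx.get_nns_by_vector(vector=row[1], n=5)

print(sparkvectors.map(nns).first())
```

This is a sketch under the stated assumption, not a confirmed fix; if Annoy later gains pickle support, broadcasting the index directly may work as originally written.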


Getting the size of a broadcast variable

2016-02-01 Thread apu mishra . rr
How can I determine the size (in bytes) of a broadcast variable? Do I need
to use the .dump method and then look at the size of the result, or is
there an easier way?

Using PySpark with Spark 1.6.

Thanks!

Apu


StackOverflowError when writing dataframe to table

2015-12-09 Thread apu mishra . rr
The command

mydataframe.write.saveAsTable(name="tablename")

sometimes results in java.lang.StackOverflowError (see below for fuller
error message).

This is after I am able to successfully run cache() and show() methods on
mydataframe.

The issue is not deterministic, i.e. the same code sometimes works fine,
sometimes not.

I am running PySpark with:

spark-submit --master local[*] --driver-memory 24g --executor-memory 24g

Any help understanding this issue would be appreciated!

Thanks, Apu
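The trace below shows the DAG scheduler overflowing its thread stack while recursively serializing a deeply nested structure, which often points to a very long RDD/DataFrame lineage. Two common mitigations, offered as guesses rather than a confirmed fix: give the JVM threads a bigger stack, or truncate the lineage by checkpointing before the write. The `-Xss` value and script name here are illustrative.

```shell
# Guess 1: enlarge the JVM thread stack for the driver
spark-submit --master local[*] --driver-memory 24g --executor-memory 24g \
  --conf "spark.driver.extraJavaOptions=-Xss16m" myjob.py

# Guess 2 (inside the job): cut the lineage before saving, e.g.
#   sc.setCheckpointDir("/tmp/checkpoints")
#   mydataframe.rdd.checkpoint()
```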

Fuller error message:

Exception in thread "dag-scheduler-event-loop" java.lang.StackOverflowError
    at java.io.ObjectOutputStream$HandleTable.assign(ObjectOutputStream.java:2281)
    at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1428)
    at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
    at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548)
    at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509)
    at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
    at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
    at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:348)
    at scala.collection.immutable.List$SerializationProxy.writeObject(List.scala:468)
    at sun.reflect.GeneratedMethodAccessor10.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:497)
    at java.io.ObjectStreamClass.invokeWriteObject(ObjectStreamClass.java:988)
    at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1496)
    at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
    at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
    at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548)
    at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509)
    at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
    at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
    at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548)
    at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509)
    at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
    at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
    at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:348)
    at scala.collection.immutable.List$SerializationProxy.writeObject(List.scala:468)
    at sun.reflect.GeneratedMethodAccessor10.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:497)
    at java.io.ObjectStreamClass.invokeWriteObject(ObjectStreamClass.java:988)
    at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1496)
    at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
    at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
    at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548)
    at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509)
    at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
    at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
    at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548)
    at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509)
    at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
    at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
    at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548)
    at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509)
    at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
    at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
    at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548)
    at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509)
    at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
    at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
    at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548)
    at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509)
    at

Where does mllib's .save method save a model to?

2015-11-02 Thread apu mishra . rr
I want to save an mllib model to disk, and am trying the model.save
operation as described in
http://spark.apache.org/docs/latest/mllib-collaborative-filtering.html#examples:

model.save(sc, "myModelPath")

But after running it, I am unable to find any newly created file or
directory named "myModelPath" in any obvious place. Any ideas where
it might be?

Thanks, -Apu

To reproduce:

# In PySpark, create ALS or other mllib model, then
model.save(sc, "myModelPath")
# In Unix environment, try to find "myModelPath"
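For what it's worth, a guess at where to look: `model.save` resolves a relative path against Hadoop's default filesystem, so on a cluster with HDFS configured it lands in your HDFS home directory rather than on local disk, and it is written as a directory of part files (with `data/` and `metadata/` subdirectories), not a single file. Passing an explicit URI scheme removes the ambiguity.

```shell
# If fs.defaultFS points at HDFS, the model is an HDFS directory:
hdfs dfs -ls myModelPath
# In local mode it is relative to the driver's working directory:
ls -d ./myModelPath
# When saving, an explicit scheme avoids guessing, e.g. in the PySpark job:
#   model.save(sc, "file:///tmp/myModelPath")
```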
