Jcenter / bintray support for spark packages?

2015-06-10 Thread Hector Yee
Hi Spark devs,

Is it possible to add JCenter or Bintray support for Spark Packages?

I'm trying to add our artifact, which is on JCenter:

https://bintray.com/airbnb/aerosolve

but I noticed that Spark Packages only accepts Maven coordinates.

-- 
Yee Yang Li Hector
google.com/+HectorYee




Re: Spark/Mesos

2015-05-05 Thread Hector Yee
Speaking as a user of Spark on Mesos:

Yes, each Spark application appears as a separate framework on the Mesos
master.

In fine-grained mode the number of executors grows and shrinks, versus
staying fixed in coarse-grained mode.
I would not run fine-grained mode on a large cluster, as it can potentially
spin up a lot of executors and DDoS the Mesos master.
In a shared environment, coarse-grained mode seems better behaved: you can
cap an application's resource usage with --conf spark.cores.max, and the
executors stick around instead of growing and shrinking per stage.
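
For example, capping a coarse-grained job from inside the driver looks
roughly like this (a minimal sketch; the master URL and core cap are
illustrative, not from this thread):

    import org.apache.spark.{SparkConf, SparkContext}

    // Coarse-grained Mesos mode with a hard cap on total cores, so one
    // application cannot starve a shared cluster.
    val conf = new SparkConf()
      .setMaster("mesos://zk://zk1:2181/mesos") // illustrative master URL
      .set("spark.mesos.coarse", "true")        // coarse-grained mode
      .set("spark.cores.max", "64")             // cap total cores for this app
    val sc = new SparkContext(conf)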


Re: Storing large data for MLlib machine learning

2015-04-01 Thread Hector Yee
I use Thrift, base64-encode the serialized binary, and save it as text-file
lines compressed with snappy or gzip.

That makes it very easy to copy small chunks locally and play with subsets
of the data, without taking a dependency on HDFS/Hadoop server machinery,
for example.
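
In code, the approach looks roughly like this (a sketch; the Thrift-generated
struct Example and the records RDD are assumed for illustration):

    import java.util.Base64
    import org.apache.hadoop.io.compress.GzipCodec
    import org.apache.spark.rdd.RDD
    import org.apache.thrift.TSerializer
    import org.apache.thrift.protocol.TBinaryProtocol

    // Thrift-serialize each record, base64 it so it is safe as a text line,
    // and let the Hadoop codec handle compression on write.
    def save(records: RDD[Example], path: String): Unit = {
      val lines = records.mapPartitions { it =>
        val ser = new TSerializer(new TBinaryProtocol.Factory()) // one per partition
        it.map(r => Base64.getEncoder.encodeToString(ser.serialize(r)))
      }
      lines.saveAsTextFile(path, classOf[GzipCodec])
    }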


On Thu, Mar 26, 2015 at 2:51 PM, Ulanov, Alexander alexander.ula...@hp.com
wrote:

 Thanks, Evan. What do you think about Protobuf? Twitter has a library to
 manage protobuf files in HDFS: https://github.com/twitter/elephant-bird


 From: Evan R. Sparks [mailto:evan.spa...@gmail.com]
 Sent: Thursday, March 26, 2015 2:34 PM
 To: Stephen Boesch
 Cc: Ulanov, Alexander; dev@spark.apache.org
 Subject: Re: Storing large data for MLlib machine learning

 On binary file formats - I looked at HDF5+Spark a couple of years ago and
 found it barely JVM-friendly and very Hadoop-unfriendly (e.g., the APIs
 needed filenames as input; you couldn't pass in anything like an
 InputStream). I don't know if it has gotten any better.

 Parquet plays much more nicely and there are lots of Spark-related
 projects using it already. Keep in mind that it's column-oriented, which
 might impact performance - but basically you're going to want your
 features in a byte array, and deserialization should be pretty
 straightforward.

 On Thu, Mar 26, 2015 at 2:26 PM, Stephen Boesch java...@gmail.com wrote:
 There are some convenience methods you might consider including:

    MLUtils.loadLibSVMFile
    MLUtils.loadLabeledPoints
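
 (For reference, a minimal usage sketch of the first helper; the path is
 illustrative:)

     import org.apache.spark.mllib.util.MLUtils

     // Loads LIBSVM-format text into an RDD[LabeledPoint] of labels and
     // sparse feature vectors.
     val data = MLUtils.loadLibSVMFile(sc, "data/sample_libsvm_data.txt")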

 2015-03-26 14:16 GMT-07:00 Ulanov, Alexander alexander.ula...@hp.com:

  Hi,
 
   Could you suggest a reasonable file format for storing feature vector
   data for machine learning in Spark MLlib? Are there any best practices
   for Spark?
  
   My data is dense feature vectors with labels. Some of the requirements
   are that the format should be easily loaded/serialized, randomly
   accessible, and have a small footprint (binary). I am considering
   Parquet, HDF5, and protocol buffers (protobuf), but I have little to no
   experience with them, so any suggestions would be really appreciated.
 
  Best regards, Alexander
 




-- 
Yee Yang Li Hector
google.com/+HectorYee


Re: over 10000 commits!

2015-03-06 Thread Hector Yee
Congrats!

On Thu, Mar 5, 2015 at 1:34 PM, shane knapp skn...@berkeley.edu wrote:

 WOOT!

 On Thu, Mar 5, 2015 at 1:26 PM, Reynold Xin r...@databricks.com wrote:

  We reached a new milestone today.
 
  https://github.com/apache/spark
 
 
   10,001 commits now. Congratulations to Xiangrui for making the 10,000th
   commit!
 




-- 
Yee Yang Li Hector
google.com/+HectorYee


Re: [ANNOUNCE] Spark 1.2.0 Release Preview Posted

2014-11-20 Thread Hector Yee
I'm seeing a lot of lost tasks with this build on a large Mesos cluster.
It happens with both hash and sort shuffles.

14/11/20 18:08:38 WARN TaskSetManager: Lost task 9.1 in stage 1.0 (TID 897,
i-d4d6553a.inst.aws.airbnb.com): FetchFailed(null, shuffleId=1, mapId=-1,
reduceId=9, message=
org.apache.spark.shuffle.MetadataFetchFailedException: Missing an output
location for shuffle 1
    at org.apache.spark.MapOutputTracker$$anonfun$org$apache$spark$MapOutputTracker$$convertMapStatuses$1.apply(MapOutputTracker.scala:386)
    at org.apache.spark.MapOutputTracker$$anonfun$org$apache$spark$MapOutputTracker$$convertMapStatuses$1.apply(MapOutputTracker.scala:383)
    at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
    at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
    at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
    at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
    at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
    at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:108)
    at org.apache.spark.MapOutputTracker$.org$apache$spark$MapOutputTracker$$convertMapStatuses(MapOutputTracker.scala:382)
    at org.apache.spark.MapOutputTracker.getServerStatuses(MapOutputTracker.scala:178)
    at org.apache.spark.shuffle.hash.BlockStoreShuffleFetcher$.fetch(BlockStoreShuffleFetcher.scala:42)
    at org.apache.spark.shuffle.hash.HashShuffleReader.read(HashShuffleReader.scala:40)
    at org.apache.spark.rdd.ShuffledRDD.compute(ShuffledRDD.scala:92)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)


On Thu, Nov 20, 2014 at 7:42 AM, Nan Zhu zhunanmcg...@gmail.com wrote:

 BTW, this PR https://github.com/apache/spark/pull/2524 is related to a
 blocker-level bug, and it is actually close to being merged (it has been
 reviewed for several rounds).

 I would appreciate it if anyone could continue the review process,

 @mateiz

 --
 Nan Zhu
 http://codingcat.me


 On Thursday, November 20, 2014 at 10:17 AM, Corey Nolet wrote:

   I was actually about to post this myself - I have a complex join that
   could benefit from something like a GroupComparator vs having to do
   multiple groupBy operations. This is probably the wrong thread for a
   full discussion on this, but I didn't see a JIRA ticket for this or
   anything similar - any reasons why this would not make sense given
   Spark's design?
 
   On Thu, Nov 20, 2014 at 9:39 AM, Madhu ma...@madhu.com wrote:
 
    Thanks Patrick.
   
    I've been testing some 1.2 features, looks good so far.
    I have some example code that I think will be helpful for certain
    MR-style use cases (secondary sort).
    Can I still add that to the 1.2 documentation, or is that frozen at
    this point?
  
  
  
    --
    Madhu
    https://www.linkedin.com/in/msiddalingaiah
  
  
 
 
 
 





-- 
Yee Yang Li Hector
google.com/+HectorYee


Re: [VOTE] Release Apache Spark 1.1.1 (RC2)

2014-11-20 Thread Hector Yee
I'm still seeing the fetch failed error and updated
https://issues.apache.org/jira/browse/SPARK-3633

On Thu, Nov 20, 2014 at 10:21 AM, Marcelo Vanzin van...@cloudera.com
wrote:

 +1 (non-binding)

 . ran simple things on spark-shell
 . ran jobs in yarn client & cluster modes, and standalone cluster mode

 On Wed, Nov 19, 2014 at 2:51 PM, Andrew Or and...@databricks.com wrote:
  Please vote on releasing the following candidate as Apache Spark version
  1.1.1.
 
  This release fixes a number of bugs in Spark 1.1.0. Some of the notable
  ones are:
  - [SPARK-3426] Sort-based shuffle compression settings are incompatible
  - [SPARK-3948] Stream corruption issues in sort-based shuffle
  - [SPARK-4107] Incorrect handling of Channel.read() led to data
 truncation
  The full list is at http://s.apache.org/z9h and in the CHANGES.txt
 attached.
 
  Additionally, this candidate fixes two blockers from the previous RC:
  - [SPARK-4434] Cluster mode jar URLs are broken
  - [SPARK-4480][SPARK-4467] Too many open files exception from shuffle
 spills
 
  The tag to be voted on is v1.1.1-rc2 (commit 3693ae5d):
  http://s.apache.org/p8
 
  The release files, including signatures, digests, etc can be found at:
  http://people.apache.org/~andrewor14/spark-1.1.1-rc2/
 
  Release artifacts are signed with the following key:
  https://people.apache.org/keys/committer/andrewor14.asc
 
  The staging repository for this release can be found at:
  https://repository.apache.org/content/repositories/orgapachespark-1043/
 
  The documentation corresponding to this release can be found at:
  http://people.apache.org/~andrewor14/spark-1.1.1-rc2-docs/
 
  Please vote on releasing this package as Apache Spark 1.1.1!
 
  The vote is open until Saturday, November 22, at 23:00 UTC and passes if
  a majority of at least 3 +1 PMC votes are cast.
  [ ] +1 Release this package as Apache Spark 1.1.1
  [ ] -1 Do not release this package because ...
 
  To learn more about Apache Spark, please see
  http://spark.apache.org/
 
  Cheers,
  Andrew
 
 



 --
 Marcelo





-- 
Yee Yang Li Hector
google.com/+HectorYee


Re: [VOTE] Release Apache Spark 1.1.1 (RC2)

2014-11-20 Thread Hector Yee
I think it is a race condition caused by netty deactivating a channel while
it is active.
Switched to nio and it works fine:
--conf spark.shuffle.blockTransferService=nio
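
(Equivalently, the property can be set programmatically in the driver; a
workaround sketch for jobs hitting this:)

    import org.apache.spark.SparkConf

    // Fall back to the NIO-based block transfer service instead of netty.
    val conf = new SparkConf().set("spark.shuffle.blockTransferService", "nio")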

On Thu, Nov 20, 2014 at 10:44 AM, Hector Yee hector@gmail.com wrote:

 I'm still seeing the fetch failed error and updated
 https://issues.apache.org/jira/browse/SPARK-3633

 [...]




-- 
Yee Yang Li Hector
google.com/+HectorYee


Re: [VOTE] Release Apache Spark 1.1.1 (RC2)

2014-11-20 Thread Hector Yee
This is whatever was in http://people.apache.org/~andrewor14/spark-1.1.1-rc2/

On Thu, Nov 20, 2014 at 11:48 AM, Matei Zaharia matei.zaha...@gmail.com
wrote:

 Hector, is this a comment on 1.1.1 or on the 1.2 preview?

 Matei

  On Nov 20, 2014, at 11:39 AM, Hector Yee hector@gmail.com wrote:
 
  I think it is a race condition caused by netty deactivating a channel
  while it is active.
  Switched to nio and it works fine:
  --conf spark.shuffle.blockTransferService=nio
 
  [...]




-- 
Yee Yang Li Hector
google.com/+HectorYee


Re: [VOTE] Release Apache Spark 1.1.1 (RC2)

2014-11-20 Thread Hector Yee
Whoops, I must have used the 1.2 preview and mixed them up.

spark-shell -version shows version 1.2.0.

Will update the bug https://issues.apache.org/jira/browse/SPARK-4516 to 1.2.

On Thu, Nov 20, 2014 at 11:59 AM, Matei Zaharia matei.zaha...@gmail.com
wrote:

 Ah, I see. But the spark.shuffle.blockTransferService property doesn't
 exist in 1.1 (AFAIK) -- what exactly are you doing to get this problem?

 Matei

 On Nov 20, 2014, at 11:50 AM, Hector Yee hector@gmail.com wrote:

 This is whatever was in http://people.apache.org/~andrewor14/spark-1.1.1-rc2/

 [...]





-- 
Yee Yang Li Hector
google.com/+HectorYee


Re: Contributing to MLlib: Proposal for Clustering Algorithms

2014-07-08 Thread Hector Yee
I would say that for big-data applications the most useful would be
hierarchical k-means with backtracking and the ability to support the k
nearest centroids.


On Tue, Jul 8, 2014 at 10:54 AM, RJ Nowling rnowl...@gmail.com wrote:

 Hi all,

 MLlib currently has one clustering algorithm implementation, KMeans.
 It would benefit from having implementations of other clustering
 algorithms such as MiniBatch KMeans, Fuzzy C-Means, Hierarchical
 Clustering, and Affinity Propagation.

 I recently submitted a PR [1] for a MiniBatch KMeans implementation,
 and I saw an email on this list about interest in implementing Fuzzy
 C-Means.

 Based on Sean Owen's review of my MiniBatch KMeans code, it became
 apparent that before I implement more clustering algorithms, it would
 be useful to hammer out a framework to reduce code duplication and
 implement a consistent API.

 I'd like to gauge the interest and goals of the MLlib community:

 1. Are you interested in having more clustering algorithms available?

 2. Is the community interested in specifying a common framework?

 Thanks!
 RJ

 [1] - https://github.com/apache/spark/pull/1248


 --
 em rnowl...@gmail.com
 c 954.496.2314




-- 
Yee Yang Li Hector
google.com/+HectorYee


Re: Contributing to MLlib: Proposal for Clustering Algorithms

2014-07-08 Thread Hector Yee
No idea, never looked it up. I've always just implemented it as running
k-means again on each cluster.

FWIW, standard k-means with Euclidean distance also has problems with some
dimensionality-reduction methods. Swapping out the distance metric for a
negative dot product or cosine distance may help (toy sketch below).

Another useful form of clustering would be hierarchical SVD. The reason I
like hierarchical clustering is that it makes for faster inference,
especially over billions of users.
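
(For instance, swapping in cosine distance might look like this; a toy
sketch on plain dense arrays, not MLlib code:)

    // Cosine distance: 1 - (a . b) / (|a| |b|), on Array[Double].
    def cosineDistance(a: Array[Double], b: Array[Double]): Double = {
      val dot   = a.zip(b).map { case (x, y) => x * y }.sum
      val norms = math.sqrt(a.map(x => x * x).sum) *
                  math.sqrt(b.map(x => x * x).sum)
      1.0 - dot / norms
    }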


On Tue, Jul 8, 2014 at 1:24 PM, Dmitriy Lyubimov dlie...@gmail.com wrote:

 Hector, could you share the references for hierarchical K-means? thanks.


  [...]
 




-- 
Yee Yang Li Hector
google.com/+HectorYee


Re: Contributing to MLlib: Proposal for Clustering Algorithms

2014-07-08 Thread Hector Yee
K doesn't matter much. I've tried everything from 2^10 to 10^3 and the
performance doesn't change much, as measured by precision @ K (see table 1
of http://machinelearning.wustl.edu/mlpapers/papers/weston13). Although the
10^3 flat k-means did slightly outperform the 2^10 hierarchical SVD on the
metrics, the 2^10 SVD was much faster in terms of inference time.

I found that the thing that affected performance most was adding
backtracking to fix mistakes made at higher levels, rather than how the K
per level is picked.



On Tue, Jul 8, 2014 at 1:50 PM, Dmitriy Lyubimov dlie...@gmail.com wrote:

 sure. more interesting problem here is choosing k at each level. Kernel
 methods seem to be most promising.


 [...]




-- 
Yee Yang Li Hector
google.com/+HectorYee


Re: Contributing to MLlib: Proposal for Clustering Algorithms

2014-07-08 Thread Hector Yee
No, I was thinking more top-down:

Assuming a distributed k-means system already exists, recursively apply the
k-means algorithm on data already partitioned by the previous level of
k-means (a rough sketch follows below).

I haven't been much of a fan of bottom-up approaches like HAC, mainly
because they assume there is already a distance metric between items. This
makes it hard to cluster new content. The distances between sibling
clusters are also hard to compute (if you have thrown away the similarity
matrix): do you count paths to the same parent node when computing
distances between items in two adjacent nodes, for example? It is also a
bit harder to distribute the computation for bottom-up approaches, since
you have to already find the nearest neighbor to an item to begin the
process.
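
A rough sketch of that top-down recursion on top of MLlib's KMeans (the k,
depth, and iteration counts are illustrative, and a real version would cache
or repartition each subset):

    import org.apache.spark.mllib.clustering.KMeans
    import org.apache.spark.mllib.linalg.Vector
    import org.apache.spark.rdd.RDD

    // Recursively re-run k-means on the points assigned to each cluster.
    def hierarchicalKMeans(data: RDD[Vector], k: Int, depth: Int): Unit = {
      if (depth > 0 && data.count() > k) {
        val model = KMeans.train(data, k, 20) // 20 iterations, illustrative
        (0 until k).foreach { c =>
          val subset = data.filter(v => model.predict(v) == c)
          hierarchicalKMeans(subset, k, depth - 1)
        }
      }
    }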


On Tue, Jul 8, 2014 at 1:59 PM, RJ Nowling rnowl...@gmail.com wrote:

 The scikit-learn implementation may be of interest:


 http://scikit-learn.org/stable/modules/generated/sklearn.cluster.Ward.html#sklearn.cluster.Ward

 It's a bottom-up approach. The pair of clusters to merge is chosen to
 minimize variance.

 Their code is under a BSD license so it can be used as a template.

 Is something like that what you were thinking, Hector?

 On Tue, Jul 8, 2014 at 4:50 PM, Dmitriy Lyubimov dlie...@gmail.com
 wrote:
  sure. more interesting problem here is choosing k at each level. Kernel
  methods seem to be most promising.
 
 
  [...]



 --
 em rnowl...@gmail.com
 c 954.496.2314




-- 
Yee Yang Li Hector
google.com/+HectorYee


Re: Contributing to MLlib: Proposal for Clustering Algorithms

2014-07-08 Thread Hector Yee
Yeah, if one were to replace the objective function in a decision tree with
minimizing the variance of the leaf nodes, it would be a hierarchical
clusterer.
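
Concretely, the split criterion would become variance reduction; a tiny
illustrative sketch (1-D values for brevity, not MLlib's actual code):

    // Within-node variance of a set of values.
    def variance(xs: Seq[Double]): Double = {
      val m = xs.sum / xs.size
      xs.map(x => (x - m) * (x - m)).sum / xs.size
    }

    // Score a candidate split by the weighted variance it removes.
    def varianceReduction(left: Seq[Double], right: Seq[Double]): Double = {
      val parent = left ++ right
      variance(parent) -
        (left.size.toDouble / parent.size) * variance(left) -
        (right.size.toDouble / parent.size) * variance(right)
    }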


On Tue, Jul 8, 2014 at 2:12 PM, Evan R. Sparks evan.spa...@gmail.com
wrote:

 If you're thinking along these lines, have a look at the DecisionTree
 implementation in MLlib. It uses the same idea and is optimized to prevent
 multiple passes over the data by computing several splits at each level of
 tree building. The tradeoff is increased model state and computation per
 pass over the data, but fewer total passes and hopefully lower
 communication overheads than, say, shuffling data around that belongs to
 one cluster or another. Something like that could work here as well.

 I'm not super-familiar with hierarchical K-Means so perhaps there's a more
 efficient way to implement it, though.


  [...]