Seems like the cf stuff, as well as other algos that are consumers of math-scala but are not really math, should perhaps go in a new "core" project. If so, the pom should probably be pretty similar to math-scala's, so that any Spark dependencies are noticed. Keeping them in a Scala-only sub-project might also allow for some future use of the Scala build tool, sbt, but that's for another discussion.
Using the old naming conventions and adding the -scala suffix would suggest a "core-scala" sub-project with a pom similar to math-scala. If there are no objections, I'll do that. I'm not a Maven expert though, so someone may want to look at that when the PR comes in.

On Jun 19, 2014, at 6:49 PM, Pat Ferrel <[email protected]> wrote:

What sub-project and package? In general, how do we want to handle new Scala code?

I'm putting Spark-specific stuff like I/O and drivers in the spark module and using a new "drivers" package. There was one called "driver" in mrlegacy. Do we want to follow the old Java packaging as much as possible? This may cause naming conflicts, right?

The only non-Spark-specific Scala sub-project is math-scala. Is this where we want cf/cooccurrence?

Also, how do we want to handle CLI drivers? Seems like we might have something like "mahout-spark itemsimilarity -i hdfs://..."

On Jun 19, 2014, at 1:02 PM, Pat Ferrel <[email protected]> wrote:

Not sure if the previous mail got through, I'm in a car.

No Spark deps in cf/cooccurrence, it can be moved. The deps are in I/O code in ItemSimilarityJob, the subject of the PR just before your first email.

Sorry for the confusion

Sent from my iPhone

> On Jun 19, 2014, at 12:06 PM, Anand Avati <[email protected]> wrote:
>
> Pat,
> I don't seem to find such Spark-specific code in cf.. cf code itself is
> engine agnostic. But of course you need some engine to use it. Similar to
> the distributed decomposition stuff in math-scala. They need some engine to
> run them, but the code itself is engine agnostic and in math-scala. Am I
> missing something basic here?
>
>
>> On Thu, Jun 19, 2014 at 11:47 AM, Pat Ferrel <[email protected]> wrote:
>>
>> Actually it has several Spark deps, like having a SparkContext, SparkConf,
>> and an RDD for file I/O.
>> Please look before you vote. I've been waving this flag for a while: I/O is
>> not engine neutral.
>>
>>
>> On Jun 19, 2014, at 11:41 AM, Sebastian Schelter <[email protected]> wrote:
>>
>> Hi Anand,
>>
>> Yes, this should not contain anything Spark-specific. +1 for moving it.
>>
>> --sebastian
>>
>>
>>> On 06/19/2014 08:38 PM, Anand Avati wrote:
>>> Hi Pat and others,
>>> I see that cf/CooccurrenceAnalysis.scala is currently under spark. Is there
>>> a specific reason? I see that the code itself is completely Spark agnostic.
>>> I tried moving the code under
>>> math-scala/src/main/scala/org/apache/mahout/math/cf/ with the following
>>> trivial patch:
>>>
>>> diff --git a/spark/src/main/scala/org/apache/mahout/cf/CooccurrenceAnalysis.scala b/spark/src/main/scala/org/apache/mahout/cf/CooccurrenceAnalysis.scala
>>> index ee44f90..bd20956 100644
>>> --- a/spark/src/main/scala/org/apache/mahout/cf/CooccurrenceAnalysis.scala
>>> +++ b/spark/src/main/scala/org/apache/mahout/cf/CooccurrenceAnalysis.scala
>>> @@ -22,7 +22,6 @@ import scalabindings._
>>>  import RLikeOps._
>>>  import drm._
>>>  import RLikeDrmOps._
>>> -import org.apache.mahout.sparkbindings._
>>>  import scala.collection.JavaConversions._
>>>  import org.apache.mahout.math.stats.LogLikelihood
>>>
>>>
>>> and it seems to work just fine. From what I see, this should work just fine
>>> on H2O as well with no changes. Why give up generality and make it Spark
>>> specific?
>>>
>>> Thanks
>>
>>
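To make the "engine agnostic" point concrete, here is a minimal sketch written against only the math-scala DRM DSL (DrmLike and the RLikeDrmOps operators). The object and method names are hypothetical illustrations, not the actual CooccurrenceAnalysis API:

package org.apache.mahout.math.cf

import org.apache.mahout.math.drm._
import org.apache.mahout.math.drm.RLikeDrmOps._

// Hypothetical example of the kind of code under discussion: pure DRM algebra
// that compiles against math-scala alone and never touches SparkContext,
// SparkConf, or RDDs.
object CooccurrenceSketch {

  // Computes the item co-occurrence matrix A'A from a user-by-item DRM.
  // Everything here goes through the engine-neutral DrmLike operators,
  // so any backend that implements the DRM API (Spark, H2O, ...) can run it.
  def cooccurrence(drmA: DrmLike[Int]): DrmLike[Int] =
    drmA.t %*% drmA
}

The Spark-specific pieces Pat mentions (SparkContext, SparkConf, RDD-based reading and writing) would then stay in the spark module, in a driver such as ItemSimilarityJob that loads a DRM from HDFS, calls a routine like the one above, and writes the result back out.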
