This makes reasonable sense. The CF stuff does *use* math a fair bit but could be said not to *be* math in itself.
On the other hand, the core/math split in Mahout itself was motivated by the need to isolate the Hadoop dependencies. I am not clear that the same is true here. Is there an inherent need to separate these?

On Wed, Jun 25, 2014 at 8:52 AM, Pat Ferrel <[email protected]> wrote:

> Seems like the cf stuff, as well as other algos that are consumers of
> “math-scala” but are not really math, should go in a new “core” project,
> perhaps. If so, the pom should probably be pretty similar to math-scala so
> that any Spark dependencies are noticed. Keeping them in a Scala-only
> sub-project might allow for some future use of the Scala builder, sbt, but
> that’s for another discussion.
>
> Using the old naming conventions and adding the -scala suffix would suggest
> a “core-scala” sub-project with a pom similar to math-scala.
>
> If there are no objections, I’ll do that. I’m not a maven expert, though,
> so someone may want to look at that when the PR comes in.
>
>
> On Jun 19, 2014, at 6:49 PM, Pat Ferrel <[email protected]> wrote:
>
> What sub-project and package?
>
> In general, how do we want to handle new Scala code?
>
> I’m putting Spark-specific stuff like I/O and drivers in spark and using a
> new “drivers” package. There was one called “driver” in mrlegacy. Do we
> want to follow the old Java packaging as much as possible? This may cause
> naming conflicts, right?
>
> The only non-Spark-specific Scala sub-project is math-scala. Is this where
> we want cf/cooccurrence?
>
> Also, how do we want to handle CLI drivers? Seems like we might have
> something like “mahout-spark itemsimilarity -i hdfs://...”
>
>
> On Jun 19, 2014, at 1:02 PM, Pat Ferrel <[email protected]> wrote:
>
> Not sure if the previous mail got through.
>
> I'm in a car.
>
> No Spark deps in cf/cooccurrence; it can be moved.
>
> The deps are in I/O code in ItemSimilarityJob, the subject of the PR just
> before your first email.
>
> Sorry for the confusion
>
> Sent from my iPhone
>
>
> On Jun 19, 2014, at 12:06 PM, Anand Avati <[email protected]> wrote:
> >
> > Pat,
> > I don't seem to find such Spark-specific code in cf. The cf code itself
> > is engine-agnostic, but of course you need some engine to use it.
> > Similar to the distributed decomposition stuff in math-scala: they need
> > some engine to run them, but the code itself is engine-agnostic and in
> > math-scala. Am I missing something basic here?
> >
> >
> >> On Thu, Jun 19, 2014 at 11:47 AM, Pat Ferrel <[email protected]> wrote:
> >>
> >> Actually it has several Spark deps, like having a SparkContext, a
> >> SparkConf, and an RDD for file I/O.
> >> Please look before you vote. I’ve been waving this flag for a while:
> >> I/O is not engine-neutral.
> >>
> >>
> >> On Jun 19, 2014, at 11:41 AM, Sebastian Schelter <[email protected]> wrote:
> >>
> >> Hi Anand,
> >>
> >> Yes, this should not contain anything Spark-specific. +1 for moving it.
> >>
> >> --sebastian
> >>
> >>
> >>
> >>> On 06/19/2014 08:38 PM, Anand Avati wrote:
> >>> Hi Pat and others,
> >>> I see that cf/CooccurrenceAnalysis.scala is currently under spark. Is
> >>> there a specific reason? I see that the code itself is completely
> >>> Spark-agnostic.
> >>> I tried moving the code under
> >>> math-scala/src/main/scala/org/apache/mahout/math/cf/ with the
> >>> following trivial patch:
> >>>
> >>> diff --git a/spark/src/main/scala/org/apache/mahout/cf/CooccurrenceAnalysis.scala b/spark/src/main/scala/org/apache/mahout/cf/CooccurrenceAnalysis.scala
> >>> index ee44f90..bd20956 100644
> >>> --- a/spark/src/main/scala/org/apache/mahout/cf/CooccurrenceAnalysis.scala
> >>> +++ b/spark/src/main/scala/org/apache/mahout/cf/CooccurrenceAnalysis.scala
> >>> @@ -22,7 +22,6 @@ import scalabindings._
> >>>  import RLikeOps._
> >>>  import drm._
> >>>  import RLikeDrmOps._
> >>> -import org.apache.mahout.sparkbindings._
> >>>  import scala.collection.JavaConversions._
> >>>  import org.apache.mahout.math.stats.LogLikelihood
> >>>
> >>> and it seems to work just fine. From what I see, this should work just
> >>> fine on H2O as well with no changes. Why give up generality and make
> >>> it Spark-specific?
> >>>
> >>> Thanks
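To make the split under discussion concrete, here is a minimal sketch. The object names (CooccurrenceSketch, SparkDriverSketch), the one-argument cooccurrences signature, and the exact package locations are illustrative assumptions, not Mahout's actual code; the tree was being reorganized at the time, and the real CooccurrenceAnalysis adds downsampling and LLR filtering (via LogLikelihood) on top of the A'A product.

// Engine-agnostic half: written only against the DRM DSL, so it can
// live in math-scala. Nothing here names Spark.
// (Sketch only; simplified from the real CooccurrenceAnalysis.)
import org.apache.mahout.math.scalabindings._
import org.apache.mahout.math.scalabindings.RLikeOps._
import org.apache.mahout.math.drm._
import org.apache.mahout.math.drm.RLikeDrmOps._

object CooccurrenceSketch {
  // Item-item cooccurrence as A'A over a user-by-item interaction DRM.
  def cooccurrences(drmA: DrmLike[Int]): DrmLike[Int] =
    (drmA.t %*% drmA).checkpoint()
}

// Engine-specific half: only the driver pulls in sparkbindings, for
// context creation and data loading.
import org.apache.mahout.sparkbindings._

object SparkDriverSketch extends App {
  implicit val ctx = mahoutSparkContext(masterUrl = "local", appName = "cooc-sketch")
  // Placeholder in-memory input; a real driver would read a DRM from HDFS.
  val drmA = drmParallelize(dense((1, 0, 1), (0, 1, 1)), numPartitions = 2)
  val result = CooccurrenceSketch.cooccurrences(drmA)
}

Nothing in the algorithm half references Spark, which is why Anand's patch above only has to drop the one sparkbindings import; everything Spark-bound, the SparkContext/SparkConf and RDD-based I/O Pat points to, stays in the driver half.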
