No inherent need. The original question, raised by Anand, was about Spark dependencies. The math and cooccurrence code have no Spark dependency; anything that does file I/O does. Math-scala does not have Spark in its pom; the spark module, with the I/O and CLI stuff, does. Speaking for Sebastian and Dmitriy (with some ignorance), I think the idea was to isolate anything with Spark dependencies, much as we did before with Hadoop.
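To make the coupling concrete, the I/O and driver side looks roughly like this (a sketch only; the app name and input path are placeholders), which is why it has to live in the spark module:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.hadoop.io.Text
import org.apache.mahout.math.VectorWritable

// Reading the input ties the driver to Spark: it needs a SparkContext and
// Spark's SequenceFile reader, neither of which math-scala should depend on.
val sc = new SparkContext(new SparkConf().setAppName("itemsimilarity-driver"))
val input = sc.sequenceFile("hdfs://some/input/path", classOf[Text], classOf[VectorWritable])

// The cooccurrence math itself only touches the engine-agnostic DRM
// abstractions, so none of the above is needed there.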
Why split math-scala and core-scala? No reason other than blindly following the legacy structure. I can imagine writing code that needs math and cf but not I/O, or at least not Spark I/O, but we don’t have that now. All I/O is Spark specific; even the Spark shell needs to read SequenceFiles using Spark’s I/O. Sebastian is the primary author of the cooccurrence algo and voted to move it but didn’t say where. If we move it to math-scala I’d vote we rename that to core-scala to make things obvious to consumers writing poms or using sbt.

On Jun 30, 2014, at 12:12 AM, Ted Dunning <[email protected]> wrote:

This makes reasonable sense. The CF stuff does *use* math a fair bit but could be said not to *be* math in itself.

On the other hand, the core/math split in Mahout itself was motivated by the need to isolate the Hadoop dependencies. I am not clear that the same is true here. Is there an inherent need to separate these?

On Wed, Jun 25, 2014 at 8:52 AM, Pat Ferrel <[email protected]> wrote:

> Seems like the cf stuff as well as other algos that are consumers of
> “math-scala” but are not really math, should go in a new “core” project
> perhaps. If so the pom should probably be pretty similar to math-scala so
> that any Spark dependencies are noticed. Keeping them in a scala only
> sub-project might allow for some future use of the Scala builder—sbt, but
> that’s for another discussion.
>
> Using the old naming conventions and adding the -scala would suggest a
> “core-scala” sub-project with a pom similar to math-scala.
>
> If there are no objections, I’ll do that. I’m not a maven expert though so
> someone may want to look at that when the PR comes in.
>
>
> On Jun 19, 2014, at 6:49 PM, Pat Ferrel <[email protected]> wrote:
>
> What sub-project and package?
>
> In general how do we want to handle new Scala code?
>
> I’m putting Spark specific stuff like I/O and drivers in Spark and using a
> new “drivers” package. There was one called “driver” in mrlegacy. Do we
> want to follow the old Java packaging as much as possible? This may cause
> naming conflicts, right?
>
> The only non-Spark specific Scala sub-project is math-scala. Is this where
> we want cf/cooccurrence?
>
> Also how do we want to handle CLI drivers? Seems like we might have
> something like “mahout-spark itemsimilarity -i hdfs://...”
>
>
> On Jun 19, 2014, at 1:02 PM, Pat Ferrel <[email protected]> wrote:
>
> Not sure if the previous mail got through
>
> I'm in a car
>
> No Spark deps in cf/cooccurrence, it can be moved
>
> The deps are in I/O code in ItemSimilarityJob, the subject of the PR just
> before your first email
>
> Sorry for the confusion
>
> Sent from my iPhone
>
>> On Jun 19, 2014, at 12:06 PM, Anand Avati <[email protected]> wrote:
>>
>> Pat,
>> I don't seem to find such Spark specific code in cf. The cf code itself is
>> engine agnostic. But of course you need some engine to use it. Similar to
>> the distributed decomposition stuff in math-scala. They need some engine
>> to run them, but the code itself is engine agnostic and in math-scala.
>> Am I missing something basic here?
>>
>>
>>> On Thu, Jun 19, 2014 at 11:47 AM, Pat Ferrel <[email protected]> wrote:
>>>
>>> Actually it has several Spark deps like having a SparkContext, a SparkConf,
>>> and an RDD for file I/O.
>>> Please look before you vote. I’ve been waving this flag for a while—I/O is
>>> not engine neutral.
>>>
>>>
>>> On Jun 19, 2014, at 11:41 AM, Sebastian Schelter <[email protected]> wrote:
>>>
>>> Hi Anand,
>>>
>>> Yes, this should not contain anything spark-specific. +1 for moving it.
>>>
>>> --sebastian
>>>
>>>
>>>
>>>> On 06/19/2014 08:38 PM, Anand Avati wrote:
>>>> Hi Pat and others,
>>>> I see that cf/CooccurrenceAnalysis.scala is currently under spark. Is
>>>> there a specific reason? I see that the code itself is completely spark
>>>> agnostic. I tried moving the code under
>>>> math-scala/src/main/scala/org/apache/mahout/math/cf/ with the following
>>>> trivial patch:
>>>>
>>>> diff --git a/spark/src/main/scala/org/apache/mahout/cf/CooccurrenceAnalysis.scala b/spark/src/main/scala/org/apache/mahout/cf/CooccurrenceAnalysis.scala
>>>> index ee44f90..bd20956 100644
>>>> --- a/spark/src/main/scala/org/apache/mahout/cf/CooccurrenceAnalysis.scala
>>>> +++ b/spark/src/main/scala/org/apache/mahout/cf/CooccurrenceAnalysis.scala
>>>> @@ -22,7 +22,6 @@ import scalabindings._
>>>>  import RLikeOps._
>>>>  import drm._
>>>>  import RLikeDrmOps._
>>>> -import org.apache.mahout.sparkbindings._
>>>>  import scala.collection.JavaConversions._
>>>>  import org.apache.mahout.math.stats.LogLikelihood
>>>>
>>>>
>>>> and it seems to work just fine. From what I see, this should work just
>>>> fine on H2O as well with no changes. Why give up generality and make it
>>>> spark specific?
>>>>
>>>> Thanks
