No inherent need. The original question, raised by Anand, was about Spark dependencies. The math and cooccurrence code have no Spark dependency; anything that does file I/O does. Math-scala does not have Spark in its pom; the spark module, with the I/O and CLI stuff, does. Speaking for Sebastian and Dmitriy (with some ignorance), I think the idea was to isolate anything with Spark dependencies, much as we did before with Hadoop.
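To make the coupling concrete, the I/O and driver side looks roughly like this (a sketch only; the app name and input path are placeholders), which is why it has to live in the spark module:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.hadoop.io.Text
import org.apache.mahout.math.VectorWritable

// Reading the input ties the driver to Spark: it needs a SparkContext and
// Spark's SequenceFile reader, neither of which math-scala should depend on.
val sc = new SparkContext(new SparkConf().setAppName("itemsimilarity-driver"))
val input = sc.sequenceFile("hdfs://some/input/path", classOf[Text], classOf[VectorWritable])

// The cooccurrence math itself only touches the engine-agnostic DRM
// abstractions, so none of the above is needed there.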
Why split math-scala and core-scala? No reason other than blindly following the legacy structure. I can imagine writing code that needs math and cf but not I/O, or at least not Spark I/O, but we don’t have that now. All I/O is Spark specific; even the Spark shell needs to read SequenceFiles using Spark’s I/O. Sebastian is the primary author of the cooccurrence algo and voted to move it but didn’t say where. If we move it to math-scala I’d vote we rename that to core-scala to make things obvious to consumers writing poms or using sbt.

On Jun 30, 2014, at 12:12 AM, Ted Dunning <[email protected]> wrote:

This makes reasonable sense. The CF stuff does *use* math a fair bit but could be said not to *be* math in itself.

On the other hand, the core/math split in Mahout itself was motivated by the need to isolate the Hadoop dependencies. I am not clear that the same is true here. Is there an inherent need to separate these?

On Wed, Jun 25, 2014 at 8:52 AM, Pat Ferrel <[email protected]> wrote:

> Seems like the cf stuff as well as other algos that are consumers of
> “math-scala” but are not really math, should go in a new “core” project
> perhaps. If so the pom should probably be pretty similar to math-scala so
> that any Spark dependencies are noticed. Keeping them in a scala only
> sub-project might allow for some future use of the Scala builder—sbt, but
> that’s for another discussion.
>
> Using the old naming conventions and adding the -scala would suggest a
> “core-scala” sub-project with a pom similar to math-scala.
>
> If there are no objections, I’ll do that. I’m not a maven expert though so
> someone may want to look at that when the PR comes in.
>
>
> On Jun 19, 2014, at 6:49 PM, Pat Ferrel <[email protected]> wrote:
>
> What sub-project and package?
>
> In general how do we want to handle new Scala code?
>
> I’m putting Spark specific stuff like I/O and drivers in Spark and using a
> new “drivers” package. There was one called “driver” in mrlegacy. Do we
> want to follow the old Java packaging as much as possible? This may cause
> naming conflicts, right?
>
> The only non-Spark specific Scala sub-project is math-scala. Is this where
> we want cf/cooccurrence?
>
> Also how do we want to handle CLI drivers? Seems like we might have
> something like “mahout-spark itemsimilarity -i hdfs://...”
>
>
> On Jun 19, 2014, at 1:02 PM, Pat Ferrel <[email protected]> wrote:
>
> Not sure if the previous mail got through
>
> I'm in a car
>
> No Spark deps in cf/cooccurrence, it can be moved
>
> The deps are in I/O code in ItemSimilarityJob, the subject of the PR just
> before your first email
>
> Sorry for the confusion
>
> Sent from my iPhone
>
>> On Jun 19, 2014, at 12:06 PM, Anand Avati <[email protected]> wrote:
>>
>> Pat,
>> I don't seem to find such Spark specific code in cf. The cf code itself is
>> engine agnostic. But of course you need some engine to use it. Similar to
>> the distributed decomposition stuff in math-scala. They need some engine
>> to run them, but the code itself is engine agnostic and in math-scala.
>> Am I missing something basic here?
>>
>>
>>> On Thu, Jun 19, 2014 at 11:47 AM, Pat Ferrel <[email protected]> wrote:
>>>
>>> Actually it has several Spark deps like having a SparkContext, a SparkConf,
>>> and an RDD for file I/O.
>>> Please look before you vote. I’ve been waving this flag for a while—I/O is
>>> not engine neutral.
>>>
>>>
>>> On Jun 19, 2014, at 11:41 AM, Sebastian Schelter <[email protected]> wrote:
>>>
>>> Hi Anand,
>>>
>>> Yes, this should not contain anything spark-specific. +1 for moving it.
>>>
>>> --sebastian
>>>
>>>
>>>
>>>> On 06/19/2014 08:38 PM, Anand Avati wrote:
>>>> Hi Pat and others,
>>>> I see that cf/CooccurrenceAnalysis.scala is currently under spark. Is
>>>> there a specific reason? I see that the code itself is completely spark
>>>> agnostic. I tried moving the code under
>>>> math-scala/src/main/scala/org/apache/mahout/math/cf/ with the following
>>>> trivial patch:
>>>>
>>>> diff --git a/spark/src/main/scala/org/apache/mahout/cf/CooccurrenceAnalysis.scala b/spark/src/main/scala/org/apache/mahout/cf/CooccurrenceAnalysis.scala
>>>> index ee44f90..bd20956 100644
>>>> --- a/spark/src/main/scala/org/apache/mahout/cf/CooccurrenceAnalysis.scala
>>>> +++ b/spark/src/main/scala/org/apache/mahout/cf/CooccurrenceAnalysis.scala
>>>> @@ -22,7 +22,6 @@ import scalabindings._
>>>>  import RLikeOps._
>>>>  import drm._
>>>>  import RLikeDrmOps._
>>>> -import org.apache.mahout.sparkbindings._
>>>>  import scala.collection.JavaConversions._
>>>>  import org.apache.mahout.math.stats.LogLikelihood
>>>>
>>>>
>>>> and it seems to work just fine. From what I see, this should work just
>>>> fine on H2O as well with no changes. Why give up generality and make it
>>>> spark specific?
>>>>
>>>> Thanks
