Are all those classes really needed for scala/spark? Seems like we should prune out non-dependencies if possible before we start changing the code. There are probably a lot of things that could be used with Mahout-Samsara but aren't explicitly in it. Do we lose much by moving those to another module?
On May 19, 2015, at 11:24 AM, Andrew Musselman <[email protected]> wrote:

Might not be terrible, I didn't look too hard but there are 97 instances of
"com.google.common" in mahout-math and 4 in mahout-hdfs.

On Tue, May 19, 2015 at 11:17 AM, Dmitriy Lyubimov <[email protected]> wrote:

> PS assuming we clean mahout-math and scala modules -- this should be fairly
> easy. Maybe there's some stuff in the colt classes but there shouldn't be a
> lot?
>
> On Tue, May 19, 2015 at 11:16 AM, Dmitriy Lyubimov <[email protected]> wrote:
>
>> can't we just declare its own guava for mahout-mr? Or inherit it from
>> wherever it is declared in the hadoop we depend on there?
>>
>> On Tue, May 19, 2015 at 9:24 AM, Pat Ferrel <[email protected]> wrote:
>>
>>> I was hoping someone knew the differences. Andrew and I are feeling our
>>> way along since we haven't used either to any extent.
>>>
>>> On May 19, 2015, at 9:17 AM, Suneel Marthi <[email protected]> wrote:
>>>
>>> Ok, see your point if it's only for Mahout-Math and Mahout-hdfs. Not sure
>>> if it's just a straight replacement of Preconditions -> Asserts though.
>>> Preconditions throw an exception if some condition is not satisfied; Java
>>> asserts are never meant to be used in production code.
>>>
>>> So the right fix would be to replace all references to Preconditions with
>>> some exception-handling boilerplate.
>>>
>>> On Tue, May 19, 2015 at 11:58 AM, Pat Ferrel <[email protected]> wrote:
>>>
>>>> We only have to worry about mahout-math and mahout-hdfs.
>>>>
>>>> Yes, Andrew was working on those; they were replaced with plain Java
>>>> asserts.
>>>>
>>>> There still remain the uses you mention in those two modules, but I see
>>>> no good alternative to hacking them out. Maybe we can move some code out
>>>> to mahout-mr if it's easier.
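[Editor's note: the "exception handling boilerplate" Suneel suggests above as a replacement for Guava's Preconditions might look roughly like the following sketch. The `Checks` class name is made up for illustration, not taken from the Mahout tree; the point is that a plain if/throw keeps the production-time check, whereas Java asserts are disabled by default at runtime.]

```java
// Minimal stand-ins for Guava's Preconditions (hypothetical helper class,
// not actual Mahout code) -- same runtime behavior, no com.google.common.
final class Checks {
  private Checks() {}

  // Replaces Preconditions.checkArgument(condition, message):
  // throws IllegalArgumentException when the condition fails.
  static void checkArgument(boolean condition, String message) {
    if (!condition) {
      throw new IllegalArgumentException(message);
    }
  }

  // Replaces Preconditions.checkNotNull(ref, message):
  // throws NullPointerException on null, otherwise returns the reference.
  static <T> T checkNotNull(T ref, String message) {
    if (ref == null) {
      throw new NullPointerException(message);
    }
    return ref;
  }
}
```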
>>>>
>>>> On May 19, 2015, at 8:48 AM, Suneel Marthi <[email protected]> wrote:
>>>>
>>>> I had tried minimizing the Guava dependency to a large extent in the run
>>>> up to 0.10.0. It's not as trivial as it seems; there are parts of the
>>>> code (Collocations, lucene2seq, Lucene TokenStream processing and
>>>> tokenization code) that are heavily reliant on AbstractIterator, and
>>>> there are sections of the code that assign a HashSet to a List (again,
>>>> one has to use Guava for that if one wants to avoid writing boilerplate
>>>> for doing the same).
>>>>
>>>> Moreover, things that return something like Iterable<?> and need to be
>>>> converted into a regular collection can easily be done using Guava
>>>> without writing your own boilerplate again.
>>>>
>>>> Are we replacing all Preconditions with straight asserts now?
>>>>
>>>> On Tue, May 19, 2015 at 11:21 AM, Pat Ferrel <[email protected]> wrote:
>>>>
>>>>> We need to move to Spark 1.3 asap and set the stage for beyond 1.3. The
>>>>> primary reason is that the big distros are there already or will be
>>>>> very soon. Many people using Mahout will have the environment they must
>>>>> use dictated by support orgs in their companies, so our current
>>>>> position of running only on Spark 1.1.1 means many potential users are
>>>>> out of luck.
>>>>>
>>>>> Here are the problems I know of in moving Mahout ahead on Spark:
>>>>> 1) Guava in any backend code (executor closures) relies on being
>>>>> serialized with JavaSerializer, which is broken and hasn't been fixed
>>>>> in 1.2+. There is a workaround, which involves moving a Guava jar to
>>>>> all Spark workers, which is unacceptable in many cases. Guava in the
>>>>> Spark-1.2 PR has been removed from Scala code and will be pushed to
>>>>> master, probably this week. That leaves a bunch of uses of Guava in
>>>>> the Java math and hdfs modules. Andrew has (I think) removed the
>>>>> Preconditions and replaced them with asserts.
>>>>> But there remain some uses of Map and AbstractIterator from Guava. Not
>>>>> sure how many of these remain, but if anyone can help please check
>>>>> here: https://issues.apache.org/jira/browse/MAHOUT-1708
>>>>> 2) The Mahout shell relies on APIs not available in Spark 1.3.
>>>>> 3) The API for writing to sequence files now requires implicit values
>>>>> that are not available in the current code. I think Andy did a temp fix
>>>>> to write to object files, but this is probably not what we want to
>>>>> release.
>>>>>
>>>>> I for one would dearly love to see Mahout 0.10.1 support Spark 1.3+,
>>>>> and soon. This is a call for help in cleaning these things up. Even
>>>>> with no new features the above things would make Mahout much more
>>>>> usable in current environments.
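[Editor's note: the plain-JDK replacements for the remaining Guava uses discussed in this thread (AbstractIterator subclasses, and Iterable-to-collection conversion) might look roughly like the sketch below. The class names and the even-number filter are invented for illustration only; the actual Mahout iterators do different filtering.]

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

// A plain java.util.Iterator with the same buffering pattern a Guava
// AbstractIterator subclass provides: compute the next element lazily,
// here yielding only even numbers from a backing iterator.
final class EvenIterator implements Iterator<Integer> {
  private final Iterator<Integer> source;
  private Integer buffered;  // next matching element, or null if none ready

  EvenIterator(Iterator<Integer> source) {
    this.source = source;
  }

  @Override
  public boolean hasNext() {
    while (buffered == null && source.hasNext()) {
      Integer candidate = source.next();
      if (candidate % 2 == 0) {
        buffered = candidate;
      }
    }
    return buffered != null;
  }

  @Override
  public Integer next() {
    if (!hasNext()) {
      throw new NoSuchElementException();
    }
    Integer result = buffered;
    buffered = null;
    return result;
  }
}

final class Iterables2 {
  private Iterables2() {}

  // Replaces Guava's Lists.newArrayList(iterable) with a plain loop.
  static <T> List<T> toList(Iterable<T> iterable) {
    List<T> list = new ArrayList<>();
    for (T t : iterable) {
      list.add(t);
    }
    return list;
  }
}
```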
