Are all those classes really needed for scala/spark? Seems like we should prune out non-dependencies if possible before we start changing the code. There are probably a lot of things that could be used with Mahout-Samsara but aren't explicitly in it. Do we lose much by moving those to another module?
On May 19, 2015, at 11:24 AM, Andrew Musselman <[email protected]> wrote:

Might not be terrible, I didn't look too hard but there are 97 instances of
"com.google.common" in mahout-math and 4 in mahout-hdfs.

On Tue, May 19, 2015 at 11:17 AM, Dmitriy Lyubimov <[email protected]> wrote:

> PS assuming we clean mahout-math and scala modules -- this should be fairly
> easy. Maybe there's some stuff in the colt classes but there shouldn't be a
> lot?
>
> On Tue, May 19, 2015 at 11:16 AM, Dmitriy Lyubimov <[email protected]> wrote:
>
>> can't we just declare its own guava for mahout-mr? Or inherit it from
>> wherever it is declared in the hadoop we depend on there?
>>
>> On Tue, May 19, 2015 at 9:24 AM, Pat Ferrel <[email protected]> wrote:
>>
>>> I was hoping someone knew the differences. Andrew and I are feeling our
>>> way along since we haven't used either to any extent.
>>>
>>> On May 19, 2015, at 9:17 AM, Suneel Marthi <[email protected]> wrote:
>>>
>>> Ok, see your point if it's only for Mahout-Math and Mahout-hdfs. Not sure
>>> if it's just a straight replacement of Preconditions -> Asserts though.
>>> Preconditions throw an exception if some condition is not satisfied; Java
>>> asserts are never meant to be used in production code.
>>>
>>> So the right fix would be to replace all references to Preconditions with
>>> some exception-handling boilerplate.
>>>
>>> On Tue, May 19, 2015 at 11:58 AM, Pat Ferrel <[email protected]> wrote:
>>>
>>>> We only have to worry about mahout-math and mahout-hdfs.
>>>>
>>>> Yes, Andrew was working on those; they were replaced with plain Java
>>>> asserts.
>>>>
>>>> There still remain the uses you mention in those two modules, but I see
>>>> no good alternative to hacking them out. Maybe we can move some code out
>>>> to mahout-mr if it's easier.
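[Editor's note: the "exception handling boilerplate" Suneel suggests above as a replacement for Guava's Preconditions might look roughly like the following sketch. The `Checks` class name is made up for illustration, not taken from the Mahout tree; the point is that a plain if/throw keeps the production-time check, whereas Java asserts are disabled by default at runtime.]

```java
// Minimal stand-ins for Guava's Preconditions (hypothetical helper class,
// not actual Mahout code) -- same runtime behavior, no com.google.common.
final class Checks {
  private Checks() {}

  // Replaces Preconditions.checkArgument(condition, message):
  // throws IllegalArgumentException when the condition fails.
  static void checkArgument(boolean condition, String message) {
    if (!condition) {
      throw new IllegalArgumentException(message);
    }
  }

  // Replaces Preconditions.checkNotNull(ref, message):
  // throws NullPointerException on null, otherwise returns the reference.
  static <T> T checkNotNull(T ref, String message) {
    if (ref == null) {
      throw new NullPointerException(message);
    }
    return ref;
  }
}
```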
>>>>
>>>> On May 19, 2015, at 8:48 AM, Suneel Marthi <[email protected]> wrote:
>>>>
>>>> I had tried minimizing the Guava dependency to a large extent in the run
>>>> up to 0.10.0. It's not as trivial as it seems; there are parts of the
>>>> code (Collocations, lucene2seq, Lucene TokenStream processing and
>>>> tokenization code) that are heavily reliant on AbstractIterator, and
>>>> there are sections of the code that assign a HashSet to a List (again,
>>>> one has to use Guava for that if one wants to avoid writing boilerplate
>>>> for doing the same).
>>>>
>>>> Moreover, things that return something like Iterable<?> and need to be
>>>> converted into a regular collection can easily be done using Guava
>>>> without writing your own boilerplate again.
>>>>
>>>> Are we replacing all Preconditions with straight asserts now?
>>>>
>>>> On Tue, May 19, 2015 at 11:21 AM, Pat Ferrel <[email protected]> wrote:
>>>>
>>>>> We need to move to Spark 1.3 asap and set the stage for beyond 1.3. The
>>>>> primary reason is that the big distros are there already or will be
>>>>> very soon. Many people using Mahout will have the environment they must
>>>>> use dictated by support orgs in their companies, so our current
>>>>> position of running only on Spark 1.1.1 means many potential users are
>>>>> out of luck.
>>>>>
>>>>> Here are the problems I know of in moving Mahout ahead on Spark:
>>>>> 1) Guava in any backend code (executor closures) relies on being
>>>>> serialized with JavaSerializer, which is broken and hasn't been fixed
>>>>> in 1.2+. There is a workaround, which involves moving a Guava jar to
>>>>> all Spark workers, which is unacceptable in many cases. Guava in the
>>>>> Spark-1.2 PR has been removed from Scala code and will be pushed to
>>>>> master, probably this week. That leaves a bunch of uses of Guava in
>>>>> the Java math and hdfs modules. Andrew has (I think) removed the
>>>>> Preconditions and replaced them with asserts.
>>>>> But there remain some uses of Map and AbstractIterator from Guava. Not
>>>>> sure how many of these remain, but if anyone can help please check
>>>>> here: https://issues.apache.org/jira/browse/MAHOUT-1708
>>>>> 2) The Mahout shell relies on APIs not available in Spark 1.3.
>>>>> 3) The API for writing to sequence files now requires implicit values
>>>>> that are not available in the current code. I think Andy did a temp fix
>>>>> to write to object files, but this is probably not what we want to
>>>>> release.
>>>>>
>>>>> I for one would dearly love to see Mahout 0.10.1 support Spark 1.3+,
>>>>> and soon. This is a call for help in cleaning these things up. Even
>>>>> with no new features the above things would make Mahout much more
>>>>> usable in current environments.
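[Editor's note: the plain-JDK replacements for the remaining Guava uses discussed in this thread (AbstractIterator subclasses, and Iterable-to-collection conversion) might look roughly like the sketch below. The class names and the even-number filter are invented for illustration only; the actual Mahout iterators do different filtering.]

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

// A plain java.util.Iterator with the same buffering pattern a Guava
// AbstractIterator subclass provides: compute the next element lazily,
// here yielding only even numbers from a backing iterator.
final class EvenIterator implements Iterator<Integer> {
  private final Iterator<Integer> source;
  private Integer buffered;  // next matching element, or null if none ready

  EvenIterator(Iterator<Integer> source) {
    this.source = source;
  }

  @Override
  public boolean hasNext() {
    while (buffered == null && source.hasNext()) {
      Integer candidate = source.next();
      if (candidate % 2 == 0) {
        buffered = candidate;
      }
    }
    return buffered != null;
  }

  @Override
  public Integer next() {
    if (!hasNext()) {
      throw new NoSuchElementException();
    }
    Integer result = buffered;
    buffered = null;
    return result;
  }
}

final class Iterables2 {
  private Iterables2() {}

  // Replaces Guava's Lists.newArrayList(iterable) with a plain loop.
  static <T> List<T> toList(Iterable<T> iterable) {
    List<T> list = new ArrayList<>();
    for (T t : iterable) {
      list.add(t);
    }
    return list;
  }
}
```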
