On 02/03/2015 08:22 PM, Dmitriy Lyubimov wrote:
I'd suggest considering this: remember all this talk about language-integrated Spark QL being basically a dataframe manipulation DSL?

So now Spark devs are noticing this generality as well and are actually proposing to rename SchemaRDD into DataFrame and make it a mainstream data structure. (my "told you so" moment of sorts :)

What I am getting at is, I'd suggest making DRM and Spark's newly renamed DataFrame our two major structures. In particular, standardize on using DataFrame for things that may include non-numerical data and require more grace about column naming and manipulation. Maybe relevant to the TF-IDF work when it deals with non-matrix content.
Sounds like a worthy effort to me. We'd basically be implementing an API at the math-scala level for the SchemaRDD/DataFrame data structure, correct?
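For concreteness, here is a purely hypothetical sketch of what such a math-scala level handle might look like. Every name below except DrmLike is invented for illustration; nothing like this exists yet:

import org.apache.mahout.math.drm.DrmLike

// Hypothetical trait: a math-scala view over a SchemaRDD/DataFrame-like
// structure, sitting alongside the existing DrmLike for numerical data.
trait DistributedFrame {
  def columnNames: Seq[String]                  // graceful column naming
  def select(cols: String*): DistributedFrame   // column manipulation
  // vectorize a text column into a DRM, given a term dictionary
  def vectorize(col: String, dictionary: Map[String, Int]): DrmLike[Int]
}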

 On Tue, Feb 3, 2015 at 5:01 PM, Pat Ferrel <p...@occamsmachete.com> wrote:
Seems like seq2sparse would be really easy to replace since it takes text files to start with; the whole pipeline could then be kept in RDDs. The dictionaries and counts could be either in-memory maps or RDDs for use with joins? This would get rid of sequence files completely from the pipeline. Item similarity uses in-memory maps, but the plan is to make it more scalable using joins as an alternative behind the same API, allowing the user to trade off footprint for speed.

I think you're right, it should be relatively easy. I've been looking at porting seq2sparse to the DSL for a bit now, and the stopper at the DSL level is that we don't have a distributed data structure for strings. Seems like getting a DataFrame implemented as Dmitriy mentioned above would take care of this problem.
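A minimal sketch of the RDD side of such a pipeline, with made-up names, assuming plain Spark plus mahout-math and a Kryo setup that lets Mahout vectors serialize. The real seq2sparse also does n-grams, normalization, and pruning, which this skips:

import org.apache.mahout.math.{RandomAccessSparseVector, Vector}
import org.apache.spark.SparkContext._
import org.apache.spark.rdd.RDD

// docs: (docId, text) pairs read straight from text files, no sequence files
def vectorizeTf(docs: RDD[(String, String)]): (Map[String, Int], RDD[(String, Vector)]) = {
  // crude tokenizer stand-in for the Lucene analysis in the real seq2sparse
  val tokenized = docs.mapValues(_.toLowerCase.split("\\W+").filter(_.nonEmpty))

  // dictionary as an in-memory map; it could equally stay an RDD and be joined in
  val dictionary = tokenized.flatMap(_._2).distinct().collect().zipWithIndex.toMap

  val tf = tokenized.mapValues { tokens =>
    val v: Vector = new RandomAccessSparseVector(dictionary.size)
    tokens.groupBy(identity).foreach { case (term, occurrences) =>
      v.setQuick(dictionary(term), occurrences.length.toDouble)
    }
    v
  }
  (dictionary, tf)
}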

The other issue I'm a little fuzzy on is the distributed collocation mapping; it's a part of the seq2sparse code that I've not spent too much time in.

I think that this would be a very worthy effort as well; I believe seq2sparse is a particularly strong Mahout feature.

I'll start another thread since we're now way off topic from the refactoring proposal.

My use for TF-IDF is row similarity: it would take a DRM (actually an IndexedDataset) and calculate row/doc similarities. It works now, but only using LLR. This is OK when thinking of the items as tags or metadata, but for text tokens something like cosine may be better.

I'd imagine a downsampling phase using LLR that would precede TF-IDF, a lot like how CF preferences are downsampled. This would produce a sparsified all-docs DRM. Then (if the counts were saved) TF-IDF would re-weight the terms before row similarity uses cosine. This is not so good for search, but it should produce much better similarities than Solr's "moreLikeThis", and it does so for all pairs rather than one at a time.
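To make the re-weighting step concrete, a hedged sketch of plain textbook TF-IDF and cosine, not any particular Mahout API:

// plain TF-IDF: the weight each term of the sparsified all-docs DRM would
// get, using the saved document-frequency counts
def tfIdf(tf: Double, df: Double, numDocs: Double): Double =
  tf * math.log(numDocs / df)

// cosine similarity of two weighted doc vectors, as sparse term -> weight maps
def cosine(a: Map[Int, Double], b: Map[Int, Double]): Double = {
  val dot = a.collect { case (i, w) if b.contains(i) => w * b(i) }.sum
  def norm(m: Map[Int, Double]) = math.sqrt(m.values.map(w => w * w).sum)
  dot / (norm(a) * norm(b))
}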

In any case it can be used to create a personalized content-based recommender, or to augment a CF recommender with one more indicator type.

On Feb 3, 2015, at 3:37 PM, Andrew Palumbo <ap....@outlook.com> wrote:


On 02/03/2015 12:44 PM, Andrew Palumbo wrote:
On 02/03/2015 12:22 PM, Pat Ferrel wrote:
Some issues WRT lower level Spark integration:
1) interoperability with Spark data. TF-IDF is one example I actually
looked at. There may be other things we can pick up from their committers
since they have an abundance.
2) wider acceptance of the Mahout DSL. The DSL's power was illustrated to me when someone on the Spark list asked about matrix transpose and an MLlib committer's answer was something like "why would you want to do that?". Usually you don't actually execute the transpose, but they don't even support A'A, AA', or A'B, which are core to what I work on (see the sketch after this list). At present you pretty much have to choose between MLlib or Mahout for sparse matrix stuff. Maybe a half-way measure is some implicit conversions (ugh, I know). If the DSL could interchange datasets with MLlib, people would be pointed to the DSL for a whole bunch of "why would you want to do that?" features. MLlib seems to be algorithms, not math.
3) integration of streaming. DStreams support most of the RDD interface. Doing a batch recalc on a moving time window would nearly fall out of DStream-backed DRMs. This isn't the same as incremental updates on streaming, but it's a start.
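For illustration of point 2: those products are one-liners in the DSL, and the optimizer plans them without ever physically forming the transpose. A sketch, assuming a distributed context is implicitly in scope as in the Mahout spark-shell:

import org.apache.mahout.math.scalabindings._
import org.apache.mahout.math.drm._
import org.apache.mahout.math.drm.RLikeDrmOps._

val drmA = drmParallelize(dense((1, 2), (3, 4), (5, 6)), numPartitions = 2)
val drmB = drmParallelize(dense((1, 0), (0, 1), (1, 1)), numPartitions = 2)

val ata = (drmA.t %*% drmA).collect  // A'A: rewritten to a fused physical op
val aat = (drmA %*% drmA.t).collect  // AA'
val atb = (drmA.t %*% drmB).collect  // A'B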
Last year we were looking at Hadoop MapReduce vs. faster compute engines: Spark, H2O, Flink. So we jumped. Now the need is for streaming, and especially incrementally updated streaming. Seems like we need to address this.
Andrew, regardless of the above, having TF-IDF would be super helpful; row similarity for content/text would benefit greatly.
   I will put a PR up soon.
Just to clarify, I'll be porting the (very simple) TF and TFIDF classes and the Weight interface over from mr-legacy to math-scala. They're available now in spark-shell but won't be after this refactoring. These still require a dictionary and frequency count maps to vectorize incoming text, so they're more for use with the old MR seq2sparse; I don't think they can be used with Spark's HashingTF and IDF. I'll put them up soon. Hopefully they'll be of some use.
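For reference, a sketch of the shape of what's being ported. The signatures approximate the mr-legacy Weight interface from memory, and the TFIDF formula below is the generic textbook one; the legacy class delegates to a Lucene similarity, so the exact curve differs:

// sketch only: signatures approximate the mr-legacy Weight interface
trait Weight {
  def calculate(tf: Int, df: Int, length: Int, numDocs: Int): Double
}

object TF extends Weight {
  // plain term frequency; df, length and numDocs are ignored
  def calculate(tf: Int, df: Int, length: Int, numDocs: Int): Double = tf.toDouble
}

object TFIDF extends Weight {
  // generic tf * idf with log damping on the document-frequency ratio
  def calculate(tf: Int, df: Int, length: Int, numDocs: Int): Double =
    tf * math.log(numDocs.toDouble / df)
}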

On Feb 3, 2015, at 8:47 AM, Dmitriy Lyubimov <dlie...@gmail.com> wrote:
But first I need to do massive fixes and improvements to the distributed optimizer itself. Still waiting on a green light for that.
On Feb 3, 2015 8:45 AM, "Dmitriy Lyubimov" <dlie...@gmail.com> wrote:

On Feb 3, 2015 7:20 AM, "Pat Ferrel" <p...@occamsmachete.com> wrote:
BTW what level of difficulty would making the DSL run on MLlib Vectors and RowMatrix be? Looking at using their hashing TF-IDF, but it raises an impedance mismatch between DRM and MLlib RowMatrix. This would further reduce artifact size by a bunch.

Short answer: if it were possible, I'd not bother with the Mahout code base at all. The problem is it lacks sufficiently flexible semantics and abstraction. Breeze is infinitely better in that department, but at the time it was sufficiently worse at abstracting interoperability of matrices with different structures. And MLlib does not expose Breeze.
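FWIW the mechanical part of the mismatch is small; a naive element-copy bridge is a few lines (a sketch against MLlib's public types). What it loses is exactly the structural flexibility being discussed:

import org.apache.mahout.math.{DenseVector => MahoutDense, Vector => MahoutVector}
import org.apache.spark.mllib.linalg.{Vector => MLlibVector, Vectors}

// naive bridges: they forget Mahout's structural hints (sparse vs dense,
// sequential vs random access), which is exactly the flexibility at issue
def toMahout(v: MLlibVector): MahoutVector = new MahoutDense(v.toArray)
def toMLlib(v: MahoutVector): MLlibVector =
  Vectors.dense((0 until v.size).map(v.getQuick).toArray)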

Looking forward toward hardware-accelerated bolt-on work, I just must say that after reading Breeze code for some time, I still have a much clearer plan for how such backend hybridization and cost calibration might work with the current Mahout math abstractions than with Breeze. It is also more in line with my current work tasks.

Also backing something like a DRM with DStreams. Periodic model recalc with streams is maybe the first step towards truly streaming algos. Looking at DStream -> DRM conversion for A'A, A'B, and AA' in item and row similarity. Attach Kafka and get evergreen models, if not incrementally updating models.
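A rough sketch of the conversion idea, assuming the spark bindings' drmWrap over an RDD of keyed rows; the model hand-off is left as a comment:

import org.apache.mahout.math.Vector
import org.apache.mahout.math.drm.RLikeDrmOps._
import org.apache.mahout.sparkbindings._
import org.apache.spark.streaming.dstream.DStream

// periodic batch recalc on a moving window: wrap each window's keyed rows
// as a DRM and rerun the algebra, e.g. A'A for an item-similarity model
def recalcPerWindow(windowed: DStream[(Int, Vector)]): Unit =
  windowed.foreachRDD { rdd =>
    if (rdd.count() > 0) {
      val drmA = drmWrap(rdd)              // RDD[(Int, Vector)] is a DrmRdd[Int]
      val ata = (drmA.t %*% drmA).collect  // small in-core model per window
      // ... swap in the freshly recomputed "evergreen" model here
    }
  }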
On Feb 2, 2015, at 4:54 PM, Dmitriy Lyubimov <dlie...@gmail.com>
wrote:
Bottom line: compile-time dependencies are satisfied with no extra stuff from mr-legacy or its transitives. This is proven by virtue of successful compilation with no dependency on mr-legacy in the tree.

Runtime sufficiency with no extra dependencies is proven via running the shell or embedded tests (unit tests), which are successful too. This covers the embedding and shell APIs.

The issue with Guava is a typical one. If it were an issue, I wouldn't be able to compile and/or run stuff. Now, the question is what do we do if drivers want extra stuff that is not found in Spark.

It is so nice not to depend on anything extra that I am hesitant to offer anything here. Either shading or a lib with an opt-in dependency policy would suffice though, since it doesn't look like we'd have to have tons of extras for drivers.



On Sat, Jan 31, 2015 at 10:17 AM, Pat Ferrel <p...@occamsmachete.com>
wrote:
I vaguely remember there being a Guava version problem where the version had to be rolled back in one of the Hadoop modules. The math-scala IndexedDataset shouldn't care about the version.

BTW it seems pretty easy to take out the option parser and replace it with match and tuples, especially if we can extend the Scala App class. It might actually simplify things, since I can then use several case classes to hold options (scopt needed one object), which in turn takes out all those ugly casts (something like the sketch below). I'll take a look next time I'm in there.
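Presumably something along these lines; a sketch of the match-plus-case-classes idea with illustrative option names, not working driver code:

// sketch of replacing scopt: one small case class per option group,
// folded over the raw args with a match
case class IoOpts(input: String = "", output: String = "")
case class JobOpts(io: IoOpts = IoOpts(), sparse: Boolean = false)

def parse(args: List[String], opts: JobOpts = JobOpts()): JobOpts = args match {
  case "--input" :: v :: rest  => parse(rest, opts.copy(io = opts.io.copy(input = v)))
  case "--output" :: v :: rest => parse(rest, opts.copy(io = opts.io.copy(output = v)))
  case "--sparse" :: rest      => parse(rest, opts.copy(sparse = true))
  case Nil                     => opts
  case bad :: _                => sys.error(s"Unrecognized option: $bad")
}

object MyDriver extends App {  // extending scala.App, as suggested above
  val opts = parse(args.toList)
  println(opts)
}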

On Jan 30, 2015, at 4:07 PM, Dmitriy Lyubimov <dlie...@gmail.com>
wrote:
In the 'spark' module it is overridden by the Spark dependency, which as it happens comes in at the same version, so it should be fine with 1.1.x.

[INFO] --- maven-dependency-plugin:2.8:tree (default-cli) @ mahout-spark_2.10 ---
[INFO] org.apache.mahout:mahout-spark_2.10:jar:1.0-SNAPSHOT
[INFO] +- org.apache.spark:spark-core_2.10:jar:1.1.0:compile
[INFO] |  +- org.apache.hadoop:hadoop-client:jar:2.2.0:compile
[INFO] |  |  +- org.apache.hadoop:hadoop-common:jar:2.2.0:compile
[INFO] |  |  |  +- commons-cli:commons-cli:jar:1.2:compile
[INFO] |  |  |  +- org.apache.commons:commons-math:jar:2.1:compile
[INFO] |  |  |  +- commons-io:commons-io:jar:2.4:compile
[INFO] |  |  |  +- commons-logging:commons-logging:jar:1.1.3:compile
[INFO] |  |  |  +- commons-lang:commons-lang:jar:2.6:compile
[INFO] |  |  |  +- commons-configuration:commons-configuration:jar:1.6:compile
[INFO] |  |  |  |  +- commons-collections:commons-collections:jar:3.2.1:compile
[INFO] |  |  |  |  +- commons-digester:commons-digester:jar:1.8:compile
[INFO] |  |  |  |  |  \- commons-beanutils:commons-beanutils:jar:1.7.0:compile
[INFO] |  |  |  |  \- commons-beanutils:commons-beanutils-core:jar:1.8.0:compile
[INFO] |  |  |  +- org.apache.avro:avro:jar:1.7.4:compile
[INFO] |  |  |  +- com.google.protobuf:protobuf-java:jar:2.5.0:compile
[INFO] |  |  |  +- org.apache.hadoop:hadoop-auth:jar:2.2.0:compile
[INFO] |  |  |  \- org.apache.commons:commons-compress:jar:1.4.1:compile
[INFO] |  |  |     \- org.tukaani:xz:jar:1.0:compile
[INFO] |  |  +- org.apache.hadoop:hadoop-hdfs:jar:2.2.0:compile
[INFO] |  |  +- org.apache.hadoop:hadoop-mapreduce-client-app:jar:2.2.0:compile
[INFO] |  |  |  +- org.apache.hadoop:hadoop-mapreduce-client-common:jar:2.2.0:compile
[INFO] |  |  |  |  +- org.apache.hadoop:hadoop-yarn-client:jar:2.2.0:compile
[INFO] |  |  |  |  |  +- com.google.inject:guice:jar:3.0:compile
[INFO] |  |  |  |  |  |  +- javax.inject:javax.inject:jar:1:compile
[INFO] |  |  |  |  |  |  \- aopalliance:aopalliance:jar:1.0:compile
[INFO] |  |  |  |  |  +- com.sun.jersey.jersey-test-framework:jersey-test-framework-grizzly2:jar:1.9:compile
[INFO] |  |  |  |  |  |  +- com.sun.jersey.jersey-test-framework:jersey-test-framework-core:jar:1.9:compile
[INFO] |  |  |  |  |  |  |  +- javax.servlet:javax.servlet-api:jar:3.0.1:compile
[INFO] |  |  |  |  |  |  |  \- com.sun.jersey:jersey-client:jar:1.9:compile
[INFO] |  |  |  |  |  |  \- com.sun.jersey:jersey-grizzly2:jar:1.9:compile
[INFO] |  |  |  |  |  |     +- org.glassfish.grizzly:grizzly-http:jar:2.1.2:compile
[INFO] |  |  |  |  |  |     |  \- org.glassfish.grizzly:grizzly-framework:jar:2.1.2:compile
[INFO] |  |  |  |  |  |     |     \- org.glassfish.gmbal:gmbal-api-only:jar:3.0.0-b023:compile
[INFO] |  |  |  |  |  |     |        \- org.glassfish.external:management-api:jar:3.0.0-b012:compile
[INFO] |  |  |  |  |  |     +- org.glassfish.grizzly:grizzly-http-server:jar:2.1.2:compile
[INFO] |  |  |  |  |  |     |  \- org.glassfish.grizzly:grizzly-rcm:jar:2.1.2:compile
[INFO] |  |  |  |  |  |     +- org.glassfish.grizzly:grizzly-http-servlet:jar:2.1.2:compile
[INFO] |  |  |  |  |  |     \- org.glassfish:javax.servlet:jar:3.1:compile
[INFO] |  |  |  |  |  +- com.sun.jersey:jersey-server:jar:1.9:compile
[INFO] |  |  |  |  |  |  +- asm:asm:jar:3.1:compile
[INFO] |  |  |  |  |  |  \- com.sun.jersey:jersey-core:jar:1.9:compile
[INFO] |  |  |  |  |  +- com.sun.jersey:jersey-json:jar:1.9:compile
[INFO] |  |  |  |  |  |  +- org.codehaus.jettison:jettison:jar:1.1:compile
[INFO] |  |  |  |  |  |  |  \- stax:stax-api:jar:1.0.1:compile
[INFO] |  |  |  |  |  |  +- com.sun.xml.bind:jaxb-impl:jar:2.2.3-1:compile
[INFO] |  |  |  |  |  |  |  \- javax.xml.bind:jaxb-api:jar:2.2.2:compile
[INFO] |  |  |  |  |  |  |     \- javax.activation:activation:jar:1.1:compile
[INFO] |  |  |  |  |  |  +- org.codehaus.jackson:jackson-jaxrs:jar:1.8.3:compile
[INFO] |  |  |  |  |  |  \- org.codehaus.jackson:jackson-xc:jar:1.8.3:compile
[INFO] |  |  |  |  |  \- com.sun.jersey.contribs:jersey-guice:jar:1.9:compile
[INFO] |  |  |  |  \- org.apache.hadoop:hadoop-yarn-server-common:jar:2.2.0:compile
[INFO] |  |  |  \- org.apache.hadoop:hadoop-mapreduce-client-shuffle:jar:2.2.0:compile
[INFO] |  |  +- org.apache.hadoop:hadoop-yarn-api:jar:2.2.0:compile
[INFO] |  |  +- org.apache.hadoop:hadoop-mapreduce-client-core:jar:2.2.0:compile
[INFO] |  |  |  \- org.apache.hadoop:hadoop-yarn-common:jar:2.2.0:compile
[INFO] |  |  +- org.apache.hadoop:hadoop-mapreduce-client-jobclient:jar:2.2.0:compile
[INFO] |  |  \- org.apache.hadoop:hadoop-annotations:jar:2.2.0:compile
[INFO] |  +- net.java.dev.jets3t:jets3t:jar:0.7.1:compile
[INFO] |  |  +- commons-codec:commons-codec:jar:1.3:compile
[INFO] |  |  \- commons-httpclient:commons-httpclient:jar:3.1:compile
[INFO] |  +- org.apache.curator:curator-recipes:jar:2.4.0:compile
[INFO] |  |  +- org.apache.curator:curator-framework:jar:2.4.0:compile
[INFO] |  |  |  \- org.apache.curator:curator-client:jar:2.4.0:compile
[INFO] |  |  \- org.apache.zookeeper:zookeeper:jar:3.4.5:compile
[INFO] |  |     \- jline:jline:jar:0.9.94:compile
[INFO] |  +- org.eclipse.jetty:jetty-plus:jar:8.1.14.v20131031:compile
[INFO] |  |  +- org.eclipse.jetty.orbit:javax.transaction:jar:1.1.1.v201105210645:compile
[INFO] |  |  +- org.eclipse.jetty:jetty-webapp:jar:8.1.14.v20131031:compile
[INFO] |  |  |  +- org.eclipse.jetty:jetty-xml:jar:8.1.14.v20131031:compile
[INFO] |  |  |  \- org.eclipse.jetty:jetty-servlet:jar:8.1.14.v20131031:compile
[INFO] |  |  \- org.eclipse.jetty:jetty-jndi:jar:8.1.14.v20131031:compile
[INFO] |  |     \- org.eclipse.jetty.orbit:javax.mail.glassfish:jar:1.4.1.v201005082020:compile
[INFO] |  |        \- org.eclipse.jetty.orbit:javax.activation:jar:1.1.0.v201105071233:compile
[INFO] |  +- org.eclipse.jetty:jetty-security:jar:8.1.14.v20131031:compile
[INFO] |  +- org.eclipse.jetty:jetty-util:jar:8.1.14.v20131031:compile
[INFO] |  +- org.eclipse.jetty:jetty-server:jar:8.1.14.v20131031:compile
[INFO] |  |  +- org.eclipse.jetty.orbit:javax.servlet:jar:3.0.0.v201112011016:compile
[INFO] |  |  +- org.eclipse.jetty:jetty-continuation:jar:8.1.14.v20131031:compile
[INFO] |  |  \- org.eclipse.jetty:jetty-http:jar:8.1.14.v20131031:compile
[INFO] |  |     \- org.eclipse.jetty:jetty-io:jar:8.1.14.v20131031:compile
[INFO] |  +- com.google.guava:guava:jar:16.0:compile

On Fri, Jan 30, 2015 at 4:03 PM, Dmitriy Lyubimov <dlie...@gmail.com> wrote:

Looks like it is also requested by mahout-math; wonder what is using it there.

At the very least, it needs to be synchronized to the one currently used by Spark.

[INFO] --- maven-dependency-plugin:2.8:tree (default-cli) @ mahout-hadoop ---
[INFO] org.apache.mahout:mahout-hadoop:jar:1.0-SNAPSHOT
*[INFO] +- org.apache.mahout:mahout-math:jar:1.0-SNAPSHOT:compile*
[INFO] |  +- org.apache.commons:commons-math3:jar:3.2:compile
*[INFO] |  +- com.google.guava:guava:jar:16.0:compile*
[INFO] |  \- com.tdunning:t-digest:jar:2.0.2:compile
[INFO] +- org.apache.mahout:mahout-math:test-jar:tests:1.0-SNAPSHOT:test
[INFO] +- org.apache.hadoop:hadoop-client:jar:2.2.0:compile
[INFO] |  +- org.apache.hadoop:hadoop-common:jar:2.2.0:compile


On Fri, Jan 30, 2015 at 7:52 AM, Pat Ferrel <p...@occamsmachete.com>
wrote:
Looks like Guava is in Spark.

On Jan 29, 2015, at 4:03 PM, Pat Ferrel <p...@occamsmachete.com>
wrote:
IndexedDataset uses Guava. Can't tell for sure, but it sounds like this would not be included, since I think it was taken from the mr-legacy jar.
On Jan 25, 2015, at 10:52 AM, Dmitriy Lyubimov <dlie...@gmail.com>
wrote:
---------- Forwarded message ----------
From: "Pat Ferrel" <p...@occamsmachete.com>
Date: Jan 25, 2015 9:39 AM
Subject: Re: Codebase refactoring proposal
To: <dev@mahout.apache.org>
Cc:

When you get a chance, a PR would be good.
Yes, it would. And not just for that.

As I understand it you are putting some class jars somewhere in the classpath. Where? How?
/bin/mahout

(Computes 2 different classpaths. See 'bin/mahout classpath' vs. 'bin/mahout -spark'.)

If I interpret the current shell code there correctly, the legacy path tries to use the examples assemblies if not packaged, or /lib if packaged. The true motivation of that significantly predates 2010, and I suspect only Benson knows the whole true intent there.

The spark path, which is really a quick hack of the script, tries to get only selected Mahout jars plus the locally installed Spark classpath, which I guess is just the shaded Spark jar in recent Spark releases. It also apparently tries to include /libs/*, which is never compiled in the unpackaged version, and now I think it is a bug that it is included, because /libs/* is apparently legacy packaging and shouldn't be used in Spark jobs with a wildcard. I can't believe how lazy I am, I still did not find time to understand the Mahout build in all cases.

I am not even sure if packaged Mahout will work with Spark, honestly, because of the /lib. Never tried that, since I mostly use application embedding techniques.

The same solution may apply to adding external dependencies and removing the assembly in the Spark module, which would leave only one major build issue AFAIK.
On Jan 24, 2015, at 11:53 PM, Dmitriy Lyubimov <dlie...@gmail.com> wrote:
No, no PR. Only a private experiment. But I believe I sufficiently defined what I want to do in order to gauge whether we may want to advance it some time later. The goal is a much lighter dependency footprint for the Spark code: eliminate everything that is not a compile-time dependency (and a lot of it comes through legacy MR code, which we of course don't use).

Can't say I understand the remaining issues you are talking about, though. If you are talking about compiling a lib or shaded assembly, no, this doesn't do anything about that. Although the point is, as it stands, the algebra and shell don't have any external dependencies but Spark and these 4 (5?) Mahout jars, so they technically don't even need an assembly (as demonstrated).

As I said, it seems driver code is the only one that may need some external dependencies, but that's a different scenario from those I am talking about. And I am relatively happy with having the first two working nicely at this point.

On Sat, Jan 24, 2015 at 9:06 AM, Pat Ferrel <p...@occamsmachete.com> wrote:
+1

Is there a PR? You mention a "tiny mahout-hadoop" module. It would be nice to see how you've structured that in case we can use the same model to solve the two remaining refactoring issues:
1) external dependencies in the spark module
2) no spark or h2o in the release artifacts.

On Jan 23, 2015, at 6:45 PM, Shannon Quinn <squ...@gatech.edu>
wrote:
Also +1

iPhone'd

On Jan 23, 2015, at 18:38, Andrew Palumbo <ap....@outlook.com>
wrote:
+1


Sent from my Verizon Wireless 4G LTE smartphone

-------- Original message --------
From: Dmitriy Lyubimov <dlie...@gmail.com>
Date: 01/23/2015 6:06 PM (GMT-05:00)
To: dev@mahout.apache.org
Subject: Codebase refactoring proposal
So right now mahout-spark depends on mr-legacy. I did a quick refactoring, and it turns out it only _irrevocably_ depends on the following classes there:

MatrixWritable, VectorWritable, Varint/Varlong and VarintWritable, and ... *sigh* ... o.a.m.common.Pair

So I just dropped those five classes into a new tiny mahout-hadoop module (to signify stuff that is directly relevant to serializing things to the DFS API) and completely removed mr-legacy and its transitives from the spark and spark-shell dependencies.
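To illustrate what "directly relevant to serializing things to the DFS API" means, a minimal sketch using one of those five classes with the Hadoop 2.x writer API; the path and values are made up:

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.hadoop.io.{IntWritable, SequenceFile}
import org.apache.mahout.math.{DenseVector, VectorWritable}

// write keyed vectors to a sequence file: the kind of DFS serialization
// the tiny mahout-hadoop module exists for
def writeRows(path: String): Unit = {
  val writer = SequenceFile.createWriter(new Configuration(),
    SequenceFile.Writer.file(new Path(path)),
    SequenceFile.Writer.keyClass(classOf[IntWritable]),
    SequenceFile.Writer.valueClass(classOf[VectorWritable]))
  try writer.append(new IntWritable(0),
    new VectorWritable(new DenseVector(Array(1.0, 2.0, 3.0))))
  finally writer.close()
}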

So non-CLI applications (shell scripts and embedded API use) actually only need the Spark dependencies (which come from the SPARK_HOME classpath, of course) and the Mahout jars: mahout-spark, mahout-math(-scala), mahout-hadoop, and optionally mahout-spark-shell (for running the shell).

This of course still doesn't address the drivers that want to throw more stuff onto the front-end classpath (such as a CLI parser), but at least it renders the transitive luggage of mr-legacy (and the size of worker-shipped jars) much more tolerable.

How does that sound?



