Re: Codebase refactoring proposal

Dmitriy Lyubimov Tue, 03 Feb 2015 08:48:12 -0800

But first I need to do massive fixes and improvements to the distributed
optimizer itself. Still waiting on green light for that.
On Feb 3, 2015 8:45 AM, "Dmitriy Lyubimov" <dlie...@gmail.com> wrote:


>
> On Feb 3, 2015 7:20 AM, "Pat Ferrel" <p...@occamsmachete.com> wrote:
> >
> > BTW what level of difficulty would making the DSL run on MLlib Vectors
> > and RowMatrix be? Looking at using their hashing TF-IDF, but it raises an
> > impedance mismatch between DRM and MLlib RowMatrix. This would further
> > reduce artifact size by a bunch.
>
> Short answer: if it were possible, I'd not bother with the Mahout code base at
> all. The problem is that MLlib lacks sufficient flexibility, semantics, and
> abstraction. Breeze is infinitely better in that department, but at the
> time it was sufficiently worse at abstracting interoperability of matrices
> with different structures. And MLlib does not expose Breeze.
>
> Looking forward to hardware-accelerated bolt-on work, I must say that
> after reading Breeze code for some time I still have a much clearer plan for how
> such back hybridization and cost calibration might work with the current Mahout
> math abstractions than with Breeze. It is also more in line with my current
> work tasks.
>
> >
> > Also backing something like a DRM with DStreams. Periodic model recalc
> > with streams is maybe the first step towards truly streaming algos. Looking
> > at DStream -> DRM conversion for A’A, A’B, and AA’ in item and row
> > similarity. Attach Kafka and get evergreen models, if not incrementally
> > updating models.
> >
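One property that makes periodic recalculation from streams tractable for A’A: since A’A is the sum of the outer products of the rows of A, the A’A of several stacked batches equals the sum of the per-batch A’A matrices. Below is a minimal plain-Scala sketch of that idea. GramSketch and its methods are hypothetical names used only for illustration, and the code works on in-memory Vectors rather than on DRM, DStream, or any Mahout or Spark API.

```scala
object GramSketch {
  type Matrix = Vector[Vector[Double]]

  // A'A for one batch of rows: entry (i, j) sums r(i) * r(j) over the rows r.
  def gram(batch: Seq[Vector[Double]], n: Int): Matrix =
    Vector.tabulate(n, n)((i, j) => batch.map(r => r(i) * r(j)).sum)

  // Element-wise sum of two matrices of the same shape.
  def add(x: Matrix, y: Matrix): Matrix =
    x.zip(y).map { case (rx, ry) => rx.zip(ry).map { case (a, b) => a + b } }

  // Folds batches into a running A'A without revisiting earlier batches.
  def accumulate(batches: Seq[Seq[Vector[Double]]], n: Int): Matrix =
    batches.foldLeft(Vector.fill(n, n)(0.0))((acc, b) => add(acc, gram(b, n)))
}
```

Under these assumptions, a periodic recalculation only needs the running n x n matrix plus the newest batch. A’B accumulates the same way when both inputs arrive with matching rows, whereas AA’ grows with the number of rows and does not reduce to a fixed-size running sum.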
> > On Feb 2, 2015, at 4:54 PM, Dmitriy Lyubimov <dlie...@gmail.com> wrote:
> >
> > Bottom line: compile-time dependencies are satisfied with no extra stuff
> > from mr-legacy or its transitives. This is proven by virtue of successful
> > compilation with no dependency on mr-legacy in the tree.
> >
> > Runtime sufficiency with no extra dependency is proven by running the shell
> > or embedded tests (unit tests), which are successful too. This covers the
> > embedding and shell APIs.
> >
> > The Guava issue is a typical one. If it were an issue, I wouldn't be able to
> > compile and/or run stuff. Now, the question is what we do if drivers want
> > extra stuff that is not found in Spark.
> >
> > It is so nice not to depend on anything extra that I am hesitant to
> > offer anything here. Either shading or a lib with an opt-in dependency policy
> > would suffice, though, since it doesn't look like we'd need tons of
> > extras for drivers.
> >
> >
> >
> > > On Sat, Jan 31, 2015 at 10:17 AM, Pat Ferrel <p...@occamsmachete.com> wrote:
> > >
> > > > I vaguely remember there being a Guava version problem where the version
> > > > had to be rolled back in one of the hadoop modules. The math-scala
> > > > IndexedDataset shouldn’t care about version.
> > > >
> > > > BTW it seems pretty easy to take out the option parser and replace it with
> > > > match and tuples, especially if we can extend the Scala App class. It might
> > > > actually simplify things since I can then use several case classes to hold
> > > > options (scopt needed one object), which in turn takes out all those ugly
> > > > casts. I’ll take a look next time I’m in there.
> > >
> > > > On Jan 30, 2015, at 4:07 PM, Dmitriy Lyubimov <dlie...@gmail.com> wrote:
> > > >
> > > > In the 'spark' module it is overridden by the Spark dependency, which
> > > > happens to come at the same version, so it should be fine with 1.1.x.
> > >
> > > > [INFO] --- maven-dependency-plugin:2.8:tree (default-cli) @ mahout-spark_2.10 ---
> > > > [INFO] org.apache.mahout:mahout-spark_2.10:jar:1.0-SNAPSHOT
> > > > [INFO] +- org.apache.spark:spark-core_2.10:jar:1.1.0:compile
> > > > [INFO] |  +- org.apache.hadoop:hadoop-client:jar:2.2.0:compile
> > > > [INFO] |  |  +- org.apache.hadoop:hadoop-common:jar:2.2.0:compile
> > > > [INFO] |  |  |  +- commons-cli:commons-cli:jar:1.2:compile
> > > > [INFO] |  |  |  +- org.apache.commons:commons-math:jar:2.1:compile
> > > > [INFO] |  |  |  +- commons-io:commons-io:jar:2.4:compile
> > > > [INFO] |  |  |  +- commons-logging:commons-logging:jar:1.1.3:compile
> > > > [INFO] |  |  |  +- commons-lang:commons-lang:jar:2.6:compile
> > > > [INFO] |  |  |  +- commons-configuration:commons-configuration:jar:1.6:compile
> > > > [INFO] |  |  |  |  +- commons-collections:commons-collections:jar:3.2.1:compile
> > > > [INFO] |  |  |  |  +- commons-digester:commons-digester:jar:1.8:compile
> > > > [INFO] |  |  |  |  |  \- commons-beanutils:commons-beanutils:jar:1.7.0:compile
> > > > [INFO] |  |  |  |  \- commons-beanutils:commons-beanutils-core:jar:1.8.0:compile
> > > > [INFO] |  |  |  +- org.apache.avro:avro:jar:1.7.4:compile
> > > > [INFO] |  |  |  +- com.google.protobuf:protobuf-java:jar:2.5.0:compile
> > > > [INFO] |  |  |  +- org.apache.hadoop:hadoop-auth:jar:2.2.0:compile
> > > > [INFO] |  |  |  \- org.apache.commons:commons-compress:jar:1.4.1:compile
> > > > [INFO] |  |  |     \- org.tukaani:xz:jar:1.0:compile
> > > > [INFO] |  |  +- org.apache.hadoop:hadoop-hdfs:jar:2.2.0:compile
> > > > [INFO] |  |  +- org.apache.hadoop:hadoop-mapreduce-client-app:jar:2.2.0:compile
> > > > [INFO] |  |  |  +- org.apache.hadoop:hadoop-mapreduce-client-common:jar:2.2.0:compile
> > > > [INFO] |  |  |  |  +- org.apache.hadoop:hadoop-yarn-client:jar:2.2.0:compile
> > > > [INFO] |  |  |  |  |  +- com.google.inject:guice:jar:3.0:compile
> > > > [INFO] |  |  |  |  |  |  +- javax.inject:javax.inject:jar:1:compile
> > > > [INFO] |  |  |  |  |  |  \- aopalliance:aopalliance:jar:1.0:compile
> > > > [INFO] |  |  |  |  |  +- com.sun.jersey.jersey-test-framework:jersey-test-framework-grizzly2:jar:1.9:compile
> > > > [INFO] |  |  |  |  |  |  +- com.sun.jersey.jersey-test-framework:jersey-test-framework-core:jar:1.9:compile
> > > > [INFO] |  |  |  |  |  |  |  +- javax.servlet:javax.servlet-api:jar:3.0.1:compile
> > > > [INFO] |  |  |  |  |  |  |  \- com.sun.jersey:jersey-client:jar:1.9:compile
> > > > [INFO] |  |  |  |  |  |  \- com.sun.jersey:jersey-grizzly2:jar:1.9:compile
> > > > [INFO] |  |  |  |  |  |     +- org.glassfish.grizzly:grizzly-http:jar:2.1.2:compile
> > > > [INFO] |  |  |  |  |  |     |  \- org.glassfish.grizzly:grizzly-framework:jar:2.1.2:compile
> > > > [INFO] |  |  |  |  |  |     |     \- org.glassfish.gmbal:gmbal-api-only:jar:3.0.0-b023:compile
> > > > [INFO] |  |  |  |  |  |     |        \- org.glassfish.external:management-api:jar:3.0.0-b012:compile
> > > > [INFO] |  |  |  |  |  |     +- org.glassfish.grizzly:grizzly-http-server:jar:2.1.2:compile
> > > > [INFO] |  |  |  |  |  |     |  \- org.glassfish.grizzly:grizzly-rcm:jar:2.1.2:compile
> > > > [INFO] |  |  |  |  |  |     +- org.glassfish.grizzly:grizzly-http-servlet:jar:2.1.2:compile
> > > > [INFO] |  |  |  |  |  |     \- org.glassfish:javax.servlet:jar:3.1:compile
> > > > [INFO] |  |  |  |  |  +- com.sun.jersey:jersey-server:jar:1.9:compile
> > > > [INFO] |  |  |  |  |  |  +- asm:asm:jar:3.1:compile
> > > > [INFO] |  |  |  |  |  |  \- com.sun.jersey:jersey-core:jar:1.9:compile
> > > > [INFO] |  |  |  |  |  +- com.sun.jersey:jersey-json:jar:1.9:compile
> > > > [INFO] |  |  |  |  |  |  +- org.codehaus.jettison:jettison:jar:1.1:compile
> > > > [INFO] |  |  |  |  |  |  |  \- stax:stax-api:jar:1.0.1:compile
> > > > [INFO] |  |  |  |  |  |  +- com.sun.xml.bind:jaxb-impl:jar:2.2.3-1:compile
> > > > [INFO] |  |  |  |  |  |  |  \- javax.xml.bind:jaxb-api:jar:2.2.2:compile
> > > > [INFO] |  |  |  |  |  |  |     \- javax.activation:activation:jar:1.1:compile
> > > > [INFO] |  |  |  |  |  |  +- org.codehaus.jackson:jackson-jaxrs:jar:1.8.3:compile
> > > > [INFO] |  |  |  |  |  |  \- org.codehaus.jackson:jackson-xc:jar:1.8.3:compile
> > > > [INFO] |  |  |  |  |  \- com.sun.jersey.contribs:jersey-guice:jar:1.9:compile
> > > > [INFO] |  |  |  |  \- org.apache.hadoop:hadoop-yarn-server-common:jar:2.2.0:compile
> > > > [INFO] |  |  |  \- org.apache.hadoop:hadoop-mapreduce-client-shuffle:jar:2.2.0:compile
> > > > [INFO] |  |  +- org.apache.hadoop:hadoop-yarn-api:jar:2.2.0:compile
> > > > [INFO] |  |  +- org.apache.hadoop:hadoop-mapreduce-client-core:jar:2.2.0:compile
> > > > [INFO] |  |  |  \- org.apache.hadoop:hadoop-yarn-common:jar:2.2.0:compile
> > > > [INFO] |  |  +- org.apache.hadoop:hadoop-mapreduce-client-jobclient:jar:2.2.0:compile
> > > > [INFO] |  |  \- org.apache.hadoop:hadoop-annotations:jar:2.2.0:compile
> > > > [INFO] |  +- net.java.dev.jets3t:jets3t:jar:0.7.1:compile
> > > > [INFO] |  |  +- commons-codec:commons-codec:jar:1.3:compile
> > > > [INFO] |  |  \- commons-httpclient:commons-httpclient:jar:3.1:compile
> > > > [INFO] |  +- org.apache.curator:curator-recipes:jar:2.4.0:compile
> > > > [INFO] |  |  +- org.apache.curator:curator-framework:jar:2.4.0:compile
> > > > [INFO] |  |  |  \- org.apache.curator:curator-client:jar:2.4.0:compile
> > > > [INFO] |  |  \- org.apache.zookeeper:zookeeper:jar:3.4.5:compile
> > > > [INFO] |  |     \- jline:jline:jar:0.9.94:compile
> > > > [INFO] |  +- org.eclipse.jetty:jetty-plus:jar:8.1.14.v20131031:compile
> > > > [INFO] |  |  +- org.eclipse.jetty.orbit:javax.transaction:jar:1.1.1.v201105210645:compile
> > > > [INFO] |  |  +- org.eclipse.jetty:jetty-webapp:jar:8.1.14.v20131031:compile
> > > > [INFO] |  |  |  +- org.eclipse.jetty:jetty-xml:jar:8.1.14.v20131031:compile
> > > > [INFO] |  |  |  \- org.eclipse.jetty:jetty-servlet:jar:8.1.14.v20131031:compile
> > > > [INFO] |  |  \- org.eclipse.jetty:jetty-jndi:jar:8.1.14.v20131031:compile
> > > > [INFO] |  |     \- org.eclipse.jetty.orbit:javax.mail.glassfish:jar:1.4.1.v201005082020:compile
> > > > [INFO] |  |        \- org.eclipse.jetty.orbit:javax.activation:jar:1.1.0.v201105071233:compile
> > > > [INFO] |  +- org.eclipse.jetty:jetty-security:jar:8.1.14.v20131031:compile
> > > > [INFO] |  +- org.eclipse.jetty:jetty-util:jar:8.1.14.v20131031:compile
> > > > [INFO] |  +- org.eclipse.jetty:jetty-server:jar:8.1.14.v20131031:compile
> > > > [INFO] |  |  +- org.eclipse.jetty.orbit:javax.servlet:jar:3.0.0.v201112011016:compile
> > > > [INFO] |  |  +- org.eclipse.jetty:jetty-continuation:jar:8.1.14.v20131031:compile
> > > > [INFO] |  |  \- org.eclipse.jetty:jetty-http:jar:8.1.14.v20131031:compile
> > > > [INFO] |  |     \- org.eclipse.jetty:jetty-io:jar:8.1.14.v20131031:compile
> > > > [INFO] |  +- com.google.guava:guava:jar:16.0:compile
> > > d
> > >
> > > > On Fri, Jan 30, 2015 at 4:03 PM, Dmitriy Lyubimov <dlie...@gmail.com> wrote:
> > > >
> > > >> Looks like it is also requested by mahout-math; I wonder what is using it
> > > >> there.
> > > >>
> > > >> At the very least, it needs to be synchronized with the one currently used by
> > > >> Spark.
> > >>
> > > >> [INFO] --- maven-dependency-plugin:2.8:tree (default-cli) @ mahout-hadoop ---
> > > >> [INFO] org.apache.mahout:mahout-hadoop:jar:1.0-SNAPSHOT
> > > >> *[INFO] +- org.apache.mahout:mahout-math:jar:1.0-SNAPSHOT:compile*
> > > >> [INFO] |  +- org.apache.commons:commons-math3:jar:3.2:compile
> > > >> *[INFO] |  +- com.google.guava:guava:jar:16.0:compile*
> > > >> [INFO] |  \- com.tdunning:t-digest:jar:2.0.2:compile
> > > >> [INFO] +- org.apache.mahout:mahout-math:test-jar:tests:1.0-SNAPSHOT:test
> > > >> [INFO] +- org.apache.hadoop:hadoop-client:jar:2.2.0:compile
> > > >> [INFO] |  +- org.apache.hadoop:hadoop-common:jar:2.2.0:compile
> > >>
> > >>
> > > >> On Fri, Jan 30, 2015 at 7:52 AM, Pat Ferrel <p...@occamsmachete.com> wrote:
> > > >>
> > > >>> Looks like Guava is in Spark.
> > > >>>
> > > >>> On Jan 29, 2015, at 4:03 PM, Pat Ferrel <p...@occamsmachete.com> wrote:
> > > >>>
> > > >>> IndexedDataset uses Guava. Can’t tell for sure, but it sounds like this
> > > >>> would not be included since I think it was taken from the mrlegacy jar.
> > >>>
> > > >>> On Jan 25, 2015, at 10:52 AM, Dmitriy Lyubimov <dlie...@gmail.com> wrote:
> > >>>
> > >>> ---------- Forwarded message ----------
> > >>> From: "Pat Ferrel" <p...@occamsmachete.com>
> > >>> Date: Jan 25, 2015 9:39 AM
> > >>> Subject: Re: Codebase refactoring proposal
> > >>> To: <dev@mahout.apache.org>
> > >>> Cc:
> > >>>
> > >>>> When you get a chance a PR would be good.
> > >>>
> > >>> Yes, it would. And not just for that.
> > >>>
> > > >>>> As I understand it you are putting some class jars somewhere in the
> > > >>>> classpath. Where? How?
> > >>>>
> > >>>
> > >>> /bin/mahout
> > >>>
> > >>> (Computes 2 different classpaths. See  'bin/mahout classpath' vs.
> > >>> 'bin/mahout -spark'.)
> > >>>
> > > >>> If I interpret the current shell code there correctly, the legacy path tries to
> > > >>> use examples assemblies if not packaged, or /lib if packaged. The true motivation
> > > >>> for that significantly predates 2010, and I suspect only Benson knows the whole
> > > >>> true intent there.
> > > >>>
> > > >>> The Spark path, which is really a quick hack of the script, tries to get
> > > >>> only selected Mahout jars and the locally installed Spark classpath, which I
> > > >>> guess is just the shaded Spark jar in recent Spark releases. It also
> > > >>> apparently tries to include /libs/*, which is never compiled in the unpackaged
> > > >>> version, and now I think it is a bug that it is included, because /libs/* is
> > > >>> apparently legacy packaging and shouldn't be used in Spark jobs with a
> > > >>> wildcard. I can't believe how lazy I am; I still have not found time to
> > > >>> understand the Mahout build in all cases.
> > > >>>
> > > >>> I am not even sure if packaged Mahout will work with Spark, honestly,
> > > >>> because of the /lib. Never tried that, since I mostly use application
> > > >>> embedding techniques.
> > >>>
> > > >>> The same solution may apply to adding external dependencies and removing
> > > >>> the assembly in the Spark module. Which would leave only one major build
> > > >>> issue afaik.
> > >>>>
> > > >>>> On Jan 24, 2015, at 11:53 PM, Dmitriy Lyubimov <dlie...@gmail.com> wrote:
> > > >>>>
> > > >>>> No, no PR. Only experiment on private. But I believe I sufficiently defined
> > > >>>> what I want to do in order to gauge if we may want to advance it some time
> > > >>>> later. The goal is a much lighter dependency set for Spark code: eliminate
> > > >>>> everything that is not a compile-time dependency (and a lot of it comes
> > > >>>> through legacy MR code, which we of course don't use).
> > > >>>>
> > > >>>> Can't say I understand the remaining issues you are talking about, though.
> > > >>>>
> > > >>>> If you are talking about compiling a lib or shaded assembly, no, this doesn't
> > > >>>> do anything about it. The point, though, is that as it stands, the algebra and
> > > >>>> shell don't have any external dependencies but Spark and these 4 (5?)
> > > >>>> Mahout jars, so they technically don't even need an assembly (as
> > > >>>> demonstrated).
> > > >>>>
> > > >>>> As I said, it seems driver code is the only one that may need some external
> > > >>>> dependencies, but that's a different scenario from those I am talking
> > > >>>> about. But I am relatively happy with having the first two working nicely
> > > >>>> at this point.
> > >>>>
> > > >>>> On Sat, Jan 24, 2015 at 9:06 AM, Pat Ferrel <p...@occamsmachete.com> wrote:
> > > >>>>
> > > >>>>> +1
> > > >>>>>
> > > >>>>> Is there a PR? You mention a “tiny mahout-hadoop” module. It would be nice
> > > >>>>> to see how you’ve structured that in case we can use the same model to
> > > >>>>> solve the two remaining refactoring issues.
> > > >>>>> 1) external dependencies in the spark module
> > > >>>>> 2) no spark or h2o in the release artifacts.
> > >>>>>
> > > >>>>> On Jan 23, 2015, at 6:45 PM, Shannon Quinn <squ...@gatech.edu> wrote:
> > >>>>>
> > >>>>> Also +1
> > >>>>>
> > >>>>> iPhone'd
> > >>>>>
> > > >>>>>> On Jan 23, 2015, at 18:38, Andrew Palumbo <ap....@outlook.com> wrote:
> > > >>>>>>
> > > >>>>>> +1
> > > >>>>>>
> > > >>>>>>
> > > >>>>>> Sent from my Verizon Wireless 4G LTE smartphone
> > > >>>>>>
> > > >>>>>> -------- Original message --------
> > > >>>>>> From: Dmitriy Lyubimov <dlie...@gmail.com>
> > > >>>>>> Date: 01/23/2015 6:06 PM (GMT-05:00)
> > > >>>>>> To: dev@mahout.apache.org
> > > >>>>>> Subject: Codebase refactoring proposal
> > > >>>>>>
> > > >>>>>> So right now mahout-spark depends on mr-legacy.
> > > >>>>>> I did a quick refactoring and it turns out it only _irrevocably_ depends on
> > > >>>>>> the following classes there:
> > > >>>>>>
> > > >>>>>> MatrixWritable, VectorWritable, Varint/Varlong and VarintWritable, and ...
> > > >>>>>> *sigh* o.a.m.common.Pair
> > > >>>>>>
> > > >>>>>> So I just dropped those five classes into a new tiny mahout-hadoop
> > > >>>>>> module (to signify stuff that is directly relevant to serializing things to
> > > >>>>>> the DFS API) and completely removed mrlegacy and its transitives from spark and
> > > >>>>>> spark-shell dependencies.
> > > >>>>>>
> > > >>>>>> So non-CLI applications (shell scripts and embedded API use) actually only
> > > >>>>>> need Spark dependencies (which come from the SPARK_HOME classpath, of course)
> > > >>>>>> and Mahout jars (mahout-spark, mahout-math(-scala), mahout-hadoop and
> > > >>>>>> optionally mahout-spark-shell (for running the shell)).
> > > >>>>>>
> > > >>>>>> This of course still doesn't address the problem of drivers that want to throw
> > > >>>>>> more stuff into the front-end classpath (such as a CLI parser), but at least it
> > > >>>>>> renders the transitive luggage of mr-legacy (and the size of worker-shipped
> > > >>>>>> jars) much more tolerable.
> > > >>>>>>
> > > >>>>>> How does that sound?
> > >>>>>
> > >>>>>
> > >>>>
> > >>>
> > >>>
> > >>>
> > >>
> > >
> > >
> >
>
