btw good seq2sparse and seqdirectory ports are the only things that separate us from having a bigram/trigram-based LSA tutorial.
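A bigram/trigram ("shingle") pass of the kind such a tutorial would need can be sketched in plain Scala; the object and method names here are illustrative, not an actual Mahout API:

```scala
// Hedged sketch: n-gram generation over an in-memory token list, joining
// adjacent tokens with '_' the way Lucene's ShingleFilter does. A real
// seq2sparse port would run this per document inside an RDD map.
object NGrams {
  // Emit all n-grams of sizes 1..maxN from a token sequence.
  def ngrams(tokens: Seq[String], maxN: Int): Seq[String] =
    (1 to maxN).flatMap { n =>
      tokens.sliding(n).filter(_.size == n).map(_.mkString("_"))
    }
}
```

For example, `NGrams.ngrams(Seq("latent", "semantic", "analysis"), 2)` yields the unigrams plus `latent_semantic` and `semantic_analysis`.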
On Wed, Feb 4, 2015 at 10:35 AM, Dmitriy Lyubimov <dlie...@gmail.com> wrote: > i think they are debating the details now, not the idea. Like how "NA" is > different from "null" in classic dataframe representation etc. > > On Wed, Feb 4, 2015 at 8:18 AM, Suneel Marthi <suneel.mar...@gmail.com> > wrote: > >> I believe they r still debating about renaming SchemaRDD -> Data Frame. I >> must admit Dmitriy had suggested this to me a few months ago reusing >> SchemaRDD if possible. Dmitriy was right "U told us". >> >> On Wed, Feb 4, 2015 at 11:09 AM, Pat Ferrel <p...@occamsmachete.com> >> wrote: >> >> > This sounds like a great idea but I wonder if we can get rid of Mahout >> DRM >> > as a native format. If we have DataFrames (have they actually renamed >> > SchemaRDD?) backed DRMs we ideally don’t need Mahout native DRMs or >> > IndexedDatasets, right? This would be a huge step! If we get data >> > interchangeability with MLlib it's a win. If we get general row and >> column >> > IDs that follow the data through math, it's a win. Need to think through >> how >> > to use a DataFrame in a streaming case, probably through some >> checkpointing >> > of the window DStream—hmm. >> > >> > On Feb 4, 2015, at 7:37 AM, Andrew Palumbo <ap....@outlook.com> wrote: >> > >> > >> > On 02/03/2015 08:22 PM, Dmitriy Lyubimov wrote: >> > > I'd suggest to consider this: remember all this talk about >> > > language-integrated spark ql being basically dataframe manipulation >> DSL? >> > > >> > > so now Spark devs are noticing this generality as well and are >> actually >> > > proposing to rename SchemaRDD into DataFrame and make it a mainstream >> data >> > > structure. (my "told you so" moment of sorts :) >> > > >> > > What i am getting at, i'd suggest to make DRM and Spark's newly >> renamed >> > > DataFrame our two major structures.
In particular, standardize on >> using >> > > DataFrame for things that may include non-numerical data and require >> more >> > > grace about column naming and manipulation. Maybe relevant to TF-IDF >> work >> > > when it deals with non-matrix content. >> > Sounds like a worthy effort to me. We'd be basically implementing an >> API >> > at the math-scala level for SchemaRDD/Dataframe datastructures correct? >> > >> > On Tue, Feb 3, 2015 at 5:01 PM, Pat Ferrel <p...@occamsmachete.com> >> wrote: >> > >> Seems like seq2sparse would be really easy to replace since it takes >> > text >> > >> files to start with, then the whole pipeline could be kept in rdds. >> The >> > >> dictionaries and counts could be either in-memory maps or rdds for >> use >> > with >> > >> joins? This would get rid of sequence files completely from the >> > pipeline. >> > >> Item similarity uses in-memory maps but the plan is to make it more >> > >> scalable using joins as an alternative with the same API allowing the >> > user >> > >> to trade-off footprint for speed. >> > >> > I think you're right- should be relatively easy. I've been looking at >> > porting seq2sparse to the DSL for bit now and the stopper at the DSL >> level >> > is that we don't have a distributed data structure for strings..Seems >> like >> > getting a DataFrame implemented as Dmitriy mentioned above would take >> care >> > of this problem. >> > >> > The other issue i'm a little fuzzy on is the distributed collocation >> > mapping- it's a part of the seq2sparse code that I've not spent too >> much >> > time in. >> > >> > I think that this would be very worthy effort as well- I believe >> > seq2sparse is a particular strong mahout feature. >> > >> > I'll start another thread since we're now way off topic from the >> > refactoring proposal. >> > >> >> > >> My use for TF-IDF is for row similarity and would take a DRM >> (actually >> > >> IndexedDataset) and calculate row/doc similarities. 
It works now but >> > only >> >> using LLR. This is OK when thinking of the items as tags or metadata but >> > >> for text tokens something like cosine may be better. >> > >> >> > >> I’d imagine a downsampling phase that would precede TF-IDF using LLR >> a >> > lot >> > >> like how CF preferences are downsampled. This would produce a >> > sparsified >> > >> all-docs DRM. Then (if the counts were saved) TF-IDF would re-weight >> the >> > >> terms before row similarity uses cosine. This is not so good for >> search >> > but >> > >> should produce much better similarities than Solr’s “moreLikeThis” >> and >> > does >> > >> it for all pairs rather than one at a time. >> > >> >> > >> In any case it can be used to create a personalized >> content-based >> > >> recommender or augment a CF recommender with one more indicator type. >> > >> >> > >> On Feb 3, 2015, at 3:37 PM, Andrew Palumbo <ap....@outlook.com> >> wrote: >> > >> >> > >> >> > >> On 02/03/2015 12:44 PM, Andrew Palumbo wrote: >> > >>> On 02/03/2015 12:22 PM, Pat Ferrel wrote: >> > >>>> Some issues WRT lower level Spark integration: >> > >>>> 1) interoperability with Spark data. TF-IDF is one example I >> actually >> > >> looked at. There may be other things we can pick up from their >> > committers >> > >> since they have an abundance. >> > >>>> 2) wider acceptance of Mahout DSL. The DSL’s power was illustrated >> to >> > >> me when someone on the Spark list asked about matrix transpose and an >> > MLlib >> > >> committer’s answer was something like “why would you want to do >> that?”. >> > >> Usually you don’t actually execute the transpose but they don’t even >> > >> support A’A, AA’, or A’B, which are core to what I work on. At >> present >> > you >> > >> pretty much have to choose between MLlib or Mahout for sparse matrix >> > stuff. >> > >> Maybe a half-way measure is some implicit conversions (ugh, I know).
>> If >> > the >> > >> DSL could interchange datasets with MLlib, people would be pointed to >> > the >> > >> DSL for all of a bunch of “why would you want to do that?” features. >> > MLlib >> > >> seems to be algorithms, not math. >> > >>>> 3) integration of Streaming. DStreams support most of the RDD >> > >> interface. Doing a batch recalc on a moving time window would nearly >> > fall >> > >> out of DStream backed DRMs. This isn’t the same as incremental >> updates >> > on >> > >> streaming but it’s a start. >> > >>>> Last year we were looking at Hadoop Mapreduce vs Spark, H2O, Flink >> > >> faster compute engines. So we jumped. Now the need is for streaming >> and >> > >> especially incrementally updated streaming. Seems like we need to >> > address >> > >> this. >> > >>>> Andrew, regardless of the above having TF-IDF would be super >> > >> helpful—row similarity for content/text would benefit greatly. >> > >>> I will put a PR up soon. >> > >> Just to clarify, I'll be porting over the (very simple) TF, TFIDF >> > classes >> > >> and Weight interface over from mr-legacy to math-scala. They're >> > available >> > >> now in spark-shell but won't be after this refactoring. These still >> > >> require dictionary and a frequency count maps to vectorize incoming >> > text- >> > >> so they're more for use with the old MR seq2sparse and I don't think >> > they >> > >> can be used with Spark's HashingTF and IDF. I'll put them up soon. >> > >> Hopefully they'll be of some use. >> > >> >> > >> On Feb 3, 2015, at 8:47 AM, Dmitriy Lyubimov <dlie...@gmail.com> >> wrote: >> > >>>> But first I need to do massive fixes and improvements to the >> > distributed >> > >>>> optimizer itself. Still waiting on green light for that. 
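The TF/TF-IDF weighting Andrew describes porting (a dictionary plus document-frequency maps used to vectorize incoming text) can be sketched roughly as below. This is illustrative plain Scala, not the ported classes; the exact idf formula in the Lucene-derived code differs slightly:

```scala
// Hedged sketch of dictionary-map-based TF-IDF re-weighting. A term-frequency
// vector is keyed by dictionary index; dfs maps index -> document frequency.
object Tfidf {
  // Classic tf * log(N / df) weighting (Lucene-style variants differ).
  def weight(tf: Double, df: Long, numDocs: Long): Double =
    tf * math.log(numDocs.toDouble / df)

  // Re-weight a whole sparse term-frequency vector.
  def weightVector(tfs: Map[Int, Double], dfs: Map[Int, Long],
                   numDocs: Long): Map[Int, Double] =
    tfs.map { case (idx, tf) => idx -> weight(tf, dfs(idx), numDocs) }
}
```

Note a term appearing in every document (df == numDocs) gets weight 0, which is exactly the sparsifying effect wanted before a cosine row-similarity pass.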
>> > >>>> On Feb 3, 2015 8:45 AM, "Dmitriy Lyubimov" <dlie...@gmail.com> >> wrote: >> > >>>> >> > >>>>> On Feb 3, 2015 7:20 AM, "Pat Ferrel" <p...@occamsmachete.com> >> wrote: >> > >>>>>> BTW what level of difficulty would making the DSL run on MLlib >> > Vectors >> > >>>>> and RowMatrix be? Looking at using their hashing TF-IDF but it raises >> > >>>>> an impedance mismatch between DRM and MLlib RowMatrix. This would >> > further >> > >>>>> reduce artifact size by a bunch. >> > >>>>> >> > >>>>> Short answer, if it were possible, I'd not bother with Mahout code >> > >> base at >> > >>>>> all. The problem is it lacks sufficient flexibility in semantics and >> > >>>>> abstraction. Breeze is infinitely better in that department but >> at >> > >> the >> > >>>>> time it was sufficiently worse on abstracting interoperability of >> > >> matrices >> > >>>>> with different structures. And mllib does not expose breeze. >> > >>>>> >> > >>>>> Looking forward toward hardware accelerated bolt-on work I just >> must >> > >> say >> > >>>>> after reading breeze code for some time I still have a much clearer >> > plan >> > >> how >> > >>>>> such back hybridization and cost calibration might work with >> current >> > >> Mahout >> > >>>>> math abstractions than with breeze. It is also more in line with >> my >> > >> current >> > >>>>> work tasks. >> > >>>>> >> > >>>>>> Also backing something like a DRM with DStreams. Periodic model >> > recalc >> > >>>>> with streams is maybe the first step towards truly streaming >> algos. >> > >> Looking >> > >>>>> at DStream -> DRM conversion for A’A, A’B, and AA’ in item and row >> > >>>>> similarity. Attach Kafka and get evergreen models, if not >> > incrementally >> > >>>>> updating models. >> > >>>>>> On Feb 2, 2015, at 4:54 PM, Dmitriy Lyubimov <dlie...@gmail.com> >> > >> wrote: >> > >>>>>> bottom line compile-time dependencies are satisfied with no extra >> > >> stuff >> > >>>>>> from mr-legacy or its transitives.
This is proven by virtue of >> > >>>>> successful >> > >>>>>> compilation with no dependency on mr-legacy on the tree. >> > >>>>>> >> > >>>>>> Runtime sufficiency for no extra dependency is proven via running >> > >> shell >> > >>>>> or >> > >>>>>> embedded tests (unit tests) which are successful too. This >> implies >> > >>>>>> embedding and shell apis. >> > >>>>>> >> > >>>>>> Issue with guava is typical one. if it were an issue, i wouldn't >> be >> > >> able >> > >>>>> to >> > >>>>>> compile and/or run stuff. Now, question is what do we do if >> drivers >> > >> want >> > >>>>>> extra stuff that is not found in Spark. >> > >>>>>> >> > >>>>>> Now, It is so nice not to depend on anything extra so i am >> hesitant >> > to >> > >>>>>> offer anything here. either shading or lib with opt-in >> dependency >> > >> policy >> > >>>>>> would suffice though, since it doesn't look like we'd have to >> have >> > >> tons >> > >>>>> of >> > >>>>>> extra for drivers. >> > >>>>>> >> > >>>>>> >> > >>>>>> >> > >>>>>> On Sat, Jan 31, 2015 at 10:17 AM, Pat Ferrel < >> p...@occamsmachete.com >> > > >> > >>>>> wrote: >> > >>>>>>> I vaguely remember there being a Guava version problem where the >> > >>>>> version >> > >>>>>>> had to be rolled back in one of the hadoop modules. The >> math-scala >> > >>>>>>> IndexedDataset shouldn’t care about version. >> > >>>>>>> >> > >>>>>>> BTW It seems pretty easy to take out the option parser and >> replace >> > >> with >> > >>>>>>> match and tuples especially if we can extend the Scala App >> class. >> > It >> > >>>>> might >> > >>>>>>> actually simplify things since I can then use several case >> classes >> > to >> > >>>>> hold >> > >>>>>>> options (scopt needed one object), which in turn takes out all >> > those >> > >>>>> ugly >> > >>>>>>> casts. I’ll take a look next time I’m in there. 
>> > >>>>>>> >> > >>>>>>> On Jan 30, 2015, at 4:07 PM, Dmitriy Lyubimov < >> dlie...@gmail.com> >> > >>>>> wrote: >> > >>>>>>> in 'spark' module it is overwritten with spark dependency, which >> > also >> > >>>>> comes >> > >>>>>>> at the same version so happens. so should be fine with 1.1.x >> > >>>>>>> >> > >>>>>>> [INFO] --- maven-dependency-plugin:2.8:tree (default-cli) @ >> > >>>>>>> mahout-spark_2.10 --- >> > >>>>>>> [INFO] org.apache.mahout:mahout-spark_2.10:jar:1.0-SNAPSHOT >> > >>>>>>> [INFO] +- org.apache.spark:spark-core_2.10:jar:1.1.0:compile >> > >>>>>>> [INFO] | +- org.apache.hadoop:hadoop-client:jar:2.2.0:compile >> > >>>>>>> [INFO] | | +- >> org.apache.hadoop:hadoop-common:jar:2.2.0:compile >> > >>>>>>> [INFO] | | | +- commons-cli:commons-cli:jar:1.2:compile >> > >>>>>>> [INFO] | | | +- >> org.apache.commons:commons-math:jar:2.1:compile >> > >>>>>>> [INFO] | | | +- commons-io:commons-io:jar:2.4:compile >> > >>>>>>> [INFO] | | | +- >> > commons-logging:commons-logging:jar:1.1.3:compile >> > >>>>>>> [INFO] | | | +- commons-lang:commons-lang:jar:2.6:compile >> > >>>>>>> [INFO] | | | +- >> > >>>>>>> commons-configuration:commons-configuration:jar:1.6:compile >> > >>>>>>> [INFO] | | | | +- >> > >>>>>>> commons-collections:commons-collections:jar:3.2.1:compile >> > >>>>>>> [INFO] | | | | +- >> > >> commons-digester:commons-digester:jar:1.8:compile >> > >>>>>>> [INFO] | | | | | \- >> > >>>>>>> commons-beanutils:commons-beanutils:jar:1.7.0:compile >> > >>>>>>> [INFO] | | | | \- >> > >>>>>>> commons-beanutils:commons-beanutils-core:jar:1.8.0:compile >> > >>>>>>> [INFO] | | | +- org.apache.avro:avro:jar:1.7.4:compile >> > >>>>>>> [INFO] | | | +- >> > >> com.google.protobuf:protobuf-java:jar:2.5.0:compile >> > >>>>>>> [INFO] | | | +- >> org.apache.hadoop:hadoop-auth:jar:2.2.0:compile >> > >>>>>>> [INFO] | | | \- >> > >>>>> org.apache.commons:commons-compress:jar:1.4.1:compile >> > >>>>>>> [INFO] | | | \- org.tukaani:xz:jar:1.0:compile >> > >>>>>>> [INFO] | | 
+- org.apache.hadoop:hadoop-hdfs:jar:2.2.0:compile >> > >>>>>>> [INFO] | | +- >> > >>>>>>> org.apache.hadoop:hadoop-mapreduce-client-app:jar:2.2.0:compile >> > >>>>>>> [INFO] | | | +- >> > >>>>>>> >> org.apache.hadoop:hadoop-mapreduce-client-common:jar:2.2.0:compile >> > >>>>>>> [INFO] | | | | +- >> > >>>>>>> org.apache.hadoop:hadoop-yarn-client:jar:2.2.0:compile >> > >>>>>>> [INFO] | | | | | +- com.google.inject:guice:jar:3.0:compile >> > >>>>>>> [INFO] | | | | | | +- >> javax.inject:javax.inject:jar:1:compile >> > >>>>>>> [INFO] | | | | | | \- >> aopalliance:aopalliance:jar:1.0:compile >> > >>>>>>> [INFO] | | | | | +- >> > >>>>>>> >> > >>>>>>> >> > >> >> > >> com.sun.jersey.jersey-test-framework:jersey-test-framework-grizzly2:jar:1.9:compile >> > >>>>>>> [INFO] | | | | | | +- >> > >>>>>>> >> > >>>>>>> >> > >> >> > >> com.sun.jersey.jersey-test-framework:jersey-test-framework-core:jar:1.9:compile >> > >>>>>>> [INFO] | | | | | | | +- >> > >>>>>>> javax.servlet:javax.servlet-api:jar:3.0.1:compile >> > >>>>>>> [INFO] | | | | | | | \- >> > >>>>> com.sun.jersey:jersey-client:jar:1.9:compile >> > >>>>>>> [INFO] | | | | | | \- >> > >>>>> com.sun.jersey:jersey-grizzly2:jar:1.9:compile >> > >>>>>>> [INFO] | | | | | | +- >> > >>>>>>> org.glassfish.grizzly:grizzly-http:jar:2.1.2:compile >> > >>>>>>> [INFO] | | | | | | | \- >> > >>>>>>> org.glassfish.grizzly:grizzly-framework:jar:2.1.2:compile >> > >>>>>>> [INFO] | | | | | | | \- >> > >>>>>>> org.glassfish.gmbal:gmbal-api-only:jar:3.0.0-b023:compile >> > >>>>>>> [INFO] | | | | | | | \- >> > >>>>>>> org.glassfish.external:management-api:jar:3.0.0-b012:compile >> > >>>>>>> [INFO] | | | | | | +- >> > >>>>>>> org.glassfish.grizzly:grizzly-http-server:jar:2.1.2:compile >> > >>>>>>> [INFO] | | | | | | | \- >> > >>>>>>> org.glassfish.grizzly:grizzly-rcm:jar:2.1.2:compile >> > >>>>>>> [INFO] | | | | | | +- >> > >>>>>>> org.glassfish.grizzly:grizzly-http-servlet:jar:2.1.2:compile >> > >>>>>>> [INFO] | | | | | | \- >> > >>>>> 
org.glassfish:javax.servlet:jar:3.1:compile >> > >>>>>>> [INFO] | | | | | +- >> > com.sun.jersey:jersey-server:jar:1.9:compile >> > >>>>>>> [INFO] | | | | | | +- asm:asm:jar:3.1:compile >> > >>>>>>> [INFO] | | | | | | \- >> > >> com.sun.jersey:jersey-core:jar:1.9:compile >> > >>>>>>> [INFO] | | | | | +- >> com.sun.jersey:jersey-json:jar:1.9:compile >> > >>>>>>> [INFO] | | | | | | +- >> > >>>>> org.codehaus.jettison:jettison:jar:1.1:compile >> > >>>>>>> [INFO] | | | | | | | \- stax:stax-api:jar:1.0.1:compile >> > >>>>>>> [INFO] | | | | | | +- >> > >>>>> com.sun.xml.bind:jaxb-impl:jar:2.2.3-1:compile >> > >>>>>>> [INFO] | | | | | | | \- >> > >>>>> javax.xml.bind:jaxb-api:jar:2.2.2:compile >> > >>>>>>> [INFO] | | | | | | | \- >> > >>>>>>> javax.activation:activation:jar:1.1:compile >> > >>>>>>> [INFO] | | | | | | +- >> > >>>>>>> org.codehaus.jackson:jackson-jaxrs:jar:1.8.3:compile >> > >>>>>>> [INFO] | | | | | | \- >> > >>>>>>> org.codehaus.jackson:jackson-xc:jar:1.8.3:compile >> > >>>>>>> [INFO] | | | | | \- >> > >>>>>>> com.sun.jersey.contribs:jersey-guice:jar:1.9:compile >> > >>>>>>> [INFO] | | | | \- >> > >>>>>>> org.apache.hadoop:hadoop-yarn-server-common:jar:2.2.0:compile >> > >>>>>>> [INFO] | | | \- >> > >>>>>>> >> org.apache.hadoop:hadoop-mapreduce-client-shuffle:jar:2.2.0:compile >> > >>>>>>> [INFO] | | +- >> org.apache.hadoop:hadoop-yarn-api:jar:2.2.0:compile >> > >>>>>>> [INFO] | | +- >> > >>>>>>> org.apache.hadoop:hadoop-mapreduce-client-core:jar:2.2.0:compile >> > >>>>>>> [INFO] | | | \- >> > >>>>> org.apache.hadoop:hadoop-yarn-common:jar:2.2.0:compile >> > >>>>>>> [INFO] | | +- >> > >>>>>>> >> > org.apache.hadoop:hadoop-mapreduce-client-jobclient:jar:2.2.0:compile >> > >>>>>>> [INFO] | | \- >> > >> org.apache.hadoop:hadoop-annotations:jar:2.2.0:compile >> > >>>>>>> [INFO] | +- net.java.dev.jets3t:jets3t:jar:0.7.1:compile >> > >>>>>>> [INFO] | | +- commons-codec:commons-codec:jar:1.3:compile >> > >>>>>>> [INFO] | | \- >> > 
commons-httpclient:commons-httpclient:jar:3.1:compile >> > >>>>>>> [INFO] | +- >> org.apache.curator:curator-recipes:jar:2.4.0:compile >> > >>>>>>> [INFO] | | +- >> > >> org.apache.curator:curator-framework:jar:2.4.0:compile >> > >>>>>>> [INFO] | | | \- >> > >> org.apache.curator:curator-client:jar:2.4.0:compile >> > >>>>>>> [INFO] | | \- org.apache.zookeeper:zookeeper:jar:3.4.5:compile >> > >>>>>>> [INFO] | | \- jline:jline:jar:0.9.94:compile >> > >>>>>>> [INFO] | +- >> > >> org.eclipse.jetty:jetty-plus:jar:8.1.14.v20131031:compile >> > >>>>>>> [INFO] | | +- >> > >>>>>>> >> > >> >> > >> org.eclipse.jetty.orbit:javax.transaction:jar:1.1.1.v201105210645:compile >> > >>>>>>> [INFO] | | +- >> > >>>>> org.eclipse.jetty:jetty-webapp:jar:8.1.14.v20131031:compile >> > >>>>>>> [INFO] | | | +- >> > >>>>> org.eclipse.jetty:jetty-xml:jar:8.1.14.v20131031:compile >> > >>>>>>> [INFO] | | | \- >> > >>>>>>> org.eclipse.jetty:jetty-servlet:jar:8.1.14.v20131031:compile >> > >>>>>>> [INFO] | | \- >> > >>>>> org.eclipse.jetty:jetty-jndi:jar:8.1.14.v20131031:compile >> > >>>>>>> [INFO] | | \- >> > >>>>>>> >> > >>>>>>> >> > >> >> > >> org.eclipse.jetty.orbit:javax.mail.glassfish:jar:1.4.1.v201005082020:compile >> > >>>>>>> [INFO] | | \- >> > >>>>>>> >> > >> >> org.eclipse.jetty.orbit:javax.activation:jar:1.1.0.v201105071233:compile >> > >>>>>>> [INFO] | +- >> > >>>>> org.eclipse.jetty:jetty-security:jar:8.1.14.v20131031:compile >> > >>>>>>> [INFO] | +- >> > >> org.eclipse.jetty:jetty-util:jar:8.1.14.v20131031:compile >> > >>>>>>> [INFO] | +- >> > >>>>> org.eclipse.jetty:jetty-server:jar:8.1.14.v20131031:compile >> > >>>>>>> [INFO] | | +- >> > >>>>>>> >> > org.eclipse.jetty.orbit:javax.servlet:jar:3.0.0.v201112011016:compile >> > >>>>>>> [INFO] | | +- >> > >>>>>>> >> org.eclipse.jetty:jetty-continuation:jar:8.1.14.v20131031:compile >> > >>>>>>> [INFO] | | \- >> > >>>>> org.eclipse.jetty:jetty-http:jar:8.1.14.v20131031:compile >> > >>>>>>> [INFO] | | \- >> > >>>>> 
org.eclipse.jetty:jetty-io:jar:8.1.14.v20131031:compile >> > >>>>>>> [INFO] | +- com.google.guava:guava:jar:16.0:compile >> > >>>>>>> d >> > >>>>>>> >> > >>>>>>> On Fri, Jan 30, 2015 at 4:03 PM, Dmitriy Lyubimov < >> > dlie...@gmail.com >> > >>>>>>> wrote: >> > >>>>>>> >> > >>>>>>>> looks like it is also requested by mahout-math, wonder what is >> > using >> > >>>>> it >> > >>>>>>>> there. >> > >>>>>>>> >> > >>>>>>>> At very least, it needs to be synchronized to the one currently >> > used >> > >>>>> by >> > >>>>>>>> spark. >> > >>>>>>>> >> > >>>>>>>> [INFO] --- maven-dependency-plugin:2.8:tree (default-cli) @ >> > >>>>> mahout-hadoop >> > >>>>>>>> --- >> > >>>>>>>> [INFO] org.apache.mahout:mahout-hadoop:jar:1.0-SNAPSHOT >> > >>>>>>>> *[INFO] +- >> org.apache.mahout:mahout-math:jar:1.0-SNAPSHOT:compile* >> > >>>>>>>> [INFO] | +- org.apache.commons:commons-math3:jar:3.2:compile >> > >>>>>>>> *[INFO] | +- com.google.guava:guava:jar:16.0:compile* >> > >>>>>>>> [INFO] | \- com.tdunning:t-digest:jar:2.0.2:compile >> > >>>>>>>> [INFO] +- >> > >>>>> org.apache.mahout:mahout-math:test-jar:tests:1.0-SNAPSHOT:test >> > >>>>>>>> [INFO] +- org.apache.hadoop:hadoop-client:jar:2.2.0:compile >> > >>>>>>>> [INFO] | +- org.apache.hadoop:hadoop-common:jar:2.2.0:compile >> > >>>>>>>> >> > >>>>>>>> >> > >>>>>>>> On Fri, Jan 30, 2015 at 7:52 AM, Pat Ferrel < >> > p...@occamsmachete.com> >> > >>>>>>> wrote: >> > >>>>>>>>> Looks like Guava is in Spark. >> > >>>>>>>>> >> > >>>>>>>>> On Jan 29, 2015, at 4:03 PM, Pat Ferrel < >> p...@occamsmachete.com> >> > >>>>> wrote: >> > >>>>>>>>> IndexedDataset uses Guava. Can’t tell from sure but it sounds >> > like >> > >>>>> this >> > >>>>>>>>> would not be included since I think it was taken from the >> > mrlegacy >> > >>>>> jar. 
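One conventional way to keep the Guava version synchronized with the one Spark ships, as suggested above, is a `dependencyManagement` pin in the parent POM. This is a hypothetical fragment, not taken from the Mahout build; 16.0 is the version both trees above happen to resolve:

```xml
<!-- Hypothetical parent-POM fragment: force every module (mahout-math,
     mahout-spark, ...) to resolve the same Guava that Spark brings in. -->
<dependencyManagement>
  <dependencies>
    <dependency>
      <groupId>com.google.guava</groupId>
      <artifactId>guava</artifactId>
      <version>16.0</version>
    </dependency>
  </dependencies>
</dependencyManagement>
```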
>> > >>>>>>>>> On Jan 25, 2015, at 10:52 AM, Dmitriy Lyubimov < >> > dlie...@gmail.com> >> >>>>>>> wrote: >> > >>>>>>>>> ---------- Forwarded message ---------- >> > >>>>>>>>> From: "Pat Ferrel" <p...@occamsmachete.com> >> > >>>>>>>>> Date: Jan 25, 2015 9:39 AM >> > >>>>>>>>> Subject: Re: Codebase refactoring proposal >> > >>>>>>>>> To: <dev@mahout.apache.org> >> > >>>>>>>>> Cc: >> > >>>>>>>>> >> > >>>>>>>>>> When you get a chance a PR would be good. >> > >>>>>>>>> Yes, it would. And not just for that. >> > >>>>>>>>> >> > >>>>>>>>>> As I understand it you are putting some class jars somewhere >> in >> > >> the >> > >>>>>>>>> classpath. Where? How? >> > >>>>>>>>> /bin/mahout >> > >>>>>>>>> >> > >>>>>>>>> (Computes 2 different classpaths. See 'bin/mahout classpath' >> vs. >> > >>>>>>>>> 'bin/mahout -spark'.) >> > >>>>>>>>> >> > >>>>>>>>> If i interpret current shell code there correctly, legacy path >> > >> tries >> > >>>>> to >> > >>>>>>>>> use >> > >>>>>>>>> examples assemblies if not packaged, or /lib if packaged. True >> > >>>>>>> motivation >> > >>>>>>>>> of that significantly predates 2010 and i suspect only Benson >> > knows >> > >>>>>>> whole >> > >>>>>>>>> true intent there. >> > >>>>>>>>> >> > >>>>>>>>> The spark path, which is really a quick hack of the script, >> tries >> > >> to >> > >>>>> get >> > >>>>>>>>> only selected mahout jars and the locally installed spark >> classpath >> > >>>>> which i >> > >>>>>>>>> guess is just the shaded spark jar in recent spark releases. >> It >> > >> also >> > >>>>>>>>> apparently tries to include /libs/*, which is never compiled >> in >> > >>>>>>> the unpackaged >> > >>>>>>>>> version, and now i think it is a bug it is included because >> > >> /libs/* >> > >>>>> is >> > >>>>>>>>> apparently legacy packaging, and shouldn't be used in spark >> jobs >> > >>>>> with a >> > >>>>>>>>> wildcard. I can't believe how lazy i am, i still did not find >> time >> > >> to >> > >>>>>>>>> understand the mahout build in all cases.
>> > >>>>>>>>> >> > >>>>>>>>> I am not even sure if packaged mahout will work with spark, >> > >> honestly, >> > >>>>>>>>> because of the /lib. Never tried that, since i mostly use >> > >> application >> > >>>>>>>>> embedding techniques. >> > >>>>>>>>> >> > >>>>>>>>> The same solution may apply to adding external dependencies >> and >> > >>>>> removing >> > >>>>>>>>> the assembly in the Spark module. Which would leave only one >> > major >> > >>>>> build >> > >>>>>>>>> issue afaik. >> > >>>>>>>>>> On Jan 24, 2015, at 11:53 PM, Dmitriy Lyubimov < >> > dlie...@gmail.com >> > >>>>>>>>> wrote: >> > >>>>>>>>>> No, no PR. Only experiment on private. But i believe i >> > >> sufficiently >> > >>>>>>>>> defined >> > >>>>>>>>>> what i want to do in order to gauge if we may want to >> advance it >> > >>>>> some >> > >>>>>>>>> time >> > >>>>>>>>>> later. Goal is much lighter dependency for spark code. >> Eliminate >> > >>>>>>>>> everything >> > >>>>>>>>>> that is not compile-time dependent. (and a lot of it is thru >> > >> legacy >> > >>>>> MR >> > >>>>>>>>> code >> > >>>>>>>>>> which we of course don't use). >> > >>>>>>>>>> >> > >>>>>>>>>> Cant say i understand the remaining issues you are talking >> about >> > >>>>>>> though. >> > >>>>>>>>>> If you are talking about compiling lib or shaded assembly, >> no, >> > >> this >> > >>>>>>>>> doesn't >> > >>>>>>>>>> do anything about it. Although point is, as it stands, the >> > algebra >> > >>>>> and >> > >>>>>>>>>> shell don't have any external dependencies but spark and >> these 4 >> > >>>>> (5?) >> > >>>>>>>>>> mahout jars so they technically don't even need an assembly >> (as >> > >>>>>>>>>> demonstrated). >> > >>>>>>>>>> >> > >>>>>>>>>> As i said, it seems driver code is the only one that may need >> > some >> > >>>>>>>>> external >> > >>>>>>>>>> dependencies, but that's a different scenario from those i am >> > >>>>> talking >> > >>>>>>>>>> about. 
But i am relatively happy with having the first two >> > working >> > >>>>>>>>> nicely >> > >>>>>>>>>> at this point. >> > >>>>>>>>>> >> > >>>>>>>>>> On Sat, Jan 24, 2015 at 9:06 AM, Pat Ferrel < >> > >> p...@occamsmachete.com> >> > >>>>>>>>> wrote: >> > >>>>>>>>>>> +1 >> > >>>>>>>>>>> >> > >>>>>>>>>>> Is there a PR? You mention a "tiny mahout-hadoop” module. It >> > >> would >> > >>>>> be >> > >>>>>>>>> nice >> > >>>>>>>>>>> to see how you’ve structured that in case we can use the >> same >> > >>>>> model to >> > >>>>>>>>>>> solve the two remaining refactoring issues. >> > >>>>>>>>>>> 1) external dependencies in the spark module >> > >>>>>>>>>>> 2) no spark or h2o in the release artifacts. >> > >>>>>>>>>>> >> > >>>>>>>>>>> On Jan 23, 2015, at 6:45 PM, Shannon Quinn < >> squ...@gatech.edu> >> > >>>>> wrote: >> > >>>>>>>>>>> Also +1 >> > >>>>>>>>>>> >> > >>>>>>>>>>> iPhone'd >> > >>>>>>>>>>> >> > >>>>>>>>>>>> On Jan 23, 2015, at 18:38, Andrew Palumbo < >> ap....@outlook.com >> > > >> > >>>>>>> wrote: >> > >>>>>>>>>>>> +1 >> > >>>>>>>>>>>> >> > >>>>>>>>>>>> >> > >>>>>>>>>>>> Sent from my Verizon Wireless 4G LTE smartphone >> > >>>>>>>>>>>> >> > >>>>>>>>>>>> <div>-------- Original message --------</div><div>From: >> > Dmitriy >> > >>>>>>>>> Lyubimov >> > >>>>>>>>>>> <dlie...@gmail.com> </div><div>Date:01/23/2015 6:06 PM >> > >>>>> (GMT-05:00) >> > >>>>>>>>>>> </div><div>To: dev@mahout.apache.org </div><div>Subject: >> > >> Codebase >> > >>>>>>>>>>> refactoring proposal </div><div> >> > >>>>>>>>>>>> </div> >> > >>>>>>>>>>>> So right now mahout-spark depends on mr-legacy. >> > >>>>>>>>>>>> I did quick refactoring and it turns out it only >> _irrevocably_ >> > >>>>>>> depends >> > >>>>>>>>> on >> > >>>>>>>>>>>> the following classes there: >> > >>>>>>>>>>>> >> > >>>>>>>>>>>> MatrixWritable, VectorWritable, Varint/Varlong and >> > >> VarintWritable, >> > >>>>>>> and >> > >>>>>>>>>>> ... 
>> > >>>>>>>>>>>> *sigh* o.a.m.common.Pair >> > >>>>>>>>>>>> >> > >>>>>>>>>>>> So I just dropped those five classes into a new tiny >> > >>>>>>>>> mahout-hadoop >> > >>>>>>>>>>>> module (to signify stuff that is directly relevant to >> > >> serializing >> > >>>>>>>>> things >> > >>>>>>>>>>> to >> > >>>>>>>>>>>> the DFS API) and completely removed mrlegacy and its transients >> > from >> > >>>>>>> spark >> > >>>>>>>>>>> and >> > >>>>>>>>>>>> spark-shell dependencies. >> > >>>>>>>>>>>> >> > >>>>>>>>>>>> So non-cli applications (shell scripts and embedded api >> use) >> > >>>>> actually >> > >>>>>>>>>>> only >> > >>>>>>>>>>>> need spark dependencies (which come from SPARK_HOME >> classpath, >> > >> of >> > >>>>>>>>> course) >> > >>>>>>>>>>>> and mahout jars (mahout-spark, mahout-math(-scala), >> > >> mahout-hadoop >> > >>>>> and >> > >>>>>>>>>>>> optionally mahout-spark-shell (for running shell)). >> > >>>>>>>>>>>> >> > >>>>>>>>>>>> This of course still doesn't address driver problems that >> want >> > >> to >> > >>>>>>>>> throw >> > >>>>>>>>>>>> more stuff into the front-end classpath (such as a cli parser) >> but >> > at >> > >>>>> least >> > >>>>>>>>> it >> > >>>>>>>>>>>> renders the transitive luggage of mr-legacy (and the size of >> > >>>>>>>>> worker-shipped >> > >>>>>>>>>>>> jars) much more tolerable. >> > >>>>>>>>>>>> >> > >>>>>>>>>>>> How does that sound?
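For reference, the seq2sparse vectorization step discussed earlier in the thread (dictionary and counts held as in-memory maps) could be sketched as below; an RDD-backed version would swap these collections for joins. Purely illustrative plain Scala, not the actual port:

```scala
// Hedged sketch of dictionary-based vectorization: build a term -> column
// index dictionary from tokenized docs, then turn each doc into a sparse
// (index -> term frequency) map. Names are illustrative.
object Vectorize {
  // Term -> column-index dictionary over the whole corpus.
  def dictionary(docs: Seq[Seq[String]]): Map[String, Int] =
    docs.flatten.distinct.sorted.zipWithIndex.toMap

  // One document as a sparse term-frequency vector.
  def tf(doc: Seq[String], dict: Map[String, Int]): Map[Int, Double] =
    doc.groupBy(identity).collect {
      case (term, occurrences) if dict.contains(term) =>
        dict(term) -> occurrences.size.toDouble
    }
}
```

Keeping the dictionary as a map mirrors the in-memory path item similarity uses today; the join-based alternative trades memory footprint for shuffle cost, as discussed above.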