btw good seq2sparse and seqdirectory ports are the only things that separate us from having a bigram/trigram-based LSA tutorial.
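A bigram/trigram ("shingle") pass of the kind such a tutorial would need can be sketched in plain Scala; the object and method names here are illustrative, not an actual Mahout API:

```scala
// Hedged sketch: n-gram generation over an in-memory token list, joining
// adjacent tokens with '_' the way Lucene's ShingleFilter does. A real
// seq2sparse port would run this per document inside an RDD map.
object NGrams {
  // Emit all n-grams of sizes 1..maxN from a token sequence.
  def ngrams(tokens: Seq[String], maxN: Int): Seq[String] =
    (1 to maxN).flatMap { n =>
      tokens.sliding(n).filter(_.size == n).map(_.mkString("_"))
    }
}
```

For example, `NGrams.ngrams(Seq("latent", "semantic", "analysis"), 2)` yields the unigrams plus `latent_semantic` and `semantic_analysis`.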
On Wed, Feb 4, 2015 at 10:35 AM, Dmitriy Lyubimov <dlie...@gmail.com> wrote: > i think they are debating the details now, not the idea. Like how "NA" is > different from "null" in classic dataframe representation etc. > > On Wed, Feb 4, 2015 at 8:18 AM, Suneel Marthi <suneel.mar...@gmail.com> > wrote: > >> I believe they r still debating about renaming SchemaRDD -> Data Frame. I >> must admit Dmitriy had suggested this to me a few months ago reusing >> SchemaRDD if possible. Dmitriy was right "U told us". >> >> On Wed, Feb 4, 2015 at 11:09 AM, Pat Ferrel <p...@occamsmachete.com> >> wrote: >> >> > This sounds like a great idea but I wonder if we can get rid of Mahout >> DRM >> > as a native format. If we have DataFrames (have they actually renamed >> > SchemaRDD?) backed DRMs we ideally don’t need Mahout native DRMs or >> > IndexedDatasets, right? This would be a huge step! If we get data >> > interchangeability with MLlib it's a win. If we get general row and >> column >> > IDs that follow the data through math, it's a win. Need to think through >> how >> > to use a DataFrame in a streaming case, probably through some >> checkpointing >> > of the window DStream—hmm. >> > >> > On Feb 4, 2015, at 7:37 AM, Andrew Palumbo <ap....@outlook.com> wrote: >> > >> > >> > On 02/03/2015 08:22 PM, Dmitriy Lyubimov wrote: >> > > I'd suggest to consider this: remember all this talk about >> > > language-integrated spark ql being basically dataframe manipulation >> DSL? >> > > >> > > so now Spark devs are noticing this generality as well and are >> actually >> > > proposing to rename SchemaRDD into DataFrame and make it a mainstream >> data >> > > structure. (my "told you so" moment of sorts :) >> > > >> > > What i am getting at, i'd suggest to make DRM and Spark's newly >> renamed >> > > DataFrame our two major structures.
In particular, standardize on >> using >> > > DataFrame for things that may include non-numerical data and require >> more >> > > grace about column naming and manipulation. Maybe relevant to TF-IDF >> work >> > > when it deals with non-matrix content. >> > Sounds like a worthy effort to me. We'd be basically implementing an >> API >> > at the math-scala level for SchemaRDD/Dataframe datastructures correct? >> > >> > On Tue, Feb 3, 2015 at 5:01 PM, Pat Ferrel <p...@occamsmachete.com> >> wrote: >> > >> Seems like seq2sparse would be really easy to replace since it takes >> > text >> > >> files to start with, then the whole pipeline could be kept in rdds. >> The >> > >> dictionaries and counts could be either in-memory maps or rdds for >> use >> > with >> > >> joins? This would get rid of sequence files completely from the >> > pipeline. >> > >> Item similarity uses in-memory maps but the plan is to make it more >> > >> scalable using joins as an alternative with the same API allowing the >> > user >> > >> to trade-off footprint for speed. >> > >> > I think you're right- should be relatively easy. I've been looking at >> > porting seq2sparse to the DSL for bit now and the stopper at the DSL >> level >> > is that we don't have a distributed data structure for strings..Seems >> like >> > getting a DataFrame implemented as Dmitriy mentioned above would take >> care >> > of this problem. >> > >> > The other issue i'm a little fuzzy on is the distributed collocation >> > mapping- it's a part of the seq2sparse code that I've not spent too >> much >> > time in. >> > >> > I think that this would be very worthy effort as well- I believe >> > seq2sparse is a particular strong mahout feature. >> > >> > I'll start another thread since we're now way off topic from the >> > refactoring proposal. >> > >> >> > >> My use for TF-IDF is for row similarity and would take a DRM >> (actually >> > >> IndexedDataset) and calculate row/doc similarities. 
It works now but >> > only >> >> using LLR. This is OK when thinking of the items as tags or metadata but >> > >> for text tokens something like cosine may be better. >> > >> >> > >> I’d imagine a downsampling phase that would precede TF-IDF using LLR >> a >> > lot >> > >> like how CF preferences are downsampled. This would produce a >> > sparsified >> > >> all-docs DRM. Then (if the counts were saved) TF-IDF would re-weight >> the >> > >> terms before row similarity uses cosine. This is not so good for >> search >> > but >> > >> should produce much better similarities than Solr’s “moreLikeThis” >> and >> > does >> > >> it for all pairs rather than one at a time. >> > >> >> > >> In any case it can be used to create a personalized >> content-based >> > >> recommender or augment a CF recommender with one more indicator type. >> > >> >> > >> On Feb 3, 2015, at 3:37 PM, Andrew Palumbo <ap....@outlook.com> >> wrote: >> > >> >> > >> >> > >> On 02/03/2015 12:44 PM, Andrew Palumbo wrote: >> > >>> On 02/03/2015 12:22 PM, Pat Ferrel wrote: >> > >>>> Some issues WRT lower level Spark integration: >> > >>>> 1) interoperability with Spark data. TF-IDF is one example I >> actually >> > >> looked at. There may be other things we can pick up from their >> > committers >> > >> since they have an abundance. >> > >>>> 2) wider acceptance of Mahout DSL. The DSL’s power was illustrated >> to >> > >> me when someone on the Spark list asked about matrix transpose and an >> > MLlib >> > >> committer’s answer was something like “why would you want to do >> that?”. >> > >> Usually you don’t actually execute the transpose but they don’t even >> > >> support A’A, AA’, or A’B, which are core to what I work on. At >> present >> > you >> > >> pretty much have to choose between MLlib or Mahout for sparse matrix >> > stuff. >> > >> Maybe a half-way measure is some implicit conversions (ugh, I know).
>> If >> > the >> > >> DSL could interchange datasets with MLlib, people would be pointed to >> > the >> > >> DSL for all of a bunch of “why would you want to do that?” features. >> > MLlib >> > >> seems to be algorithms, not math. >> > >>>> 3) integration of Streaming. DStreams support most of the RDD >> > >> interface. Doing a batch recalc on a moving time window would nearly >> > fall >> > >> out of DStream backed DRMs. This isn’t the same as incremental >> updates >> > on >> > >> streaming but it’s a start. >> > >>>> Last year we were looking at Hadoop Mapreduce vs Spark, H2O, Flink >> > >> faster compute engines. So we jumped. Now the need is for streaming >> and >> > >> especially incrementally updated streaming. Seems like we need to >> > address >> > >> this. >> > >>>> Andrew, regardless of the above having TF-IDF would be super >> > >> helpful—row similarity for content/text would benefit greatly. >> > >>> I will put a PR up soon. >> > >> Just to clarify, I'll be porting over the (very simple) TF, TFIDF >> > classes >> > >> and Weight interface over from mr-legacy to math-scala. They're >> > available >> > >> now in spark-shell but won't be after this refactoring. These still >> > >> require dictionary and a frequency count maps to vectorize incoming >> > text- >> > >> so they're more for use with the old MR seq2sparse and I don't think >> > they >> > >> can be used with Spark's HashingTF and IDF. I'll put them up soon. >> > >> Hopefully they'll be of some use. >> > >> >> > >> On Feb 3, 2015, at 8:47 AM, Dmitriy Lyubimov <dlie...@gmail.com> >> wrote: >> > >>>> But first I need to do massive fixes and improvements to the >> > distributed >> > >>>> optimizer itself. Still waiting on green light for that. 
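The TF/TF-IDF weighting Andrew describes porting (a dictionary plus document-frequency maps used to vectorize incoming text) can be sketched roughly as below. This is illustrative plain Scala, not the ported classes; the exact idf formula in the Lucene-derived code differs slightly:

```scala
// Hedged sketch of dictionary-map-based TF-IDF re-weighting. A term-frequency
// vector is keyed by dictionary index; dfs maps index -> document frequency.
object Tfidf {
  // Classic tf * log(N / df) weighting (Lucene-style variants differ).
  def weight(tf: Double, df: Long, numDocs: Long): Double =
    tf * math.log(numDocs.toDouble / df)

  // Re-weight a whole sparse term-frequency vector.
  def weightVector(tfs: Map[Int, Double], dfs: Map[Int, Long],
                   numDocs: Long): Map[Int, Double] =
    tfs.map { case (idx, tf) => idx -> weight(tf, dfs(idx), numDocs) }
}
```

Note a term appearing in every document (df == numDocs) gets weight 0, which is exactly the sparsifying effect wanted before a cosine row-similarity pass.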
>> > >>>> On Feb 3, 2015 8:45 AM, "Dmitriy Lyubimov" <dlie...@gmail.com> >> wrote: >> > >>>> >> > >>>>> On Feb 3, 2015 7:20 AM, "Pat Ferrel" <p...@occamsmachete.com> >> wrote: >> > >>>>>> BTW what level of difficulty would making the DSL run on MLlib >> > Vectors >> > >>>>> and RowMatrix be? Looking at using their hashing TF-IDF but it raises >> > >>>>> an impedance mismatch between DRM and MLlib RowMatrix. This would >> > further >> > >>>>> reduce artifact size by a bunch. >> > >>>>> >> > >>>>> Short answer, if it were possible, I'd not bother with Mahout code >> > >> base at >> > >>>>> all. The problem is it lacks sufficient flexibility in semantics and >> > >>>>> abstraction. Breeze is infinitely better in that department but >> at >> > >> the >> > >>>>> time it was sufficiently worse on abstracting interoperability of >> > >> matrices >> > >>>>> with different structures. And mllib does not expose breeze. >> > >>>>> >> > >>>>> Looking forward toward hardware accelerated bolt-on work I just >> must >> > >> say >> > >>>>> after reading breeze code for some time I still have a much clearer >> > plan >> > >> how >> > >>>>> such back hybridization and cost calibration might work with >> current >> > >> Mahout >> > >>>>> math abstractions than with breeze. It is also more in line with >> my >> > >> current >> > >>>>> work tasks. >> > >>>>> >> > >>>>>> Also backing something like a DRM with DStreams. Periodic model >> > recalc >> > >>>>> with streams is maybe the first step towards truly streaming >> algos. >> > >> Looking >> > >>>>> at DStream -> DRM conversion for A’A, A’B, and AA’ in item and row >> > >>>>> similarity. Attach Kafka and get evergreen models, if not >> > incrementally >> > >>>>> updating models. >> > >>>>>> On Feb 2, 2015, at 4:54 PM, Dmitriy Lyubimov <dlie...@gmail.com> >> > >> wrote: >> > >>>>>> bottom line compile-time dependencies are satisfied with no extra >> > >> stuff >> > >>>>>> from mr-legacy or its transitives.
This is proven by virtue of >> > >>>>> successful >> > >>>>>> compilation with no dependency on mr-legacy on the tree. >> > >>>>>> >> > >>>>>> Runtime sufficiency for no extra dependency is proven via running >> > >> shell >> > >>>>> or >> > >>>>>> embedded tests (unit tests) which are successful too. This >> implies >> > >>>>>> embedding and shell apis. >> > >>>>>> >> > >>>>>> Issue with guava is typical one. if it were an issue, i wouldn't >> be >> > >> able >> > >>>>> to >> > >>>>>> compile and/or run stuff. Now, question is what do we do if >> drivers >> > >> want >> > >>>>>> extra stuff that is not found in Spark. >> > >>>>>> >> > >>>>>> Now, It is so nice not to depend on anything extra so i am >> hesitant >> > to >> > >>>>>> offer anything here. either shading or lib with opt-in >> dependency >> > >> policy >> > >>>>>> would suffice though, since it doesn't look like we'd have to >> have >> > >> tons >> > >>>>> of >> > >>>>>> extra for drivers. >> > >>>>>> >> > >>>>>> >> > >>>>>> >> > >>>>>> On Sat, Jan 31, 2015 at 10:17 AM, Pat Ferrel < >> p...@occamsmachete.com >> > > >> > >>>>> wrote: >> > >>>>>>> I vaguely remember there being a Guava version problem where the >> > >>>>> version >> > >>>>>>> had to be rolled back in one of the hadoop modules. The >> math-scala >> > >>>>>>> IndexedDataset shouldn’t care about version. >> > >>>>>>> >> > >>>>>>> BTW It seems pretty easy to take out the option parser and >> replace >> > >> with >> > >>>>>>> match and tuples especially if we can extend the Scala App >> class. >> > It >> > >>>>> might >> > >>>>>>> actually simplify things since I can then use several case >> classes >> > to >> > >>>>> hold >> > >>>>>>> options (scopt needed one object), which in turn takes out all >> > those >> > >>>>> ugly >> > >>>>>>> casts. I’ll take a look next time I’m in there. 
>> > >>>>>>> >> > >>>>>>> On Jan 30, 2015, at 4:07 PM, Dmitriy Lyubimov < >> dlie...@gmail.com> >> > >>>>> wrote: >> > >>>>>>> in 'spark' module it is overwritten with spark dependency, which >> > also >> > >>>>> comes >> > >>>>>>> at the same version so happens. so should be fine with 1.1.x >> > >>>>>>> >> > >>>>>>> [INFO] --- maven-dependency-plugin:2.8:tree (default-cli) @ >> > >>>>>>> mahout-spark_2.10 --- >> > >>>>>>> [INFO] org.apache.mahout:mahout-spark_2.10:jar:1.0-SNAPSHOT >> > >>>>>>> [INFO] +- org.apache.spark:spark-core_2.10:jar:1.1.0:compile >> > >>>>>>> [INFO] | +- org.apache.hadoop:hadoop-client:jar:2.2.0:compile >> > >>>>>>> [INFO] | | +- >> org.apache.hadoop:hadoop-common:jar:2.2.0:compile >> > >>>>>>> [INFO] | | | +- commons-cli:commons-cli:jar:1.2:compile >> > >>>>>>> [INFO] | | | +- >> org.apache.commons:commons-math:jar:2.1:compile >> > >>>>>>> [INFO] | | | +- commons-io:commons-io:jar:2.4:compile >> > >>>>>>> [INFO] | | | +- >> > commons-logging:commons-logging:jar:1.1.3:compile >> > >>>>>>> [INFO] | | | +- commons-lang:commons-lang:jar:2.6:compile >> > >>>>>>> [INFO] | | | +- >> > >>>>>>> commons-configuration:commons-configuration:jar:1.6:compile >> > >>>>>>> [INFO] | | | | +- >> > >>>>>>> commons-collections:commons-collections:jar:3.2.1:compile >> > >>>>>>> [INFO] | | | | +- >> > >> commons-digester:commons-digester:jar:1.8:compile >> > >>>>>>> [INFO] | | | | | \- >> > >>>>>>> commons-beanutils:commons-beanutils:jar:1.7.0:compile >> > >>>>>>> [INFO] | | | | \- >> > >>>>>>> commons-beanutils:commons-beanutils-core:jar:1.8.0:compile >> > >>>>>>> [INFO] | | | +- org.apache.avro:avro:jar:1.7.4:compile >> > >>>>>>> [INFO] | | | +- >> > >> com.google.protobuf:protobuf-java:jar:2.5.0:compile >> > >>>>>>> [INFO] | | | +- >> org.apache.hadoop:hadoop-auth:jar:2.2.0:compile >> > >>>>>>> [INFO] | | | \- >> > >>>>> org.apache.commons:commons-compress:jar:1.4.1:compile >> > >>>>>>> [INFO] | | | \- org.tukaani:xz:jar:1.0:compile >> > >>>>>>> [INFO] | | 
+- org.apache.hadoop:hadoop-hdfs:jar:2.2.0:compile >> > >>>>>>> [INFO] | | +- >> > >>>>>>> org.apache.hadoop:hadoop-mapreduce-client-app:jar:2.2.0:compile >> > >>>>>>> [INFO] | | | +- >> > >>>>>>> >> org.apache.hadoop:hadoop-mapreduce-client-common:jar:2.2.0:compile >> > >>>>>>> [INFO] | | | | +- >> > >>>>>>> org.apache.hadoop:hadoop-yarn-client:jar:2.2.0:compile >> > >>>>>>> [INFO] | | | | | +- com.google.inject:guice:jar:3.0:compile >> > >>>>>>> [INFO] | | | | | | +- >> javax.inject:javax.inject:jar:1:compile >> > >>>>>>> [INFO] | | | | | | \- >> aopalliance:aopalliance:jar:1.0:compile >> > >>>>>>> [INFO] | | | | | +- >> > >>>>>>> >> > >>>>>>> >> > >> >> > >> com.sun.jersey.jersey-test-framework:jersey-test-framework-grizzly2:jar:1.9:compile >> > >>>>>>> [INFO] | | | | | | +- >> > >>>>>>> >> > >>>>>>> >> > >> >> > >> com.sun.jersey.jersey-test-framework:jersey-test-framework-core:jar:1.9:compile >> > >>>>>>> [INFO] | | | | | | | +- >> > >>>>>>> javax.servlet:javax.servlet-api:jar:3.0.1:compile >> > >>>>>>> [INFO] | | | | | | | \- >> > >>>>> com.sun.jersey:jersey-client:jar:1.9:compile >> > >>>>>>> [INFO] | | | | | | \- >> > >>>>> com.sun.jersey:jersey-grizzly2:jar:1.9:compile >> > >>>>>>> [INFO] | | | | | | +- >> > >>>>>>> org.glassfish.grizzly:grizzly-http:jar:2.1.2:compile >> > >>>>>>> [INFO] | | | | | | | \- >> > >>>>>>> org.glassfish.grizzly:grizzly-framework:jar:2.1.2:compile >> > >>>>>>> [INFO] | | | | | | | \- >> > >>>>>>> org.glassfish.gmbal:gmbal-api-only:jar:3.0.0-b023:compile >> > >>>>>>> [INFO] | | | | | | | \- >> > >>>>>>> org.glassfish.external:management-api:jar:3.0.0-b012:compile >> > >>>>>>> [INFO] | | | | | | +- >> > >>>>>>> org.glassfish.grizzly:grizzly-http-server:jar:2.1.2:compile >> > >>>>>>> [INFO] | | | | | | | \- >> > >>>>>>> org.glassfish.grizzly:grizzly-rcm:jar:2.1.2:compile >> > >>>>>>> [INFO] | | | | | | +- >> > >>>>>>> org.glassfish.grizzly:grizzly-http-servlet:jar:2.1.2:compile >> > >>>>>>> [INFO] | | | | | | \- >> > >>>>> 
org.glassfish:javax.servlet:jar:3.1:compile >> > >>>>>>> [INFO] | | | | | +- >> > com.sun.jersey:jersey-server:jar:1.9:compile >> > >>>>>>> [INFO] | | | | | | +- asm:asm:jar:3.1:compile >> > >>>>>>> [INFO] | | | | | | \- >> > >> com.sun.jersey:jersey-core:jar:1.9:compile >> > >>>>>>> [INFO] | | | | | +- >> com.sun.jersey:jersey-json:jar:1.9:compile >> > >>>>>>> [INFO] | | | | | | +- >> > >>>>> org.codehaus.jettison:jettison:jar:1.1:compile >> > >>>>>>> [INFO] | | | | | | | \- stax:stax-api:jar:1.0.1:compile >> > >>>>>>> [INFO] | | | | | | +- >> > >>>>> com.sun.xml.bind:jaxb-impl:jar:2.2.3-1:compile >> > >>>>>>> [INFO] | | | | | | | \- >> > >>>>> javax.xml.bind:jaxb-api:jar:2.2.2:compile >> > >>>>>>> [INFO] | | | | | | | \- >> > >>>>>>> javax.activation:activation:jar:1.1:compile >> > >>>>>>> [INFO] | | | | | | +- >> > >>>>>>> org.codehaus.jackson:jackson-jaxrs:jar:1.8.3:compile >> > >>>>>>> [INFO] | | | | | | \- >> > >>>>>>> org.codehaus.jackson:jackson-xc:jar:1.8.3:compile >> > >>>>>>> [INFO] | | | | | \- >> > >>>>>>> com.sun.jersey.contribs:jersey-guice:jar:1.9:compile >> > >>>>>>> [INFO] | | | | \- >> > >>>>>>> org.apache.hadoop:hadoop-yarn-server-common:jar:2.2.0:compile >> > >>>>>>> [INFO] | | | \- >> > >>>>>>> >> org.apache.hadoop:hadoop-mapreduce-client-shuffle:jar:2.2.0:compile >> > >>>>>>> [INFO] | | +- >> org.apache.hadoop:hadoop-yarn-api:jar:2.2.0:compile >> > >>>>>>> [INFO] | | +- >> > >>>>>>> org.apache.hadoop:hadoop-mapreduce-client-core:jar:2.2.0:compile >> > >>>>>>> [INFO] | | | \- >> > >>>>> org.apache.hadoop:hadoop-yarn-common:jar:2.2.0:compile >> > >>>>>>> [INFO] | | +- >> > >>>>>>> >> > org.apache.hadoop:hadoop-mapreduce-client-jobclient:jar:2.2.0:compile >> > >>>>>>> [INFO] | | \- >> > >> org.apache.hadoop:hadoop-annotations:jar:2.2.0:compile >> > >>>>>>> [INFO] | +- net.java.dev.jets3t:jets3t:jar:0.7.1:compile >> > >>>>>>> [INFO] | | +- commons-codec:commons-codec:jar:1.3:compile >> > >>>>>>> [INFO] | | \- >> > 
commons-httpclient:commons-httpclient:jar:3.1:compile >> > >>>>>>> [INFO] | +- >> org.apache.curator:curator-recipes:jar:2.4.0:compile >> > >>>>>>> [INFO] | | +- >> > >> org.apache.curator:curator-framework:jar:2.4.0:compile >> > >>>>>>> [INFO] | | | \- >> > >> org.apache.curator:curator-client:jar:2.4.0:compile >> > >>>>>>> [INFO] | | \- org.apache.zookeeper:zookeeper:jar:3.4.5:compile >> > >>>>>>> [INFO] | | \- jline:jline:jar:0.9.94:compile >> > >>>>>>> [INFO] | +- >> > >> org.eclipse.jetty:jetty-plus:jar:8.1.14.v20131031:compile >> > >>>>>>> [INFO] | | +- >> > >>>>>>> >> > >> >> > >> org.eclipse.jetty.orbit:javax.transaction:jar:1.1.1.v201105210645:compile >> > >>>>>>> [INFO] | | +- >> > >>>>> org.eclipse.jetty:jetty-webapp:jar:8.1.14.v20131031:compile >> > >>>>>>> [INFO] | | | +- >> > >>>>> org.eclipse.jetty:jetty-xml:jar:8.1.14.v20131031:compile >> > >>>>>>> [INFO] | | | \- >> > >>>>>>> org.eclipse.jetty:jetty-servlet:jar:8.1.14.v20131031:compile >> > >>>>>>> [INFO] | | \- >> > >>>>> org.eclipse.jetty:jetty-jndi:jar:8.1.14.v20131031:compile >> > >>>>>>> [INFO] | | \- >> > >>>>>>> >> > >>>>>>> >> > >> >> > >> org.eclipse.jetty.orbit:javax.mail.glassfish:jar:1.4.1.v201005082020:compile >> > >>>>>>> [INFO] | | \- >> > >>>>>>> >> > >> >> org.eclipse.jetty.orbit:javax.activation:jar:1.1.0.v201105071233:compile >> > >>>>>>> [INFO] | +- >> > >>>>> org.eclipse.jetty:jetty-security:jar:8.1.14.v20131031:compile >> > >>>>>>> [INFO] | +- >> > >> org.eclipse.jetty:jetty-util:jar:8.1.14.v20131031:compile >> > >>>>>>> [INFO] | +- >> > >>>>> org.eclipse.jetty:jetty-server:jar:8.1.14.v20131031:compile >> > >>>>>>> [INFO] | | +- >> > >>>>>>> >> > org.eclipse.jetty.orbit:javax.servlet:jar:3.0.0.v201112011016:compile >> > >>>>>>> [INFO] | | +- >> > >>>>>>> >> org.eclipse.jetty:jetty-continuation:jar:8.1.14.v20131031:compile >> > >>>>>>> [INFO] | | \- >> > >>>>> org.eclipse.jetty:jetty-http:jar:8.1.14.v20131031:compile >> > >>>>>>> [INFO] | | \- >> > >>>>> 
org.eclipse.jetty:jetty-io:jar:8.1.14.v20131031:compile >> > >>>>>>> [INFO] | +- com.google.guava:guava:jar:16.0:compile >> > >>>>>>> d >> > >>>>>>> >> > >>>>>>> On Fri, Jan 30, 2015 at 4:03 PM, Dmitriy Lyubimov < >> > dlie...@gmail.com >> > >>>>>>> wrote: >> > >>>>>>> >> > >>>>>>>> looks like it is also requested by mahout-math, wonder what is >> > using >> > >>>>> it >> > >>>>>>>> there. >> > >>>>>>>> >> > >>>>>>>> At very least, it needs to be synchronized to the one currently >> > used >> > >>>>> by >> > >>>>>>>> spark. >> > >>>>>>>> >> > >>>>>>>> [INFO] --- maven-dependency-plugin:2.8:tree (default-cli) @ >> > >>>>> mahout-hadoop >> > >>>>>>>> --- >> > >>>>>>>> [INFO] org.apache.mahout:mahout-hadoop:jar:1.0-SNAPSHOT >> > >>>>>>>> *[INFO] +- >> org.apache.mahout:mahout-math:jar:1.0-SNAPSHOT:compile* >> > >>>>>>>> [INFO] | +- org.apache.commons:commons-math3:jar:3.2:compile >> > >>>>>>>> *[INFO] | +- com.google.guava:guava:jar:16.0:compile* >> > >>>>>>>> [INFO] | \- com.tdunning:t-digest:jar:2.0.2:compile >> > >>>>>>>> [INFO] +- >> > >>>>> org.apache.mahout:mahout-math:test-jar:tests:1.0-SNAPSHOT:test >> > >>>>>>>> [INFO] +- org.apache.hadoop:hadoop-client:jar:2.2.0:compile >> > >>>>>>>> [INFO] | +- org.apache.hadoop:hadoop-common:jar:2.2.0:compile >> > >>>>>>>> >> > >>>>>>>> >> > >>>>>>>> On Fri, Jan 30, 2015 at 7:52 AM, Pat Ferrel < >> > p...@occamsmachete.com> >> > >>>>>>> wrote: >> > >>>>>>>>> Looks like Guava is in Spark. >> > >>>>>>>>> >> > >>>>>>>>> On Jan 29, 2015, at 4:03 PM, Pat Ferrel < >> p...@occamsmachete.com> >> > >>>>> wrote: >> > >>>>>>>>> IndexedDataset uses Guava. Can’t tell from sure but it sounds >> > like >> > >>>>> this >> > >>>>>>>>> would not be included since I think it was taken from the >> > mrlegacy >> > >>>>> jar. 
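One conventional way to keep the Guava version synchronized with the one Spark ships, as suggested above, is a `dependencyManagement` pin in the parent POM. This is a hypothetical fragment, not taken from the Mahout build; 16.0 is the version both trees above happen to resolve:

```xml
<!-- Hypothetical parent-POM fragment: force every module (mahout-math,
     mahout-spark, ...) to resolve the same Guava that Spark brings in. -->
<dependencyManagement>
  <dependencies>
    <dependency>
      <groupId>com.google.guava</groupId>
      <artifactId>guava</artifactId>
      <version>16.0</version>
    </dependency>
  </dependencies>
</dependencyManagement>
```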
>> > >>>>>>>>> On Jan 25, 2015, at 10:52 AM, Dmitriy Lyubimov < >> > dlie...@gmail.com> >> >>>>>>> wrote: >> > >>>>>>>>> ---------- Forwarded message ---------- >> > >>>>>>>>> From: "Pat Ferrel" <p...@occamsmachete.com> >> > >>>>>>>>> Date: Jan 25, 2015 9:39 AM >> > >>>>>>>>> Subject: Re: Codebase refactoring proposal >> > >>>>>>>>> To: <dev@mahout.apache.org> >> > >>>>>>>>> Cc: >> > >>>>>>>>> >> > >>>>>>>>>> When you get a chance a PR would be good. >> > >>>>>>>>> Yes, it would. And not just for that. >> > >>>>>>>>> >> > >>>>>>>>>> As I understand it you are putting some class jars somewhere >> in >> > >> the >> > >>>>>>>>> classpath. Where? How? >> > >>>>>>>>> /bin/mahout >> > >>>>>>>>> >> > >>>>>>>>> (Computes 2 different classpaths. See 'bin/mahout classpath' >> vs. >> > >>>>>>>>> 'bin/mahout -spark'.) >> > >>>>>>>>> >> > >>>>>>>>> If i interpret current shell code there correctly, legacy path >> > >> tries >> > >>>>> to >> > >>>>>>>>> use >> > >>>>>>>>> examples assemblies if not packaged, or /lib if packaged. True >> > >>>>>>> motivation >> > >>>>>>>>> of that significantly predates 2010 and i suspect only Benson >> > knows >> > >>>>>>> whole >> > >>>>>>>>> true intent there. >> > >>>>>>>>> >> > >>>>>>>>> The spark path, which is really a quick hack of the script, >> tries >> > >> to >> > >>>>> get >> > >>>>>>>>> only selected mahout jars and the locally installed spark >> classpath >> > >>>>> which i >> > >>>>>>>>> guess is just the shaded spark jar in recent spark releases. >> It >> > >> also >> > >>>>>>>>> apparently tries to include /libs/*, which is never compiled >> in >> > >>>>>>> the unpackaged >> > >>>>>>>>> version, and now i think it is a bug it is included because >> > >> /libs/* >> > >>>>> is >> > >>>>>>>>> apparently legacy packaging, and shouldn't be used in spark >> jobs >> > >>>>> with a >> > >>>>>>>>> wildcard. I can't believe how lazy i am, i still did not find >> time >> > >> to >> > >>>>>>>>> understand the mahout build in all cases.
>> > >>>>>>>>> >> > >>>>>>>>> I am not even sure if packaged mahout will work with spark, >> > >> honestly, >> > >>>>>>>>> because of the /lib. Never tried that, since i mostly use >> > >> application >> > >>>>>>>>> embedding techniques. >> > >>>>>>>>> >> > >>>>>>>>> The same solution may apply to adding external dependencies >> and >> > >>>>> removing >> > >>>>>>>>> the assembly in the Spark module. Which would leave only one >> > major >> > >>>>> build >> > >>>>>>>>> issue afaik. >> > >>>>>>>>>> On Jan 24, 2015, at 11:53 PM, Dmitriy Lyubimov < >> > dlie...@gmail.com >> > >>>>>>>>> wrote: >> > >>>>>>>>>> No, no PR. Only experiment on private. But i believe i >> > >> sufficiently >> > >>>>>>>>> defined >> > >>>>>>>>>> what i want to do in order to gauge if we may want to >> advance it >> > >>>>> some >> > >>>>>>>>> time >> > >>>>>>>>>> later. Goal is much lighter dependency for spark code. >> Eliminate >> > >>>>>>>>> everything >> > >>>>>>>>>> that is not compile-time dependent. (and a lot of it is thru >> > >> legacy >> > >>>>> MR >> > >>>>>>>>> code >> > >>>>>>>>>> which we of course don't use). >> > >>>>>>>>>> >> > >>>>>>>>>> Cant say i understand the remaining issues you are talking >> about >> > >>>>>>> though. >> > >>>>>>>>>> If you are talking about compiling lib or shaded assembly, >> no, >> > >> this >> > >>>>>>>>> doesn't >> > >>>>>>>>>> do anything about it. Although point is, as it stands, the >> > algebra >> > >>>>> and >> > >>>>>>>>>> shell don't have any external dependencies but spark and >> these 4 >> > >>>>> (5?) >> > >>>>>>>>>> mahout jars so they technically don't even need an assembly >> (as >> > >>>>>>>>>> demonstrated). >> > >>>>>>>>>> >> > >>>>>>>>>> As i said, it seems driver code is the only one that may need >> > some >> > >>>>>>>>> external >> > >>>>>>>>>> dependencies, but that's a different scenario from those i am >> > >>>>> talking >> > >>>>>>>>>> about. 
But i am relatively happy with having the first two >> > working >> > >>>>>>>>> nicely >> > >>>>>>>>>> at this point. >> > >>>>>>>>>> >> > >>>>>>>>>> On Sat, Jan 24, 2015 at 9:06 AM, Pat Ferrel < >> > >> p...@occamsmachete.com> >> > >>>>>>>>> wrote: >> > >>>>>>>>>>> +1 >> > >>>>>>>>>>> >> > >>>>>>>>>>> Is there a PR? You mention a "tiny mahout-hadoop” module. It >> > >> would >> > >>>>> be >> > >>>>>>>>> nice >> > >>>>>>>>>>> to see how you’ve structured that in case we can use the >> same >> > >>>>> model to >> > >>>>>>>>>>> solve the two remaining refactoring issues. >> > >>>>>>>>>>> 1) external dependencies in the spark module >> > >>>>>>>>>>> 2) no spark or h2o in the release artifacts. >> > >>>>>>>>>>> >> > >>>>>>>>>>> On Jan 23, 2015, at 6:45 PM, Shannon Quinn < >> squ...@gatech.edu> >> > >>>>> wrote: >> > >>>>>>>>>>> Also +1 >> > >>>>>>>>>>> >> > >>>>>>>>>>> iPhone'd >> > >>>>>>>>>>> >> > >>>>>>>>>>>> On Jan 23, 2015, at 18:38, Andrew Palumbo < >> ap....@outlook.com >> > > >> > >>>>>>> wrote: >> > >>>>>>>>>>>> +1 >> > >>>>>>>>>>>> >> > >>>>>>>>>>>> >> > >>>>>>>>>>>> Sent from my Verizon Wireless 4G LTE smartphone >> > >>>>>>>>>>>> >> > >>>>>>>>>>>> <div>-------- Original message --------</div><div>From: >> > Dmitriy >> > >>>>>>>>> Lyubimov >> > >>>>>>>>>>> <dlie...@gmail.com> </div><div>Date:01/23/2015 6:06 PM >> > >>>>> (GMT-05:00) >> > >>>>>>>>>>> </div><div>To: dev@mahout.apache.org </div><div>Subject: >> > >> Codebase >> > >>>>>>>>>>> refactoring proposal </div><div> >> > >>>>>>>>>>>> </div> >> > >>>>>>>>>>>> So right now mahout-spark depends on mr-legacy. >> > >>>>>>>>>>>> I did quick refactoring and it turns out it only >> _irrevocably_ >> > >>>>>>> depends >> > >>>>>>>>> on >> > >>>>>>>>>>>> the following classes there: >> > >>>>>>>>>>>> >> > >>>>>>>>>>>> MatrixWritable, VectorWritable, Varint/Varlong and >> > >> VarintWritable, >> > >>>>>>> and >> > >>>>>>>>>>> ... 
>> > >>>>>>>>>>>> *sigh* o.a.m.common.Pair >> > >>>>>>>>>>>> >> > >>>>>>>>>>>> So I just dropped those five classes into a new tiny >> > >>>>>>>>> mahout-hadoop >> > >>>>>>>>>>>> module (to signify stuff that is directly relevant to >> > >> serializing >> > >>>>>>>>> things >> > >>>>>>>>>>> to >> > >>>>>>>>>>>> the DFS API) and completely removed mrlegacy and its transients >> > from >> > >>>>>>> spark >> > >>>>>>>>>>> and >> > >>>>>>>>>>>> spark-shell dependencies. >> > >>>>>>>>>>>> >> > >>>>>>>>>>>> So non-cli applications (shell scripts and embedded api >> use) >> > >>>>> actually >> > >>>>>>>>>>> only >> > >>>>>>>>>>>> need spark dependencies (which come from SPARK_HOME >> classpath, >> > >> of >> > >>>>>>>>> course) >> > >>>>>>>>>>>> and mahout jars (mahout-spark, mahout-math(-scala), >> > >> mahout-hadoop >> > >>>>> and >> > >>>>>>>>>>>> optionally mahout-spark-shell (for running shell)). >> > >>>>>>>>>>>> >> > >>>>>>>>>>>> This of course still doesn't address driver problems that >> want >> > >> to >> > >>>>>>>>> throw >> > >>>>>>>>>>>> more stuff into the front-end classpath (such as a cli parser) >> but >> > at >> > >>>>> least >> > >>>>>>>>> it >> > >>>>>>>>>>>> renders the transitive luggage of mr-legacy (and the size of >> > >>>>>>>>> worker-shipped >> > >>>>>>>>>>>> jars) much more tolerable. >> > >>>>>>>>>>>> >> > >>>>>>>>>>>> How does that sound?
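For reference, the seq2sparse vectorization step discussed earlier in the thread (dictionary and counts held as in-memory maps) could be sketched as below; an RDD-backed version would swap these collections for joins. Purely illustrative plain Scala, not the actual port:

```scala
// Hedged sketch of dictionary-based vectorization: build a term -> column
// index dictionary from tokenized docs, then turn each doc into a sparse
// (index -> term frequency) map. Names are illustrative.
object Vectorize {
  // Term -> column-index dictionary over the whole corpus.
  def dictionary(docs: Seq[Seq[String]]): Map[String, Int] =
    docs.flatten.distinct.sorted.zipWithIndex.toMap

  // One document as a sparse term-frequency vector.
  def tf(doc: Seq[String], dict: Map[String, Int]): Map[Int, Double] =
    doc.groupBy(identity).collect {
      case (term, occurrences) if dict.contains(term) =>
        dict(term) -> occurrences.size.toDouble
    }
}
```

Keeping the dictionary as a map mirrors the in-memory path item similarity uses today; the join-based alternative trades memory footprint for shuffle cost, as discussed above.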