Hi,

OK, so another interesting result. When I compute cross-cooccurrences with user profile attributes that have high cardinality (for instance city), the AtB step completes in roughly 11 minutes on a given data set. If I run the same calculation on a profile attribute like gender, which has only two distinct values, the AtB step is much slower. In my case, the profile attribute I was actually using had only a handful of distinct values.
Could this be because the indicator matrix no longer remains sparse (just venturing a guess here)? These results are from Mahout 0.10 and Spark 1.2.0.

Thank you,
Nikaash Puri

On Tue, May 3, 2016 at 6:26 AM Dmitriy Lyubimov <[email protected]> wrote:

> graph = graft, sorry. Graft just the AtB class into the 0.12 codebase.
>
> On Mon, May 2, 2016 at 9:06 AM, Dmitriy Lyubimov <[email protected]> wrote:
>
> > ok.
> >
> > Nikaash,
> > could you perhaps do one more experiment and graph the 0.10 a'b code into
> > the 0.12 code (or whatever branch you say is not working the same) so we
> > could confirm that the culprit change is indeed AB'?
> >
> > thank you very much.
> >
> > -d
> >
> > On Mon, May 2, 2016 at 3:35 AM, Nikaash Puri <[email protected]> wrote:
> >
> >> Hi,
> >>
> >> I tried commenting out those lines and it did marginally improve the
> >> performance, although the 0.10 version still significantly outperforms it.
> >>
> >> Here is a screenshot of the saveAsTextFile job (attached as selection1).
> >> The AtB step took about 34 minutes, which is significantly more than with
> >> 0.10. Similarly, the saveAsTextFile action took about 9 minutes as well.
> >>
> >> The selection2 file is a screenshot of the flatMap at AtB.scala job,
> >> which ran for 34 minutes.
> >>
> >> Also, I'm using multiple indicators. As of Mahout 0.10, the first AtB
> >> would take time, while subsequent such operations for the other indicators
> >> would be orders of magnitude faster. In the current job, the subsequent
> >> AtB operations take time similar to the first one.
> >>
> >> A snapshot of my code is as follows:
> >>
> >> var existingRowIDs: Option[BiDictionary] = None
> >>
> >> // The first action named in the sequence is the "primary" action and
> >> // begins to fill up the user dictionary
> >> for (actionDescription <- actionInput) {
> >>   // grab the path to actions
> >>   val action: IndexedDataset = SparkEngine.indexedDatasetDFSReadElements(
> >>     actionDescription._2,
> >>     schema = DefaultIndexedDatasetElementReadSchema,
> >>     existingRowIDs = existingRowIDs)
> >>   existingRowIDs = Some(action.rowIDs)
> >>
> >>   ...
> >> }
> >>
> >> which seems fairly standard, so I hope I'm not making a mistake here.
> >>
> >> It looks like the 0.11-onward version is using computeAtBZipped3 for
> >> performing the multiplication in atb_nograph_mmul, unlike 0.10, which was
> >> using atb_nograph. Though I'm not really sure whether that makes much of a
> >> difference.
> >>
> >> Thank you,
> >> Nikaash Puri
> >>
> >> On Sat, Apr 30, 2016 at 12:36 AM Pat Ferrel <[email protected]> wrote:
> >>
> >>> Right, will do. But Nikaash, if you could just comment out those lines
> >>> and see if it has an effect, it would be informative and perhaps even solve
> >>> your problem sooner than my changes. No great rush. Playing around with
> >>> different values, as Dmitriy says, might yield better results, and for that
> >>> you can mess with the code or wait for my changes.
> >>>
> >>> Yeah, it's fast enough in most cases. The main work is the optimized
> >>> A'A, A'B stuff in the BLAS optimizer Dmitriy put in. It is something like
> >>> 10x faster than a similar algo in Hadoop MR. This particular calc and
> >>> generalization is not in any other Spark or now Flink lib that I know of.
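For reference, the cross-cooccurrence term in question boils down to an A'B product in the Mahout Samsara DSL: A is users x items for the primary action, B is users x attribute values for the profile attribute. Below is a minimal sketch with toy data (helper names and signatures as I recall them from the 0.10-0.12 line; not the actual jobs discussed in this thread) illustrating the sparsity guess at the top of the thread: a two-valued attribute such as gender yields a small but essentially dense A'B, while a high-cardinality attribute such as city keeps it sparse.

    // Minimal sketch, Mahout Samsara Scala DSL with toy data (assumed names).
    import org.apache.mahout.math.scalabindings._
    import org.apache.mahout.math.drm._
    import org.apache.mahout.math.drm.RLikeDrmOps._
    import org.apache.mahout.sparkbindings._

    implicit val ctx = mahoutSparkContext(masterUrl = "local[4]", appName = "atb-sketch")

    // 6 users x 4 items, sparse interactions for the primary action.
    val A = drmParallelize(dense(
      (1, 0, 1, 0),
      (0, 1, 0, 0),
      (1, 0, 0, 1),
      (0, 0, 1, 0),
      (0, 1, 0, 1),
      (1, 0, 0, 0)), numPartitions = 2)

    // 6 users x 2 attribute values (e.g. gender): every user has exactly one entry,
    // so A.t %*% B has only 2 columns but is essentially fully dense. A high-cardinality
    // attribute (e.g. city) would give many columns with mostly zero co-counts.
    val B = drmParallelize(dense(
      (1, 0), (0, 1), (1, 0), (1, 0), (0, 1), (0, 1)), numPartitions = 2)

    // Items x attribute-values cross-cooccurrence counts.
    val AtB = (A.t %*% B).checkpoint()
    println(AtB.collect)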
> >>>
> >>> On Apr 29, 2016, at 11:24 AM, Dmitriy Lyubimov <[email protected]> wrote:
> >>>
> >>> Nikaash,
> >>>
> >>> yes, unfortunately you may need to play with parallelism for your particular
> >>> load/cluster manually to get the best out of it. I guess Pat will be adding
> >>> the option.
> >>>
> >>> On Fri, Apr 29, 2016 at 11:14 AM, Nikaash Puri <[email protected]> wrote:
> >>>
> >>> > Hi,
> >>> >
> >>> > Sure, I'll do some more detailed analysis of the jobs on the UI and share
> >>> > screenshots if possible.
> >>> >
> >>> > Pat, yup, I'll only be able to get to this on Monday, though. I'll comment
> >>> > out the line and see the difference in performance.
> >>> >
> >>> > Thanks so much for helping, guys, I really appreciate it.
> >>> >
> >>> > Also, the algorithm implementation for LLR is extremely performant, at
> >>> > least as of Mahout 0.10. I ran some tests on around 61 days of data (which
> >>> > in our case is a fair amount) and the model was built in about 20 minutes,
> >>> > which is pretty amazing. This was using a pretty decent-sized cluster,
> >>> > though.
> >>> >
> >>> > Thank you,
> >>> > Nikaash Puri
> >>> >
> >>> > On 29-Apr-2016, at 10:18 PM, Pat Ferrel <[email protected]> wrote:
> >>> >
> >>> > There are some other changes I want to make for the next rev, so I'll do
> >>> > that.
> >>> >
> >>> > Nikaash, it would still be nice to verify this fixes your problem; also, if
> >>> > you want to create a Jira it will guarantee I don't forget.
> >>> >
> >>> >
> >>> > On Apr 29, 2016, at 9:23 AM, Dmitriy Lyubimov <[email protected]> wrote:
> >>> >
> >>> > yes -- i would do it as an optional option -- just like par does -- do
> >>> > nothing; try auto, or try exact number of splits
> >>> >
> >>> > On Fri, Apr 29, 2016 at 9:15 AM, Pat Ferrel <[email protected]> wrote:
> >>> >
> >>> >> It's certainly easy to put this in the driver, taking it out of the algo.
> >>> >>
> >>> >> Dmitriy, is it a candidate for an Option param to the algo? That would
> >>> >> catch cases where people rely on it now (like my old DStream example) but
> >>> >> easily allow it to be overridden to None to imitate pre-0.11, or passed in
> >>> >> when the app knows better.
> >>> >>
> >>> >> Nikaash, are you in a position to comment out the .par(auto=true) and see
> >>> >> if it makes a difference?
> >>> >>
> >>> >>
> >>> >> On Apr 29, 2016, at 8:53 AM, Dmitriy Lyubimov <[email protected]> wrote:
> >>> >>
> >>> >> can you please look into the Spark UI and write down how many splits the
> >>> >> job generates in the first stage of the pipeline, or anywhere else there's
> >>> >> significant variation in # of splits in both cases?
> >>> >>
> >>> >> the row similarity is a very short pipeline (in comparison with what
> >>> >> pipelines normally are on average), so only the first input re-splitting is
> >>> >> critical.
> >>> >>
> >>> >> The splitting along the products is adjusted by the optimizer automatically
> >>> >> to match the amount of data segments observed on average in the input(s).
> >>> >> E.g. if you compute val C = A %*% B and A has 500 elements per split and
> >>> >> B has 5000 elements per split, then C would approximately have 5000
> >>> >> elements per split (the larger average in binary operator cases). That's
> >>> >> approximately how it works.
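A minimal sketch of the split-sizing rule just described (toy matrices, helper names and signatures as I recall them; real inputs would be DFS-backed and far larger):

    import org.apache.mahout.math.scalabindings._
    import org.apache.mahout.math.drm._
    import org.apache.mahout.math.drm.RLikeDrmOps._
    import org.apache.mahout.sparkbindings._

    implicit val ctx = mahoutSparkContext(masterUrl = "local[4]", appName = "split-sizing-sketch")

    // A is parallelized with few, large splits; B with many, small splits.
    val A = drmParallelize(dense((1, 0, 1), (0, 1, 0), (1, 1, 0), (0, 0, 1)), numPartitions = 1)
    val B = drmParallelize(dense((1, 0), (0, 1), (1, 1)), numPartitions = 3)

    // Per the explanation above, the optimizer sizes C's splits from the larger
    // average split of its operands, so the knob to turn is the parallelism of the
    // inputs, not of C itself.
    val C = (A %*% B).checkpoint()

    // Compare the task counts of the resulting stages in the Spark UI, as suggested above.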
> >>> >>
> >>> >> However, the par() that has been added is messing with the initial
> >>> >> parallelism, which would naturally affect the rest of the pipeline per the
> >>> >> above. I now doubt it was a good thing -- when I suggested Pat try this, I
> >>> >> did not mean to put it _inside_ the algorithm itself, but rather into the
> >>> >> accurate input-preparation code in his particular case. However, I don't
> >>> >> think it will work in any given case. Actually, sweet-spot parallelism for
> >>> >> multiplication unfortunately depends on tons of factors -- network bandwidth
> >>> >> and hardware configuration -- so it is difficult to give it a good guess
> >>> >> universally. More likely, for CLI-based prepackaged algorithms (I don't use
> >>> >> the CLI but rather assemble pipelines in Scala via scripting and Scala
> >>> >> application code) the initial parallelization adjustment options should
> >>> >> probably be provided to the CLI.
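Along the lines of that last point, here is a hedged sketch of what the adjustment looks like when it is done in driver-side input-preparation code rather than inside the algorithm. The par() operator and its auto/exact options are the ones mentioned in this thread; the split count is a made-up placeholder to be tuned per cluster, and helper names and signatures are as I recall them.

    import org.apache.mahout.math.scalabindings._
    import org.apache.mahout.math.drm._
    import org.apache.mahout.math.drm.RLikeDrmOps._
    import org.apache.mahout.sparkbindings._

    implicit val ctx = mahoutSparkContext(masterUrl = "local[4]", appName = "par-in-driver-sketch")

    // Stand-ins for the real user x item and user x attribute-value inputs.
    val A = drmParallelize(dense((1, 0, 1), (0, 1, 0), (1, 1, 0), (0, 0, 1)))
    val B = drmParallelize(dense((1, 0), (0, 1), (1, 0), (0, 1)))

    // Hypothetical value; as noted above there is no universal sweet spot.
    val inputSplits = 400

    // Re-split the inputs where the application knows the data and the cluster,
    // instead of relying on a .par(auto = true) buried inside the algorithm. For the
    // cooccurrence case discussed in this thread, the same call would go on each
    // IndexedDataset's underlying DRM before it is handed to the algorithm.
    val Ap = A.par(exact = inputSplits).checkpoint()
    val Bp = B.par(exact = inputSplits).checkpoint()

    // Downstream products now inherit their split sizing from the adjusted inputs.
    val AtB = (Ap.t %*% Bp).checkpoint()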
