ok. Nikaash, could you perhaps do one more experiment and graft the 0.10 A'B code into the 0.12 code (or whatever branch you say is not working the same) so we could confirm that the culprit change is indeed A'B?
thank you very much.
-d

On Mon, May 2, 2016 at 3:35 AM, Nikaash Puri <[email protected]> wrote:

> Hi,
>
> I tried commenting out those lines and it did marginally improve
> performance, although the 0.10 version still significantly outperforms it.
>
> Here is a screenshot of the saveAsTextFile job (attached as selection1).
> The AtB step took about 34 mins, which is significantly more than with
> 0.10. The saveAsTextFile action takes about 9 mins as well.
>
> The selection2 file is a screenshot of the flatMap at AtB.scala job,
> which ran for 34 minutes.
>
> Also, I'm using multiple indicators. As of Mahout 0.10, the first AtB
> would take time, while subsequent such operations for the other
> indicators would be orders of magnitude faster. In the current job, the
> subsequent AtB operations take about as long as the first one.
>
> A snapshot of my code is as follows:
>
> var existingRowIDs: Option[BiDictionary] = None
>
> // The first action named in the sequence is the "primary" action and
> // begins to fill up the user dictionary
> for (actionDescription <- actionInput) {
>   // grab the path to actions
>   val action: IndexedDataset = SparkEngine.indexedDatasetDFSReadElements(
>     actionDescription._2,
>     schema = DefaultIndexedDatasetElementReadSchema,
>     existingRowIDs = existingRowIDs)
>   existingRowIDs = Some(action.rowIDs)
>
>   ...
> }
>
> which seems fairly standard, so I hope I'm not making a mistake here.
>
> It looks like 0.11 onward uses computeAtBZipped3 for the multiplication
> in atb_nograph_mmul, unlike 0.10, which used atb_nograph. Though I'm not
> really sure whether that makes much of a difference.
>
> Thank you,
> Nikaash Puri
>
> On Sat, Apr 30, 2016 at 12:36 AM Pat Ferrel <[email protected]> wrote:
>
>> Right, will do. But Nikaash, if you could just comment out those lines
>> and see if it has an effect, it would be informative and perhaps even
>> solve your problem sooner than my changes. No great rush. Playing around
>> with different values, as Dmitriy says, might yield better results, and
>> for that you can mess with the code or wait for my changes.
>>
>> Yeah, it's fast enough in most cases. The main work is the optimized
>> A'A, A'B stuff in the BLAS optimizer Dmitriy put in. It is something
>> like 10x faster than a similar algo in Hadoop MR. This particular calc
>> and generalization is not in any other Spark or, now, Flink lib that I
>> know of.
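As a side note on the optimized A'A / A'B machinery Pat mentions above: in the Samsara DSL the whole computation is a one-line logical plan that the optimizer lowers to a physical AtB operator. A minimal sketch, assuming an implicit distributed context is in scope; drmA and drmB are placeholder names, not code from the thread:

    import org.apache.mahout.math.drm._
    import org.apache.mahout.math.drm.RLikeDrmOps._

    // Builds the logical plan A' %*% B; no transpose is materialized.
    // Per the thread, the optimizer picks the physical operator
    // (atb_nograph in 0.10, computeAtBZipped3 in 0.11+).
    def atb(drmA: DrmLike[Int], drmB: DrmLike[Int]): DrmLike[Int] =
      (drmA.t %*% drmB).checkpoint()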
>>
>> On Apr 29, 2016, at 11:24 AM, Dmitriy Lyubimov <[email protected]> wrote:
>>
>> Nikaash,
>>
>> yes, unfortunately you may need to play with parallelism for your
>> particular load/cluster manually to get the best out of it. I guess Pat
>> will be adding the option.
>>
>> On Fri, Apr 29, 2016 at 11:14 AM, Nikaash Puri <[email protected]>
>> wrote:
>>
>> > Hi,
>> >
>> > Sure, I'll do some more detailed analysis of the jobs on the UI and
>> > share screenshots if possible.
>> >
>> > Pat, yup, I'll only be able to get to this on Monday, though. I'll
>> > comment out the line and see the difference in performance.
>> >
>> > Thanks so much for helping, guys, I really appreciate it.
>> >
>> > Also, the algorithm implementation for LLR is extremely performant, at
>> > least as of Mahout 0.10. I ran some tests on around 61 days of data
>> > (which in our case is a fair amount) and the model was built in about
>> > 20 minutes, which is pretty amazing. This was using a pretty decent
>> > sized cluster, though.
>> >
>> > Thank you,
>> > Nikaash Puri
>> >
>> > On 29-Apr-2016, at 10:18 PM, Pat Ferrel <[email protected]> wrote:
>> >
>> > There are some other changes I want to make for the next rev, so I'll
>> > do that.
>> >
>> > Nikaash, it would still be nice to verify this fixes your problem.
>> > Also, if you want to create a Jira, it will guarantee I don't forget.
>> >
>> > On Apr 29, 2016, at 9:23 AM, Dmitriy Lyubimov <[email protected]>
>> > wrote:
>> >
>> > yes -- i would do it as an optional option -- just like par does -- do
>> > nothing, try auto, or try an exact number of splits
>> >
>> > On Fri, Apr 29, 2016 at 9:15 AM, Pat Ferrel <[email protected]>
>> > wrote:
>> >
>> >> It's certainly easy to put this in the driver, taking it out of the
>> >> algo.
>> >>
>> >> Dmitriy, is it a candidate for an Option param to the algo? That
>> >> would catch cases where people rely on it now (like my old DStream
>> >> example) but easily allow it to be overridden to None to imitate
>> >> pre-0.11, or passed in when the app knows better.
>> >>
>> >> Nikaash, are you in a position to comment out the .par(auto=true) and
>> >> see if it makes a difference?
>> >>
>> >> On Apr 29, 2016, at 8:53 AM, Dmitriy Lyubimov <[email protected]>
>> >> wrote:
>> >>
>> >> can you please look into the Spark UI and write down how many splits
>> >> the job generates in the first stage of the pipeline, or anywhere
>> >> else there's significant variation in # of splits in both cases?
>> >>
>> >> the row similarity is a very short pipeline (in comparison with what
>> >> would normally be average), so only the first input re-splitting is
>> >> critical.
>> >>
>> >> The splitting along the products is adjusted by the optimizer
>> >> automatically to match the amount of data segments observed on
>> >> average in the input(s). e.g. if you compute val C = A %*% B, and A
>> >> has 500 elements per split and B has 5000 elements per split, then C
>> >> would have approximately 5000 elements per split (the larger average,
>> >> in binary operator cases). That's approximately how it works.
>> >>
>> >> However, the par() that has been added is messing with the initial
>> >> parallelism, which naturally affects the rest of the pipeline per the
>> >> above. I now doubt it was a good thing -- when i suggested Pat try
>> >> this, i did not mean to put it _inside_ the algorithm itself, but
>> >> rather into the input preparation code in his particular case. I
>> >> don't think it will work in every case. Actually, the sweet-spot
>> >> parallelism for multiplication unfortunately depends on tons of
>> >> factors -- network bandwidth and hardware configuration -- so it is
>> >> difficult to give it a good guess universally. More likely, for
>> >> cli-based prepackaged algorithms (I don't use the CLI but rather
>> >> assemble pipelines in scala via scripting and scala application code)
>> >> the initial parallelization adjustment options should probably be
>> >> provided to the CLI.
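To put rough shapes on Dmitriy's description, here is a small sketch. The matrix sizes, seeds, and split counts are made up for illustration; it assumes an implicit Mahout DistributedContext in scope and the par(min/exact/auto) signature the thread refers to:

    import org.apache.mahout.math._
    import org.apache.mahout.math.drm._
    import org.apache.mahout.math.drm.RLikeDrmOps._

    // Two same-sized inputs with deliberately different split counts,
    // i.e. different average elements-per-split.
    val drmA = drmParallelize(Matrices.symmetricUniformView(5000, 50, 1234),
      numPartitions = 2)
    val drmB = drmParallelize(Matrices.symmetricUniformView(5000, 50, 4321),
      numPartitions = 20)

    // Per the explanation above, the product's splits are sized to the
    // larger average elements-per-split of the operands, so C inherits
    // the coarser splitting (closer to drmA's here).
    val drmC = (drmA.t %*% drmB).checkpoint()

    // par() re-splits an input before the rest of the pipeline runs:
    val drmAuto  = drmA.par(auto = true)  // let the engine guess
    val drmExact = drmA.par(exact = 16)   // pin an exact split count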

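Finally, one possible shape for the Option param Pat and Dmitriy discuss. A hypothetical sketch only: the helper name withInitialPar is invented here for illustration and is not actual Mahout API:

    import org.apache.mahout.math.drm._

    // None leaves the input's natural splits alone (imitating pre-0.11
    // behavior); Some(n) pins the initial parallelism before the algo runs.
    def withInitialPar(drm: DrmLike[Int],
                       parOpt: Option[Int] = None): DrmLike[Int] =
      parOpt.map(n => drm.par(exact = n)).getOrElse(drm)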