graph = graft, sorry. Graft just the AtB class into the 0.12 codebase.
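(For reference, a minimal timing harness for that experiment might look like
the sketch below. The input DRMs and the output path are placeholders, and it
assumes Mahout's Spark bindings DSL, so the same script can be run against
both the grafted and the stock A'B operator.)

    import org.apache.mahout.math.drm._
    import org.apache.mahout.math.drm.RLikeDrmOps._

    // drmA and drmB are placeholder inputs, loaded identically in both builds;
    // an implicit Spark DistributedContext is assumed to be in scope.
    val t0 = System.currentTimeMillis()
    val drmAtB = (drmA.t %*% drmB).checkpoint() // logical A'B; the optimizer picks the AtB operator
    drmAtB.dfsWrite("/tmp/atb-test")            // hypothetical path; forces full materialization
    println(s"A'B took ${System.currentTimeMillis() - t0} ms")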
On Mon, May 2, 2016 at 9:06 AM, Dmitriy Lyubimov <[email protected]> wrote:

> ok.
>
> Nikaash,
> could you perhaps do one more experiment and graph the 0.10 a'b code into
> 0.12 code (or whatever branch you say is not working the same) so we could
> confirm that the culprit change is indeed A'B?
>
> thank you very much.
>
> -d
>
> On Mon, May 2, 2016 at 3:35 AM, Nikaash Puri <[email protected]> wrote:
>
>> Hi,
>>
>> I tried commenting out those lines and it did marginally improve the
>> performance, although the 0.10 version still significantly outperforms it.
>>
>> Here is a screenshot of the saveAsTextFile job (attached as selection1).
>> The AtB step took about 34 minutes, which is significantly more than with
>> 0.10. The saveAsTextFile action takes about 9 minutes as well.
>>
>> The selection2 file is a screenshot of the flatMap at AtB.scala job,
>> which ran for 34 minutes.
>>
>> Also, I'm using multiple indicators. As of Mahout 0.10, the first AtB
>> would take time, while subsequent such operations for the other
>> indicators would be orders of magnitude faster. In the current job, the
>> subsequent AtB operations take about as long as the first one.
>>
>> A snapshot of my code is as follows:
>>
>> var existingRowIDs: Option[BiDictionary] = None
>>
>> // The first action named in the sequence is the "primary" action and
>> // begins to fill up the user dictionary
>> for (actionDescription <- actionInput) {
>>   // grab the path to actions
>>   val action: IndexedDataset = SparkEngine.indexedDatasetDFSReadElements(
>>     actionDescription._2,
>>     schema = DefaultIndexedDatasetElementReadSchema,
>>     existingRowIDs = existingRowIDs)
>>   existingRowIDs = Some(action.rowIDs)
>>
>>   ...
>> }
>>
>> which seems fairly standard, so I hope I'm not making a mistake here.
>>
>> It looks like the 0.11-onward version uses computeAtBZipped3 to perform
>> the multiplication in atb_nograph_mmul, unlike 0.10, which used
>> atb_nograph. Though I'm not really sure whether that makes much of a
>> difference.
>>
>> Thank you,
>> Nikaash Puri
>>
>> On Sat, Apr 30, 2016 at 12:36 AM Pat Ferrel <[email protected]> wrote:
>>
>>> Right, will do. But Nikaash, if you could just comment out those lines
>>> and see if it has an effect, it would be informative and perhaps even
>>> solve your problem sooner than my changes. No great rush. Playing around
>>> with different values, as Dmitriy says, might yield better results, and
>>> for that you can mess with the code or wait for my changes.
>>>
>>> Yeah, it’s fast enough in most cases. The main work is the optimized
>>> A’A, A’B stuff in the BLAS optimizer Dmitriy put in. It is something
>>> like 10x faster than a similar algo in Hadoop MR. This particular calc
>>> and generalization is not in any other Spark or, now, Flink lib that I
>>> know of.
>>>
>>>
>>> On Apr 29, 2016, at 11:24 AM, Dmitriy Lyubimov <[email protected]> wrote:
>>>
>>> Nikaash,
>>>
>>> yes, unfortunately you may need to play with parallelism manually for
>>> your particular load/cluster to get the best out of it. I guess Pat
>>> will be adding the option.
>>>
>>> On Fri, Apr 29, 2016 at 11:14 AM, Nikaash Puri <[email protected]> wrote:
>>>
>>> > Hi,
>>> >
>>> > Sure, I’ll do some more detailed analysis of the jobs on the UI and
>>> > share screenshots if possible.
>>> >
>>> > Pat, yup, I’ll only be able to get to this on Monday, though. I’ll
>>> > comment out the line and see the difference in performance.
>>> >
>>> > Thanks so much for helping, guys, I really appreciate it.
>>> >
>>> > Also, the algorithm implementation for LLR is extremely performant, at
>>> > least as of Mahout 0.10. I ran some tests on around 61 days of data
>>> > (which in our case is a fair amount) and the model was built in about
>>> > 20 minutes, which is pretty amazing. This was using a pretty
>>> > decent-sized cluster, though.
>>> >
>>> > Thank you,
>>> > Nikaash Puri
>>> >
>>> > On 29-Apr-2016, at 10:18 PM, Pat Ferrel <[email protected]> wrote:
>>> >
>>> > There are some other changes I want to make for the next rev, so I’ll
>>> > do that.
>>> >
>>> > Nikaash, it would still be nice to verify this fixes your problem.
>>> > Also, if you want to create a Jira, it will guarantee I don’t forget.
>>> >
>>> >
>>> > On Apr 29, 2016, at 9:23 AM, Dmitriy Lyubimov <[email protected]> wrote:
>>> >
>>> > yes -- I would do it as an optional option -- just like par does: do
>>> > nothing, try auto, or try an exact number of splits
>>> >
>>> > On Fri, Apr 29, 2016 at 9:15 AM, Pat Ferrel <[email protected]> wrote:
>>> >
>>> >> It’s certainly easy to put this in the driver, taking it out of the
>>> >> algo.
>>> >>
>>> >> Dmitriy, is it a candidate for an Option param to the algo? That
>>> >> would catch cases where people rely on it now (like my old DStream
>>> >> example) but easily allow it to be overridden to None to imitate
>>> >> pre-0.11, or passed in when the app knows better.
>>> >>
>>> >> Nikaash, are you in a position to comment out the .par(auto=true)
>>> >> and see if it makes a difference?
>>> >>
>>> >>
>>> >> On Apr 29, 2016, at 8:53 AM, Dmitriy Lyubimov <[email protected]> wrote:
>>> >>
>>> >> can you please look into the Spark UI and write down how many splits
>>> >> the job generates in the first stage of the pipeline, or anywhere
>>> >> else there's significant variation in # of splits between the two
>>> >> cases?
>>> >>
>>> >> the row similarity is a very short pipeline (in comparison with what
>>> >> would normally be the case on average), so only the first input
>>> >> re-splitting is critical.
>>> >>
>>> >> The splitting along the products is adjusted by the optimizer
>>> >> automatically to match the amount of data segments observed on
>>> >> average in the input(s). e.g. if you compute val C = A %*% B and A
>>> >> has 500 elements per split and B has 5000 elements per split, then C
>>> >> would have approximately 5000 elements per split (the larger average
>>> >> in binary operator cases). That's approximately how it works.
>>> >>
>>> >> However, the par() that has been added is messing with the initial
>>> >> parallelism, which naturally affects the rest of the pipeline per
>>> >> the above. I now doubt it was a good thing -- when I suggested Pat
>>> >> try this, I did not mean to put it _inside_ the algorithm itself,
>>> >> but rather in the input preparation code in his particular case.
>>> >> However, I don't think it will work in every case. Actually, the
>>> >> sweet-spot parallelism for multiplication unfortunately depends on
>>> >> tons of factors -- network bandwidth and hardware configuration --
>>> >> so it is difficult to give a good guess universally. More likely,
>>> >> for CLI-based prepackaged algorithms (I don't use the CLI but rather
>>> >> assemble pipelines in Scala via scripting and Scala application
>>> >> code), the initial parallelization adjustment options should
>>> >> probably be provided to the CLI.
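(To make the split-propagation point above concrete, here is a minimal sketch
of tuning parallelism at input-preparation time rather than inside the
algorithm. The matrix values and split counts are placeholders;
par(min/exact/auto) is the Mahout 0.11+ operator under discussion.)

    import org.apache.mahout.math.scalabindings._
    import org.apache.mahout.math.drm._
    import org.apache.mahout.math.drm.RLikeDrmOps._

    // A tiny in-core matrix stands in for the real input; an implicit
    // DistributedContext (Spark) is assumed to be in scope.
    val drmA = drmParallelize(dense((1, 2, 3), (3, 4, 5), (5, 6, 7)), numPartitions = 2)

    // Adjust the *input* splits once, outside the algorithm. Binary operators
    // downstream inherit roughly the larger per-split element count of their
    // operands, so this single adjustment shapes the whole short pipeline.
    val drmAPar = drmA.par(auto = true)   // or drmA.par(exact = 192), or no par() at all

    val drmAtA = drmAPar.t %*% drmAPar    // the downstream A'A inherits the splits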

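(And the optional-parameter shape Pat suggests might look roughly like the
following. withParallelism and parOpt are hypothetical names for illustration,
not the actual SimilarityAnalysis API.)

    import scala.reflect.ClassTag
    import org.apache.mahout.math.drm._

    // Hypothetical helper: None leaves the splits alone (imitating pre-0.11
    // behavior), Some(-1) asks Mahout to guess, Some(n) forces n splits.
    def withParallelism[K: ClassTag](drm: DrmLike[K], parOpt: Option[Int]): DrmLike[K] =
      parOpt match {
        case None     => drm                  // do nothing
        case Some(-1) => drm.par(auto = true) // try auto
        case Some(n)  => drm.par(exact = n)   // try an exact number of splits
      }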