Hi,

OK, so another interesting result. When I compute cross-cooccurrences with user profile attributes that have high cardinality (for instance city), the AtB step completes in roughly 11 minutes on a given data set. If I run the same calculation on a profile attribute like gender, which has only two distinct values, the AtB step is much slower. In my case, the profile attribute I was actually using had only a handful of distinct values.
Could this be because the indicator matrix no longer remains sparse (just venturing a guess here)? These results are from Mahout 0.10 and Spark 1.2.0.

Thank you,
Nikaash Puri

On Tue, May 3, 2016 at 6:26 AM Dmitriy Lyubimov <[email protected]> wrote:

> graph = graft, sorry. Graft just the AtB class into the 0.12 codebase.
>
> On Mon, May 2, 2016 at 9:06 AM, Dmitriy Lyubimov <[email protected]> wrote:
>
> > ok.
> >
> > Nikaash,
> > could you perhaps do one more experiment and graph the 0.10 a'b code into
> > the 0.12 code (or whatever branch you say is not working the same) so we
> > could confirm that the culprit change is indeed AB'?
> >
> > thank you very much.
> >
> > -d
> >
> > On Mon, May 2, 2016 at 3:35 AM, Nikaash Puri <[email protected]> wrote:
> >
> >> Hi,
> >>
> >> I tried commenting out those lines and it did marginally improve the
> >> performance, although the 0.10 version still significantly outperforms it.
> >>
> >> Here is a screenshot of the saveAsTextFile job (attached as selection1).
> >> The AtB step took about 34 minutes, which is significantly more than with
> >> 0.10. Similarly, the saveAsTextFile action took about 9 minutes as well.
> >>
> >> The selection2 file is a screenshot of the flatMap at AtB.scala job,
> >> which ran for 34 minutes.
> >>
> >> Also, I'm using multiple indicators. As of Mahout 0.10, the first AtB
> >> would take time, while subsequent such operations for the other indicators
> >> would be orders of magnitude faster. In the current job, the subsequent
> >> AtB operations take time similar to the first one.
> >>
> >> A snapshot of my code is as follows:
> >>
> >> var existingRowIDs: Option[BiDictionary] = None
> >>
> >> // The first action named in the sequence is the "primary" action and
> >> // begins to fill up the user dictionary
> >> for (actionDescription <- actionInput) {
> >>   // grab the path to actions
> >>   val action: IndexedDataset = SparkEngine.indexedDatasetDFSReadElements(
> >>     actionDescription._2,
> >>     schema = DefaultIndexedDatasetElementReadSchema,
> >>     existingRowIDs = existingRowIDs)
> >>   existingRowIDs = Some(action.rowIDs)
> >>
> >>   ...
> >> }
> >>
> >> which seems fairly standard, so I hope I'm not making a mistake here.
> >>
> >> It looks like the 0.11-onward version is using computeAtBZipped3 for
> >> performing the multiplication in atb_nograph_mmul, unlike 0.10, which was
> >> using atb_nograph. Though I'm not really sure whether that makes much of a
> >> difference.
> >>
> >> Thank you,
> >> Nikaash Puri
> >>
> >> On Sat, Apr 30, 2016 at 12:36 AM Pat Ferrel <[email protected]> wrote:
> >>
> >>> Right, will do. But Nikaash, if you could just comment out those lines
> >>> and see if it has an effect, it would be informative and perhaps even solve
> >>> your problem sooner than my changes. No great rush. Playing around with
> >>> different values, as Dmitriy says, might yield better results, and for that
> >>> you can mess with the code or wait for my changes.
> >>>
> >>> Yeah, it's fast enough in most cases. The main work is the optimized
> >>> A'A, A'B stuff in the BLAS optimizer Dmitriy put in. It is something like
> >>> 10x faster than a similar algo in Hadoop MR. This particular calc and
> >>> generalization is not in any other Spark or now Flink lib that I know of.
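For reference, the cross-cooccurrence term in question boils down to an A'B product in the Mahout Samsara DSL: A is users x items for the primary action, B is users x attribute values for the profile attribute. Below is a minimal sketch with toy data (helper names and signatures as I recall them from the 0.10-0.12 line; not the actual jobs discussed in this thread) illustrating the sparsity guess at the top of the thread: a two-valued attribute such as gender yields a small but essentially dense A'B, while a high-cardinality attribute such as city keeps it sparse.

    // Minimal sketch, Mahout Samsara Scala DSL with toy data (assumed names).
    import org.apache.mahout.math.scalabindings._
    import org.apache.mahout.math.drm._
    import org.apache.mahout.math.drm.RLikeDrmOps._
    import org.apache.mahout.sparkbindings._

    implicit val ctx = mahoutSparkContext(masterUrl = "local[4]", appName = "atb-sketch")

    // 6 users x 4 items, sparse interactions for the primary action.
    val A = drmParallelize(dense(
      (1, 0, 1, 0),
      (0, 1, 0, 0),
      (1, 0, 0, 1),
      (0, 0, 1, 0),
      (0, 1, 0, 1),
      (1, 0, 0, 0)), numPartitions = 2)

    // 6 users x 2 attribute values (e.g. gender): every user has exactly one entry,
    // so A.t %*% B has only 2 columns but is essentially fully dense. A high-cardinality
    // attribute (e.g. city) would give many columns with mostly zero co-counts.
    val B = drmParallelize(dense(
      (1, 0), (0, 1), (1, 0), (1, 0), (0, 1), (0, 1)), numPartitions = 2)

    // Items x attribute-values cross-cooccurrence counts.
    val AtB = (A.t %*% B).checkpoint()
    println(AtB.collect)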
> >>>
> >>> On Apr 29, 2016, at 11:24 AM, Dmitriy Lyubimov <[email protected]> wrote:
> >>>
> >>> Nikaash,
> >>>
> >>> yes, unfortunately you may need to play with parallelism for your particular
> >>> load/cluster manually to get the best out of it. I guess Pat will be adding
> >>> the option.
> >>>
> >>> On Fri, Apr 29, 2016 at 11:14 AM, Nikaash Puri <[email protected]> wrote:
> >>>
> >>> > Hi,
> >>> >
> >>> > Sure, I'll do some more detailed analysis of the jobs on the UI and share
> >>> > screenshots if possible.
> >>> >
> >>> > Pat, yup, I'll only be able to get to this on Monday, though. I'll comment
> >>> > out the line and see the difference in performance.
> >>> >
> >>> > Thanks so much for helping, guys, I really appreciate it.
> >>> >
> >>> > Also, the algorithm implementation for LLR is extremely performant, at
> >>> > least as of Mahout 0.10. I ran some tests on around 61 days of data (which
> >>> > in our case is a fair amount) and the model was built in about 20 minutes,
> >>> > which is pretty amazing. This was using a pretty decent-sized cluster,
> >>> > though.
> >>> >
> >>> > Thank you,
> >>> > Nikaash Puri
> >>> >
> >>> > On 29-Apr-2016, at 10:18 PM, Pat Ferrel <[email protected]> wrote:
> >>> >
> >>> > There are some other changes I want to make for the next rev, so I'll do
> >>> > that.
> >>> >
> >>> > Nikaash, it would still be nice to verify this fixes your problem; also, if
> >>> > you want to create a Jira it will guarantee I don't forget.
> >>> >
> >>> >
> >>> > On Apr 29, 2016, at 9:23 AM, Dmitriy Lyubimov <[email protected]> wrote:
> >>> >
> >>> > yes -- i would do it as an optional option -- just like par does -- do
> >>> > nothing; try auto, or try exact number of splits
> >>> >
> >>> > On Fri, Apr 29, 2016 at 9:15 AM, Pat Ferrel <[email protected]> wrote:
> >>> >
> >>> >> It's certainly easy to put this in the driver, taking it out of the algo.
> >>> >>
> >>> >> Dmitriy, is it a candidate for an Option param to the algo? That would
> >>> >> catch cases where people rely on it now (like my old DStream example) but
> >>> >> easily allow it to be overridden to None to imitate pre-0.11, or passed in
> >>> >> when the app knows better.
> >>> >>
> >>> >> Nikaash, are you in a position to comment out the .par(auto=true) and see
> >>> >> if it makes a difference?
> >>> >>
> >>> >>
> >>> >> On Apr 29, 2016, at 8:53 AM, Dmitriy Lyubimov <[email protected]> wrote:
> >>> >>
> >>> >> can you please look into the Spark UI and write down how many splits the
> >>> >> job generates in the first stage of the pipeline, or anywhere else there's
> >>> >> significant variation in # of splits in both cases?
> >>> >>
> >>> >> the row similarity is a very short pipeline (in comparison with what
> >>> >> pipelines normally are on average), so only the first input re-splitting is
> >>> >> critical.
> >>> >>
> >>> >> The splitting along the products is adjusted by the optimizer automatically
> >>> >> to match the amount of data segments observed on average in the input(s).
> >>> >> E.g. if you compute val C = A %*% B and A has 500 elements per split and
> >>> >> B has 5000 elements per split, then C would approximately have 5000
> >>> >> elements per split (the larger average in binary operator cases). That's
> >>> >> approximately how it works.
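A minimal sketch of the split-sizing rule just described (toy matrices, helper names and signatures as I recall them; real inputs would be DFS-backed and far larger):

    import org.apache.mahout.math.scalabindings._
    import org.apache.mahout.math.drm._
    import org.apache.mahout.math.drm.RLikeDrmOps._
    import org.apache.mahout.sparkbindings._

    implicit val ctx = mahoutSparkContext(masterUrl = "local[4]", appName = "split-sizing-sketch")

    // A is parallelized with few, large splits; B with many, small splits.
    val A = drmParallelize(dense((1, 0, 1), (0, 1, 0), (1, 1, 0), (0, 0, 1)), numPartitions = 1)
    val B = drmParallelize(dense((1, 0), (0, 1), (1, 1)), numPartitions = 3)

    // Per the explanation above, the optimizer sizes C's splits from the larger
    // average split of its operands, so the knob to turn is the parallelism of the
    // inputs, not of C itself.
    val C = (A %*% B).checkpoint()

    // Compare the task counts of the resulting stages in the Spark UI, as suggested above.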
> >>> >>
> >>> >> However, the par() that has been added is messing with the initial
> >>> >> parallelism, which would naturally affect the rest of the pipeline per the
> >>> >> above. I now doubt it was a good thing -- when I suggested Pat try this, I
> >>> >> did not mean to put it _inside_ the algorithm itself, but rather into the
> >>> >> accurate input-preparation code in his particular case. However, I don't
> >>> >> think it will work in any given case. Actually, sweet-spot parallelism for
> >>> >> multiplication unfortunately depends on tons of factors -- network bandwidth
> >>> >> and hardware configuration -- so it is difficult to give it a good guess
> >>> >> universally. More likely, for CLI-based prepackaged algorithms (I don't use
> >>> >> the CLI but rather assemble pipelines in Scala via scripting and Scala
> >>> >> application code) the initial parallelization adjustment options should
> >>> >> probably be provided to the CLI.
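Along the lines of that last point, here is a hedged sketch of what the adjustment looks like when it is done in driver-side input-preparation code rather than inside the algorithm. The par() operator and its auto/exact options are the ones mentioned in this thread; the split count is a made-up placeholder to be tuned per cluster, and helper names and signatures are as I recall them.

    import org.apache.mahout.math.scalabindings._
    import org.apache.mahout.math.drm._
    import org.apache.mahout.math.drm.RLikeDrmOps._
    import org.apache.mahout.sparkbindings._

    implicit val ctx = mahoutSparkContext(masterUrl = "local[4]", appName = "par-in-driver-sketch")

    // Stand-ins for the real user x item and user x attribute-value inputs.
    val A = drmParallelize(dense((1, 0, 1), (0, 1, 0), (1, 1, 0), (0, 0, 1)))
    val B = drmParallelize(dense((1, 0), (0, 1), (1, 0), (0, 1)))

    // Hypothetical value; as noted above there is no universal sweet spot.
    val inputSplits = 400

    // Re-split the inputs where the application knows the data and the cluster,
    // instead of relying on a .par(auto = true) buried inside the algorithm. For the
    // cooccurrence case discussed in this thread, the same call would go on each
    // IndexedDataset's underlying DRM before it is handed to the algorithm.
    val Ap = A.par(exact = inputSplits).checkpoint()
    val Bp = B.par(exact = inputSplits).checkpoint()

    // Downstream products now inherit their split sizing from the adjusted inputs.
    val AtB = (Ap.t %*% Bp).checkpoint()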
