Re: spark-itemsimilarity runs orders of times slower from Mahout 0.11 onwards

Dmitriy Lyubimov Fri, 29 Apr 2016 11:26:25 -0700

Nikaash,

yes, unfortunately you may need to tune parallelism manually for your
particular load/cluster to get the best out of it. I guess Pat will be adding
the option.
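
As a rough sketch of what such an option could look like (do nothing, try
auto, or use an exact number of splits): the names ParHint, Auto, Exact,
Parallelism and targetSplits below are hypothetical and not part of Mahout,
and autoGuess only stands in for whatever heuristic an "auto" setting would
apply.

```scala
// Hypothetical model of the parallelism choices discussed in this thread.
// None of these names come from Mahout itself.
sealed trait ParHint
case object Auto extends ParHint                     // let a heuristic pick the split count
final case class Exact(splits: Int) extends ParHint  // caller supplies the split count

object Parallelism {
  // Resolve an optional hint to a target split count.
  // None means "do nothing": keep the splits the input already has.
  // `autoGuess` is a placeholder for the result of an automatic heuristic.
  def targetSplits(hint: Option[ParHint], currentSplits: Int, autoGuess: Int): Int =
    hint match {
      case None        => currentSplits
      case Some(Auto)  => autoGuess
      case Some(Exact(n)) =>
        require(n > 0, "split count must be positive")
        n
    }
}
```

With this shape, passing None leaves the input splitting untouched, while
Some(Exact(200)) lets an application that knows its cluster pin the number
of splits.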


On Fri, Apr 29, 2016 at 11:14 AM, Nikaash Puri <[email protected]>
wrote:

> Hi,
>
> Sure, I’ll do some more detailed analysis of the jobs on the UI and share
> screenshots if possible.
>
> Pat, yup, I’ll only be able to get to this on Monday, though. I’ll comment
> out the line and see the difference in performance.
>
> Thanks so much for helping guys, I really appreciate it.
>
> Also, the algorithm implementation for LLR is extremely performant, at
> least as of Mahout 0.10. I ran some tests for around 61 days of data (which
> in our case is a fair amount) and the model was built in about 20 minutes,
> which is pretty amazing. This was using a pretty decent sized cluster,
> though.
>
> Thank you,
> Nikaash Puri
>
> On 29-Apr-2016, at 10:18 PM, Pat Ferrel <[email protected]> wrote:
>
> There are some other changes I want to make for the next rev so I’ll do
> that.
>
> Nikaash, it would still be nice to verify that this fixes your problem.
> Also, if you create a Jira, it will guarantee I don’t forget.
>
>
> On Apr 29, 2016, at 9:23 AM, Dmitriy Lyubimov <[email protected]> wrote:
>
> yes -- i would make it an optional parameter, just like par does: do
> nothing, try auto, or try an exact number of splits
>
> On Fri, Apr 29, 2016 at 9:15 AM, Pat Ferrel <[email protected]> wrote:
>
>> It’s certainly easy to put this in the driver, taking it out of the algo.
>>
>> Dmitriy, is it a candidate for an Option param to the algo? That would
>> catch cases where people rely on it now (like my old DStream example) but
>> easily allow it to be overridden to None to imitate pre 0.11, or passed in
>> when the app knows better.
>>
>> Nikaash, are you in a position to comment out the .par(auto=true) and see
>> if it makes a difference?
>>
>>
>> On Apr 29, 2016, at 8:53 AM, Dmitriy Lyubimov <[email protected]> wrote:
>>
>> can you please look into the spark UI and write down how many splits the
>> job generates in the first stage of the pipeline, or anywhere else there's
>> significant variation in # of splits, in both cases?
>>
>> the row similarity is a very short pipeline (compared with what would be
>> typical), so only the first input re-splitting is critical.
>>
>> The splitting along the products is adjusted by the optimizer
>> automatically to match the average amount of data per split observed in
>> the input(s). E.g., if you compute val C = A %*% B and A has 500 elements
>> per split and B has 5000 elements per split, then C would have
>> approximately 5000 elements per split (the larger average in binary
>> operator cases). That's approximately how it works.
>>
>> However, the par() that has been added is messing with the initial
>> parallelism, which naturally affects the rest of the pipeline per the
>> above. I now doubt it was a good thing -- when I suggested that Pat try
>> this, I did not mean to put it _inside_ the algorithm itself, but rather
>> into the accurate input preparation code in his particular case. However,
>> I don't think it will work in every case. The sweet-spot parallelism for
>> multiplication unfortunately depends on tons of factors -- network
>> bandwidth and hardware configuration among them -- so it is difficult to
>> give a good universal guess. More likely, for CLI-based prepackaged
>> algorithms (I don't use the CLI but rather assemble pipelines in Scala via
>> scripting and Scala application code), the initial parallelization
>> adjustment options should be provided to the CLI.
>>
>>
>
>
>
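
The per-split sizing heuristic described in the quoted message (a product
takes the larger average elements-per-split of its inputs) can be modelled
in a few lines of plain Scala. This is only a sketch of that description,
not Mahout's optimizer code: SplitEstimate, productElementsPerSplit and
splitsFor are hypothetical names, and deriving a split count by ceiling
division is an added assumption for illustration.

```scala
// Rough model of the heuristic described in the thread; not Mahout code.
object SplitEstimate {
  // For a binary operator such as A %*% B, assume the result inherits the
  // larger of the two inputs' average elements-per-split.
  def productElementsPerSplit(aPerSplit: Long, bPerSplit: Long): Long =
    math.max(aPerSplit, bPerSplit)

  // Illustrative assumption: number of splits = ceil(total / perSplit).
  def splitsFor(totalElements: Long, elementsPerSplit: Long): Long = {
    require(totalElements >= 0, "total must be non-negative")
    require(elementsPerSplit > 0, "elements per split must be positive")
    (totalElements + elementsPerSplit - 1) / elementsPerSplit
  }
}
```

With the numbers from the message, productElementsPerSplit(500L, 5000L)
gives 5000. For a made-up total of 1,000,000 elements, splitsFor would then
yield 200 splits at 5000 per split versus 2000 splits at 500 per split,
which is why changing the initial parallelism carries through the rest of
such a short pipeline.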
