graph = graft, sorry. Graft just the AtB class into the 0.12 codebase.
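(For reference, a minimal timing harness for that experiment might look like
the sketch below. The input DRMs and the output path are placeholders, and it
assumes Mahout's Spark bindings DSL, so the same script can be run against
both the grafted and the stock A'B operator.)

    import org.apache.mahout.math.drm._
    import org.apache.mahout.math.drm.RLikeDrmOps._

    // drmA and drmB are placeholder inputs, loaded identically in both builds;
    // an implicit Spark DistributedContext is assumed to be in scope.
    val t0 = System.currentTimeMillis()
    val drmAtB = (drmA.t %*% drmB).checkpoint() // logical A'B; the optimizer picks the AtB operator
    drmAtB.dfsWrite("/tmp/atb-test")            // hypothetical path; forces full materialization
    println(s"A'B took ${System.currentTimeMillis() - t0} ms")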
On Mon, May 2, 2016 at 9:06 AM, Dmitriy Lyubimov <[email protected]> wrote:

> ok.
>
> Nikaash,
> could you perhaps do one more experiment and graph the 0.10 a'b code into
> 0.12 code (or whatever branch you say is not working the same) so we could
> confirm that the culprit change is indeed A'B?
>
> thank you very much.
>
> -d
>
> On Mon, May 2, 2016 at 3:35 AM, Nikaash Puri <[email protected]> wrote:
>
>> Hi,
>>
>> I tried commenting out those lines and it did marginally improve the
>> performance, although the 0.10 version still significantly outperforms it.
>>
>> Here is a screenshot of the saveAsTextFile job (attached as selection1).
>> The AtB step took about 34 minutes, which is significantly more than with
>> 0.10. The saveAsTextFile action takes about 9 minutes as well.
>>
>> The selection2 file is a screenshot of the flatMap at AtB.scala job,
>> which ran for 34 minutes.
>>
>> Also, I'm using multiple indicators. As of Mahout 0.10, the first AtB
>> would take time, while subsequent such operations for the other
>> indicators would be orders of magnitude faster. In the current job, the
>> subsequent AtB operations take about as long as the first one.
>>
>> A snapshot of my code is as follows:
>>
>> var existingRowIDs: Option[BiDictionary] = None
>>
>> // The first action named in the sequence is the "primary" action and
>> // begins to fill up the user dictionary
>> for (actionDescription <- actionInput) {
>>   // grab the path to actions
>>   val action: IndexedDataset = SparkEngine.indexedDatasetDFSReadElements(
>>     actionDescription._2,
>>     schema = DefaultIndexedDatasetElementReadSchema,
>>     existingRowIDs = existingRowIDs)
>>   existingRowIDs = Some(action.rowIDs)
>>
>>   ...
>> }
>>
>> which seems fairly standard, so I hope I'm not making a mistake here.
>>
>> It looks like the 0.11-onward version uses computeAtBZipped3 to perform
>> the multiplication in atb_nograph_mmul, unlike 0.10, which used
>> atb_nograph. Though I'm not really sure whether that makes much of a
>> difference.
>>
>> Thank you,
>> Nikaash Puri
>>
>> On Sat, Apr 30, 2016 at 12:36 AM Pat Ferrel <[email protected]> wrote:
>>
>>> Right, will do. But Nikaash, if you could just comment out those lines
>>> and see if it has an effect, it would be informative and perhaps even
>>> solve your problem sooner than my changes. No great rush. Playing around
>>> with different values, as Dmitriy says, might yield better results, and
>>> for that you can mess with the code or wait for my changes.
>>>
>>> Yeah, it’s fast enough in most cases. The main work is the optimized
>>> A’A, A’B stuff in the BLAS optimizer Dmitriy put in. It is something
>>> like 10x faster than a similar algo in Hadoop MR. This particular calc
>>> and generalization is not in any other Spark or, now, Flink lib that I
>>> know of.
>>>
>>>
>>> On Apr 29, 2016, at 11:24 AM, Dmitriy Lyubimov <[email protected]> wrote:
>>>
>>> Nikaash,
>>>
>>> yes, unfortunately you may need to play with parallelism manually for
>>> your particular load/cluster to get the best out of it. I guess Pat
>>> will be adding the option.
>>>
>>> On Fri, Apr 29, 2016 at 11:14 AM, Nikaash Puri <[email protected]> wrote:
>>>
>>> > Hi,
>>> >
>>> > Sure, I’ll do some more detailed analysis of the jobs on the UI and
>>> > share screenshots if possible.
>>> >
>>> > Pat, yup, I’ll only be able to get to this on Monday, though. I’ll
>>> > comment out the line and see the difference in performance.
>>> >
>>> > Thanks so much for helping, guys, I really appreciate it.
>>> >
>>> > Also, the algorithm implementation for LLR is extremely performant, at
>>> > least as of Mahout 0.10. I ran some tests on around 61 days of data
>>> > (which in our case is a fair amount) and the model was built in about
>>> > 20 minutes, which is pretty amazing. This was using a pretty
>>> > decent-sized cluster, though.
>>> >
>>> > Thank you,
>>> > Nikaash Puri
>>> >
>>> > On 29-Apr-2016, at 10:18 PM, Pat Ferrel <[email protected]> wrote:
>>> >
>>> > There are some other changes I want to make for the next rev, so I’ll
>>> > do that.
>>> >
>>> > Nikaash, it would still be nice to verify this fixes your problem.
>>> > Also, if you want to create a Jira, it will guarantee I don’t forget.
>>> >
>>> >
>>> > On Apr 29, 2016, at 9:23 AM, Dmitriy Lyubimov <[email protected]> wrote:
>>> >
>>> > yes -- I would do it as an optional option -- just like par does: do
>>> > nothing, try auto, or try an exact number of splits
>>> >
>>> > On Fri, Apr 29, 2016 at 9:15 AM, Pat Ferrel <[email protected]> wrote:
>>> >
>>> >> It’s certainly easy to put this in the driver, taking it out of the
>>> >> algo.
>>> >>
>>> >> Dmitriy, is it a candidate for an Option param to the algo? That
>>> >> would catch cases where people rely on it now (like my old DStream
>>> >> example) but easily allow it to be overridden to None to imitate
>>> >> pre-0.11, or passed in when the app knows better.
>>> >>
>>> >> Nikaash, are you in a position to comment out the .par(auto=true)
>>> >> and see if it makes a difference?
>>> >>
>>> >>
>>> >> On Apr 29, 2016, at 8:53 AM, Dmitriy Lyubimov <[email protected]> wrote:
>>> >>
>>> >> can you please look into the Spark UI and write down how many splits
>>> >> the job generates in the first stage of the pipeline, or anywhere
>>> >> else there's significant variation in # of splits between the two
>>> >> cases?
>>> >>
>>> >> the row similarity is a very short pipeline (in comparison with what
>>> >> would normally be the case on average), so only the first input
>>> >> re-splitting is critical.
>>> >>
>>> >> The splitting along the products is adjusted by the optimizer
>>> >> automatically to match the amount of data segments observed on
>>> >> average in the input(s). e.g. if you compute val C = A %*% B and A
>>> >> has 500 elements per split and B has 5000 elements per split, then C
>>> >> would have approximately 5000 elements per split (the larger average
>>> >> in binary operator cases). That's approximately how it works.
>>> >>
>>> >> However, the par() that has been added is messing with the initial
>>> >> parallelism, which naturally affects the rest of the pipeline per
>>> >> the above. I now doubt it was a good thing -- when I suggested Pat
>>> >> try this, I did not mean to put it _inside_ the algorithm itself,
>>> >> but rather in the input preparation code in his particular case.
>>> >> However, I don't think it will work in every case. Actually, the
>>> >> sweet-spot parallelism for multiplication unfortunately depends on
>>> >> tons of factors -- network bandwidth and hardware configuration --
>>> >> so it is difficult to give a good guess universally. More likely,
>>> >> for CLI-based prepackaged algorithms (I don't use the CLI but rather
>>> >> assemble pipelines in Scala via scripting and Scala application
>>> >> code), the initial parallelization adjustment options should
>>> >> probably be provided to the CLI.
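(To make the split-propagation point above concrete, here is a minimal sketch
of tuning parallelism at input-preparation time rather than inside the
algorithm. The matrix values and split counts are placeholders;
par(min/exact/auto) is the Mahout 0.11+ operator under discussion.)

    import org.apache.mahout.math.scalabindings._
    import org.apache.mahout.math.drm._
    import org.apache.mahout.math.drm.RLikeDrmOps._

    // A tiny in-core matrix stands in for the real input; an implicit
    // DistributedContext (Spark) is assumed to be in scope.
    val drmA = drmParallelize(dense((1, 2, 3), (3, 4, 5), (5, 6, 7)), numPartitions = 2)

    // Adjust the *input* splits once, outside the algorithm. Binary operators
    // downstream inherit roughly the larger per-split element count of their
    // operands, so this single adjustment shapes the whole short pipeline.
    val drmAPar = drmA.par(auto = true)   // or drmA.par(exact = 192), or no par() at all

    val drmAtA = drmAPar.t %*% drmAPar    // the downstream A'A inherits the splits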

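(And the optional-parameter shape Pat suggests might look roughly like the
following. withParallelism and parOpt are hypothetical names for illustration,
not the actual SimilarityAnalysis API.)

    import scala.reflect.ClassTag
    import org.apache.mahout.math.drm._

    // Hypothetical helper: None leaves the splits alone (imitating pre-0.11
    // behavior), Some(-1) asks Mahout to guess, Some(n) forces n splits.
    def withParallelism[K: ClassTag](drm: DrmLike[K], parOpt: Option[Int]): DrmLike[K] =
      parOpt match {
        case None     => drm                  // do nothing
        case Some(-1) => drm.par(auto = true) // try auto
        case Some(n)  => drm.par(exact = n)   // try an exact number of splits
      }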