Can you file an issue, if nothing else, to track the discussion?
On Sat, Sep 10, 2016 at 18:13 Oscar Boykin <[email protected]> wrote:

> Wait, sorry. Looking more carefully. Is the bug that originally all data
> with id 0 was in one reducer but with sorting it winds up on both? That
> would be a bug. What version of scalding is this?
>
> Can you replicate this bug in a minimal case? Sorting should not change
> how the keys are partitioned to reducers (which is done by hashCode of the
> key, which is the same, I suppose).
>
> Basically the test you want to write is that after groupBy with sortBy if
> you take only the keys in the output each key appears exactly once.
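
The check described above can be modelled without a cluster. What follows is a minimal plain-Scala sketch, not Scalding's actual implementation: the `Record` shape, `reducerFor`, `run`, and `keysAppearOnce` are hypothetical names, and the only assumption carried over from the thread is that the reducer is chosen from the hashCode of the group key alone.

```scala
// Plain-Scala model of the invariant; this is NOT how Scalding is implemented.
object GroupSortInvariant {
  // Hypothetical record shape, loosely mirroring (id1, id2, payload) in the job.
  type Record = (Int, Float, String)

  // Assumption: the reducer depends only on the group key's hashCode,
  // so the secondary sort key plays no part in partitioning.
  def reducerFor(key: Int, numReducers: Int): Int =
    Math.floorMod(key.hashCode, numReducers)

  // Model of groupBy(id1).sortBy(id2).mapGroup: one output row per
  // (reducer, key), with that group's values sorted by id2.
  def run(records: Seq[Record], numReducers: Int): Seq[(Int, Int, Seq[Record])] =
    records
      .groupBy(r => reducerFor(r._1, numReducers))
      .toSeq
      .flatMap { case (reducer, onReducer) =>
        onReducer.groupBy(_._1).toSeq.map { case (key, group) =>
          val sorted = group.sortWith((a, b) => java.lang.Float.compare(a._2, b._2) < 0)
          (reducer, key, sorted)
        }
      }

  // The check: taking only the keys of the output, each appears exactly once.
  def keysAppearOnce(output: Seq[(Int, Int, Seq[Record])]): Boolean = {
    val keys = output.map(_._2)
    keys.distinct.size == keys.size
  }
}
```

In the real job, the equivalent check would take the keys emitted by mapGroup and assert that none repeats. If a key shows up twice there, the grouping was split, which is the behaviour reported in Case 2.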
>
> I have a hard time believing there could have been a bug like this that we
> didn't notice for 5 years but I guess it is possible.
> On Sat, Sep 10, 2016 at 17:33 ravi kiran holur vijay <
> [email protected]> wrote:
>
>> Hey Oscar,
>>
>> Sorry, sounds like I might have misunderstood the semantics of groupBy
>> followed by sortBy.
>> Is there a way to make sure ALL records having the same key end up at the
>> same reducer (what groupBy does) and within each reducer, have it sorted by
>> value (what sortBy does)?
>>
>> -Ravi
>>
>> On Sat, Sep 10, 2016 at 8:06 PM, Oscar Boykin <[email protected]> wrote:
>>
>>> Sorry, I don't follow. What did you expect? I don't see a bug.
>>>
>>> The data looks sorted within groups, which is all sortBy does.
>>>
>>> Note, you don't need forceToReducers here. Sorting can only be done on
>>> the reducers.
>>> On Sat, Sep 10, 2016 at 15:12 ravi kiran holur vijay <
>>> [email protected]> wrote:
>>>
>>>> Hello,
>>>>
>>>> I am noticing strange behaviour in my Scalding job, which uses groupBy
>>>> and sortBy. Without .sortBy, each reducer gets all of the values for a
>>>> given group key. With .sortBy, however, each reducer gets only part of
>>>> the values for the same group key. I was wondering if any of you have run
>>>> into a similar issue before or have a hypothesis about what's happening?
>>>>
>>>> Case 1: Observed behaviour = Expected behaviour, without using sortBy
>>>>
>>>> *Reducer 1 output*:
>>>>
>>>> Processing data for group ... 1
>>>> Initializing FM Model with existing parameters ...
>>>> Processing model param ... o
>>>> Processing model param ... w
>>>> Processing model param ... r
>>>> Processing model param ... s
>>>> Processing model param ... t
>>>> Processing model param ... l
>>>> Processing model param ... f
>>>> Processing model param ... v
>>>> Processing model param ... v
>>>> Processing model param ... v
>>>> Processing model param ... v
>>>> Processing model param ... v
>>>> Initialized FM Model with: w0=-0.181250, w=34531087, v=5,
>>>> reg0=0.000000, regw=0.000000, regv=0.000000, lr=0.010000,
>>>> statsFreq=1000000, merged models=1
>>>>
>>>> *Reducer 2 output*:
>>>> Processing data for group ... 0
>>>> Initializing FM Model with existing parameters ...
>>>> Processing model param ... o
>>>> Processing model param ... w
>>>> Processing model param ... r
>>>> Processing model param ... s
>>>> Processing model param ... t
>>>> Processing model param ... l
>>>> Processing model param ... f
>>>> Processing model param ... v
>>>> Processing model param ... v
>>>> Processing model param ... v
>>>> Processing model param ... v
>>>> Processing model param ... v
>>>> Initialized FM Model with: w0=-0.181250, w=34531087, v=5,
>>>> reg0=0.000000, regw=0.000000, regv=0.000000, lr=0.010000,
>>>> statsFreq=1000000, merged models=1
>>>>
>>>> Case 2: Observed behaviour != Expected behaviour, after using sortBy
>>>> *Reducer 1 output*
>>>> Processing data for group ... 0
>>>> Initializing FM Model with existing parameters ...
>>>> Processing model param ... v
>>>> Processing model param ... v
>>>> Processing model param ... v
>>>> Processing model param ... v
>>>> Processing model param ... v
>>>> Initialized FM Model with: w0=0.000000, w=0, v=5, reg0=0.000000,
>>>> regw=0.000000, regv=0.000000, lr=0.000000, statsFreq=0, merged models=0
>>>> Processing data for group ... 1
>>>> Initializing FM Model with existing parameters ...
>>>> Processing model param ... f
>>>> Processing model param ... l
>>>> Processing model param ... o
>>>> Processing model param ... r
>>>> Processing model param ... s
>>>> Processing model param ... t
>>>> Processing model param ... w
>>>> Initialized FM Model with: w0=-0.181250, w=34531087, v=0,
>>>> reg0=0.000000, regw=0.000000, regv=0.000000, lr=0.010000,
>>>> statsFreq=1000000, merged models=1
>>>>
>>>> *Reducer 2 output*
>>>> Processing data for group ... 0
>>>> Initializing FM Model with existing parameters ...
>>>> Processing model param ... f
>>>> Processing model param ... l
>>>> Processing model param ... o
>>>> Processing model param ... v
>>>> Processing model param ... r
>>>> Processing model param ... s
>>>> Processing model param ... t
>>>> Processing model param ... w
>>>> Initialized FM Model with: w0=-0.181250, w=34531087, v=1,
>>>> reg0=0.000000, regw=0.000000, regv=0.000000, lr=0.010000,
>>>> statsFreq=1000000, merged models=1
>>>> Processing data for group ... 1
>>>> Initializing FM Model with existing parameters ...
>>>> Processing model param ... v
>>>> Processing model param ... v
>>>> Processing model param ... v
>>>> Processing model param ... v
>>>> Initialized FM Model with: w0=0.000000, w=0, v=4, reg0=0.000000,
>>>> regw=0.000000, regv=0.000000, lr=0.000000, statsFreq=0, merged models=0
>>>>
>>>> *Code*
>>>>
>>>> val data: TypedPipe[(Int, Float, Either[FeatureVector, FMModelParameter])] = modelData
>>>> val fmModels: SortedGrouped[Int, FMModel] = data
>>>>   .groupBy { case (id1, id2, modelParam) => id1 }
>>>>   .sortBy { case (id1, id2, modelParam) => id2 }
>>>>   .forceToReducers
>>>>   //Secondary is needed to ensure model parameters appear before actual training data
>>>>   //TODO: This sortby is causing problems and has a bug
>>>>   .mapGroup {
>>>>   case (groupId, records) =>
>>>>     println("Processing data for group ... " + groupId)
>>>>     val trainedModel = aggregateAndUpdateModel(records)
>>>>     Iterator(trainedModel)
>>>> }
>>>>
>>>> --
>>>> You received this message because you are subscribed to the Google
>>>> Groups "Scalding Development" group.
>>>> To unsubscribe from this group and stop receiving emails from it, send
>>>> an email to [email protected].
>>>> For more options, visit https://groups.google.com/d/optout.
>>>>
>>>
>>
