Hi, Sean.

I've added a comment in the new class suggesting that Python users take a
look at Hyperopt and similar libraries.

Anyway I've created a pull request:

https://github.com/apache/spark/pull/31535

and all tests, style checks etc pass. Wish me luck :)

And thanks for the support :)

Phillip



On Mon, Feb 8, 2021 at 4:12 PM Sean Owen <sro...@gmail.com> wrote:

> It seems pretty reasonable to me. If it's a pull request we can code
> review it.
> My only question is whether it would be better to tell people to use
> hyperopt, and how much better this is than implementing randomization on
> the grid.
> But the API change isn't significant, so it may be just fine.
>
> On Mon, Feb 8, 2021 at 3:49 AM Phillip Henry <londonjava...@gmail.com>
> wrote:
>
>> Hi, Sean.
>>
>> I don't think sampling from a grid is a good idea as the min/max may lie
>> between grid points. Unconstrained random sampling avoids this problem. To
>> this end, I have an implementation at:
>>
>> https://github.com/apache/spark/compare/master...PhillHenry:master
>>
>> It is unit tested and does not change any already existing code.
>>
>> Totally get what you mean about Hyperopt but this is a pure JVM solution
>> that's fairly straightforward.
>>
>> Is it worth contributing?
>>
>> Thanks,
>>
>> Phillip
>>
>>
>>
>>
>>
>> On Sat, Jan 30, 2021 at 2:00 PM Sean Owen <sro...@gmail.com> wrote:
>>
>>> I was thinking ParamGridBuilder would have to change to accommodate a
>>> continuous range of values, and that's not hard, though other code wouldn't
>>> understand that type of value, like the existing simple grid builder.
>>> It's all possible; I'm just wondering if simply randomly sampling the
>>> grid is enough. That would be a simpler change, just a new method or
>>> argument.
>>>
>>> Yes, part of it is that if you really want to search continuous spaces,
>>> hyperopt is probably even better, so how much do you want to put into
>>> PySpark? Something really simple, sure.
>>> It's not out of the question to do something more complex if it turns
>>> out to also be pretty simple.
>>>
>>> On Sat, Jan 30, 2021 at 4:42 AM Phillip Henry <londonjava...@gmail.com>
>>> wrote:
>>>
>>>> Hi, Sean.
>>>>
>>>> Perhaps I don't understand. As I see it, ParamGridBuilder builds an
>>>> Array[ParamMap]. What I am proposing is a new class that also builds an
>>>> Array[ParamMap] via its build() method, so there would be no "change in the
>>>> APIs". This new class would, of course, have methods that defined the
>>>> search space (log, linear, etc) over which random values were chosen.
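To make the shape of that idea concrete, here is a minimal sketch in Python
rather than Scala. Everything in it is an assumption made for illustration:
the class name RandomParamBuilder, the methods add_linear and add_log10, and
the two parameter names are hypothetical, and a plain dict stands in for a
ParamMap. It is not the code proposed in this thread and not a Spark API.

```python
import math
import random


class RandomParamBuilder:
    """Hypothetical stand-in (not Spark's API): collects one sampler per
    parameter and builds a list of randomly drawn parameter maps."""

    def __init__(self, seed=None):
        self._rng = random.Random(seed)
        self._samplers = {}

    def add_linear(self, name, low, high):
        # Values are uniform on [low, high].
        self._samplers[name] = lambda: self._rng.uniform(low, high)
        return self

    def add_log10(self, name, low, high):
        # Exponents are uniform, so values are evenly spread across decades.
        lo_exp, hi_exp = math.log10(low), math.log10(high)
        self._samplers[name] = lambda: 10 ** self._rng.uniform(lo_exp, hi_exp)
        return self

    def build(self, n):
        # Same output shape as a grid builder: a flat list of parameter maps.
        return [
            {name: draw() for name, draw in self._samplers.items()}
            for _ in range(n)
        ]


param_maps = (
    RandomParamBuilder(seed=1)
    .add_log10("regParam", 1e-7, 1e-4)        # illustrative parameter name
    .add_linear("elasticNetParam", 0.0, 1.0)  # illustrative parameter name
    .build(5)
)
for param_map in param_maps:
    print(param_map)
```

Because build() returns the same kind of collection a grid builder would,
whatever consumes the grid would not need to change in this sketch.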
>>>>
>>>> Now, if this is too trivial to warrant the work and people prefer
>>>> Hyperopt, then so be it. It might be useful for people not using Python,
>>>> but they can just roll their own, I guess.
>>>>
>>>> Anyway, looking forward to hearing what you think.
>>>>
>>>> Regards,
>>>>
>>>> Phillip
>>>>
>>>>
>>>>
>>>> On Fri, Jan 29, 2021 at 4:18 PM Sean Owen <sro...@gmail.com> wrote:
>>>>
>>>>> I think that's a bit orthogonal - right now you can't specify
>>>>> continuous spaces. The straightforward thing is to allow random sampling
>>>>> from a big grid. You can create a geometric series of values to try, of
>>>>> course - 0.001, 0.01, 0.1, etc.
>>>>> Yes I get that if you're randomly choosing, you can randomly choose
>>>>> from a continuous space of many kinds. I don't know if it helps a lot vs
>>>>> the change in APIs (and continuous spaces don't make as much sense for
>>>>> grid search).
>>>>> Of course it helps a lot if you're doing a smarter search over the
>>>>> space, like what hyperopt does. For that, I mean, one can just use
>>>>> hyperopt + Spark ML already if desired.
>>>>>
>>>>> On Fri, Jan 29, 2021 at 9:01 AM Phillip Henry <londonjava...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Thanks, Sean! I hope to offer a PR next week.
>>>>>>
>>>>>> Not sure about a dependency on the grid search, though - but happy to
>>>>>> hear your thoughts. I mean, you might want to explore logarithmic space
>>>>>> evenly. For example, something like "please search 1e-7 to 1e-4" leads
>>>>>> to a reasonably random sample being {3e-7, 2e-6, 9e-5}. These are
>>>>>> (roughly) evenly spaced in logarithmic space but not in linear space.
>>>>>> So, saying what fraction of a grid search to sample wouldn't make sense
>>>>>> (unless the grid was warped, of course).
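The difference between the two kinds of spacing can be checked numerically.
This is a minimal sketch in Python, assuming "evenly in logarithmic space"
means the base-10 exponent is drawn uniformly. The function names are made up
for illustration. It compares how often each sampler lands below 1e-5, that
is, in the lower two of the three decades between 1e-7 and 1e-4.

```python
import math
import random


def sample_log_uniform(low, high, n, rng):
    """Draw n values whose base-10 exponents are uniform on the range."""
    lo_exp, hi_exp = math.log10(low), math.log10(high)
    return [10 ** rng.uniform(lo_exp, hi_exp) for _ in range(n)]


def sample_linear_uniform(low, high, n, rng):
    """Draw n values that are uniform on [low, high] in linear space."""
    return [rng.uniform(low, high) for _ in range(n)]


def fraction_below(values, threshold):
    return sum(v < threshold for v in values) / len(values)


rng = random.Random(0)
log_samples = sample_log_uniform(1e-7, 1e-4, 3000, rng)
lin_samples = sample_linear_uniform(1e-7, 1e-4, 3000, rng)

# Log-uniform draws should put about two thirds of the samples below 1e-5;
# linear-uniform draws should put only about a tenth there.
print("log-uniform below 1e-5:   ", fraction_below(log_samples, 1e-5))
print("linear-uniform below 1e-5:", fraction_below(lin_samples, 1e-5))
```

A linear sampler spends roughly 90% of its draws in the top decade alone,
which is why the search space needs to be stated alongside the range.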
>>>>>>
>>>>>> Does that make sense? It might be better for me to just write the
>>>>>> code as I don't think it would be very complicated.
>>>>>>
>>>>>> Happy to hear your thoughts.
>>>>>>
>>>>>> Phillip
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Fri, Jan 29, 2021 at 1:47 PM Sean Owen <sro...@gmail.com> wrote:
>>>>>>
>>>>>>> I don't know of anyone working on that. Yes I think it could be
>>>>>>> useful. I think it might be easiest to implement by simply having some
>>>>>>> parameter to the grid search process that says what fraction of all
>>>>>>> possible combinations you want to randomly test.
>>>>>>>
>>>>>>> On Fri, Jan 29, 2021 at 5:52 AM Phillip Henry <
>>>>>>> londonjava...@gmail.com> wrote:
>>>>>>>
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> I have no work at the moment so I was wondering if anybody would be
>>>>>>>> interested in me contributing code that generates an
>>>>>>>> Array[ParamMap] for random hyperparameters?
>>>>>>>>
>>>>>>>> Apparently, this technique can find a hyperparameter in the top 5%
>>>>>>>> of parameter space in fewer than 60 iterations with 95% confidence [1].
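The arithmetic usually given for this figure can be reproduced in a few
lines. This is a sketch that assumes each trial is an independent, uniformly
random draw from the parameter space, so a single trial misses the top 5%
with probability 0.95 and n trials all miss with probability 0.95^n. The
function name is made up for illustration.

```python
import math


def trials_needed(top_fraction, confidence):
    """Smallest n such that 1 - (1 - top_fraction)**n >= confidence."""
    if not (0 < top_fraction < 1 and 0 < confidence < 1):
        raise ValueError("both arguments must lie strictly between 0 and 1")
    return math.ceil(math.log(1 - confidence) / math.log(1 - top_fraction))


n = trials_needed(0.05, 0.95)
# Probability that at least one of n draws lands in the top 5%.
print(n, 1 - 0.95 ** n)
```

Under that assumption the answer is 59 trials, consistent with "fewer than
60". It says nothing about how much better the top 5% is than the rest.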
>>>>>>>>
>>>>>>>> I notice that the Spark code base has only the brute force
>>>>>>>> ParamGridBuilder unless I am missing something.
>>>>>>>>
>>>>>>>> Hyperparameter optimization is an area of interest to me but I
>>>>>>>> don't want to re-invent the wheel. So, if this work is already
>>>>>>>> underway or there are libraries out there to do it please let me
>>>>>>>> know and I'll shut up :)
>>>>>>>>
>>>>>>>> Regards,
>>>>>>>>
>>>>>>>> Phillip
>>>>>>>>
>>>>>>>> [1]
>>>>>>>> https://www.oreilly.com/library/view/evaluating-machine-learning/9781492048756/ch04.html
>>>>>>>>
>>>>>>>
