Hi, Sean. I've added a comment in the new class to suggest a look at Hyperopt etc if the user is using Python.
Anyway I've created a pull request: https://github.com/apache/spark/pull/31535 and all tests, style checks etc pass.

Wish me luck :)

And thanks for the support :)

Phillip

On Mon, Feb 8, 2021 at 4:12 PM Sean Owen <sro...@gmail.com> wrote:

> It seems pretty reasonable to me. If it's a pull request we can code
> review it.
> My only question is just, would it be better to tell people to use
> hyperopt, and how much better is this than implementing randomization on
> the grid.
> But the API change isn't significant so maybe just fine.
>
> On Mon, Feb 8, 2021 at 3:49 AM Phillip Henry <londonjava...@gmail.com>
> wrote:
>
>> Hi, Sean.
>>
>> I don't think sampling from a grid is a good idea as the min/max may lie
>> between grid points. Unconstrained random sampling avoids this problem. To
>> this end, I have an implementation at:
>>
>> https://github.com/apache/spark/compare/master...PhillHenry:master
>>
>> It is unit tested and does not change any already existing code.
>>
>> Totally get what you mean about Hyperopt but this is a pure JVM solution
>> that's fairly straightforward.
>>
>> Is it worth contributing?
>>
>> Thanks,
>>
>> Phillip
>>
>> On Sat, Jan 30, 2021 at 2:00 PM Sean Owen <sro...@gmail.com> wrote:
>>
>>> I was thinking ParamGridBuilder would have to change to accommodate a
>>> continuous range of values, and that's not hard, though other code wouldn't
>>> understand that type of value, like the existing simple grid builder.
>>> It's all possible just wondering if simply randomly sampling the grid is
>>> enough. That would be a simpler change, just a new method or argument.
>>>
>>> Yes part of it is that if you really want to search continuous spaces,
>>> hyperopt is probably even better, so how much do you want to put into
>>> Pyspark - something really simple sure.
>>> Not out of the question to do something more complex if it turns out to
>>> also be pretty simple.
>>>
>>> On Sat, Jan 30, 2021 at 4:42 AM Phillip Henry <londonjava...@gmail.com>
>>> wrote:
>>>
>>>> Hi, Sean.
>>>>
>>>> Perhaps I don't understand. As I see it, ParamGridBuilder builds an
>>>> Array[ParamMap]. What I am proposing is a new class that also builds an
>>>> Array[ParamMap] via its build() method, so there would be no "change in the
>>>> APIs". This new class would, of course, have methods that defined the
>>>> search space (log, linear, etc) over which random values were chosen.
>>>>
>>>> Now, if this is too trivial to warrant the work and people prefer
>>>> Hyperopt, then so be it. It might be useful for people not using Python but
>>>> they can just roll-their-own, I guess.
>>>>
>>>> Anyway, looking forward to hearing what you think.
>>>>
>>>> Regards,
>>>>
>>>> Phillip
>>>>
>>>> On Fri, Jan 29, 2021 at 4:18 PM Sean Owen <sro...@gmail.com> wrote:
>>>>
>>>>> I think that's a bit orthogonal - right now you can't specify
>>>>> continuous spaces. The straightforward thing is to allow random sampling
>>>>> from a big grid. You can create a geometric series of values to try, of
>>>>> course - 0.001, 0.01, 0.1, etc.
>>>>> Yes I get that if you're randomly choosing, you can randomly choose
>>>>> from a continuous space of many kinds. I don't know if it helps a lot vs
>>>>> the change in APIs (and continuous spaces don't make as much sense for
>>>>> grid search)
>>>>> Of course it helps a lot if you're doing a smarter search over the
>>>>> space, like what hyperopt does. For that, I mean, one can just use
>>>>> hyperopt + Spark ML already if desired.
>>>>>
>>>>> On Fri, Jan 29, 2021 at 9:01 AM Phillip Henry <londonjava...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Thanks, Sean! I hope to offer a PR next week.
>>>>>>
>>>>>> Not sure about a dependency on the grid search, though - but happy to
>>>>>> hear your thoughts. I mean, you might want to explore logarithmic space
>>>>>> evenly.
>>>>>> For example, something like "please search 1e-7 to 1e-4" leads to
>>>>>> a reasonably random sample being {3e-7, 2e-6, 9e-5}. These are (roughly)
>>>>>> evenly spaced in logarithmic space but not in linear space. So, saying
>>>>>> what fraction of a grid search to sample wouldn't make sense (unless the
>>>>>> grid was warped, of course).
>>>>>>
>>>>>> Does that make sense? It might be better for me to just write the
>>>>>> code as I don't think it would be very complicated.
>>>>>>
>>>>>> Happy to hear your thoughts.
>>>>>>
>>>>>> Phillip
>>>>>>
>>>>>> On Fri, Jan 29, 2021 at 1:47 PM Sean Owen <sro...@gmail.com> wrote:
>>>>>>
>>>>>>> I don't know of anyone working on that. Yes I think it could be
>>>>>>> useful. I think it might be easiest to implement by simply having some
>>>>>>> parameter to the grid search process that says what fraction of all
>>>>>>> possible combinations you want to randomly test.
>>>>>>>
>>>>>>> On Fri, Jan 29, 2021 at 5:52 AM Phillip Henry <londonjava...@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> I have no work at the moment so I was wondering if anybody would be
>>>>>>>> interested in me contributing code that generates an Array[ParamMap]
>>>>>>>> for random hyperparameters?
>>>>>>>>
>>>>>>>> Apparently, this technique can find a hyperparameter in the top 5%
>>>>>>>> of parameter space in fewer than 60 iterations with 95% confidence [1].
>>>>>>>>
>>>>>>>> I notice that the Spark code base has only the brute force
>>>>>>>> ParamGridBuilder unless I am missing something.
>>>>>>>>
>>>>>>>> Hyperparameter optimization is an area of interest to me but I
>>>>>>>> don't want to re-invent the wheel.
>>>>>>>> So, if this work is already underway or there are libraries out
>>>>>>>> there to do it please let me know and I'll shut up :)
>>>>>>>>
>>>>>>>> Regards,
>>>>>>>>
>>>>>>>> Phillip
>>>>>>>>
>>>>>>>> [1]
>>>>>>>> https://www.oreilly.com/library/view/evaluating-machine-learning/9781492048756/ch04.html
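
The two numerical points raised in the thread (the "fewer than 60 iterations" figure and sampling evenly in logarithmic space) can be sketched in a few lines of Python. This is only an illustration and is not the code in the pull request; the names `trials_needed` and `sample_log_uniform` are hypothetical. It assumes each trial is an independent uniform draw from the search space, so the chance that at least one of n trials lands in the top 5% is 1 - 0.95^n.

```python
import math
import random
from typing import List


def trials_needed(top_fraction: float = 0.05, confidence: float = 0.95) -> int:
    """Smallest n such that 1 - (1 - top_fraction)**n >= confidence.

    Assumes independent, uniformly random trials (an assumption of this sketch).
    """
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - top_fraction))


def sample_log_uniform(low: float, high: float, n: int, rng: random.Random) -> List[float]:
    """Draw n values whose base-10 logarithm is uniform between log10(low) and log10(high)."""
    lo, hi = math.log10(low), math.log10(high)
    return [10.0 ** rng.uniform(lo, hi) for _ in range(n)]


if __name__ == "__main__":
    # Number of random trials for a 95% chance of hitting the top 5%.
    print(trials_needed())  # 59

    # Three candidate values for the "please search 1e-7 to 1e-4" example.
    rng = random.Random(42)
    print(sorted(sample_log_uniform(1e-7, 1e-4, 3, rng)))
```

Under these assumptions 59 trials suffice, which is consistent with the "fewer than 60 iterations" figure quoted from [1]. Sampling the exponent rather than the value is what makes the draws roughly evenly spaced in logarithmic space but not in linear space.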