Hey Do,

I think more sophisticated samplers would fit better in the ML library than in the core API, but I am not very familiar with the milestones there. Maybe the maintainers of the batch ML library could check whether sampling techniques would be useful there.
Paris

> On 11 Jul 2016, at 16:15, Le Quoc Do <lequo...@gmail.com> wrote:
>
> Hi all,
>
> Thank you all for your answers.
> By the way, I also noticed that Flink doesn't support a "stratified
> sampling" function (only simple random sampling) for DataSet.
> It would be nice if someone could create a Jira for it and assign the task
> to me so that I can work on it.
>
> Thank you,
> Do
>
> On Mon, Jul 11, 2016 at 11:44 AM, Vasiliki Kalavri <
> vasilikikala...@gmail.com> wrote:
>
>> Hi Do,
>>
>> Paris and Martha worked on sampling techniques for data streams on Flink
>> last year. If you want to implement your own samplers, you might find
>> Martha's master thesis helpful [1].
>>
>> -Vasia.
>>
>> [1]: http://kth.diva-portal.org/smash/get/diva2:910695/FULLTEXT01.pdf
>>
>> On 11 July 2016 at 11:31, Kostas Kloudas <k.klou...@data-artisans.com>
>> wrote:
>>
>>> Hi Do,
>>>
>>> In DataStream you can always implement your own
>>> sampling function, hopefully without too much effort.
>>>
>>> Adding such functionality to the API could be a good idea.
>>> But given that in sampling there is no “one-size-fits-all”
>>> solution (as not every use case needs random sampling and not
>>> all random samplers fit all workloads), I am not sure if we
>>> should start adding different sampling operators.
>>>
>>> Thanks,
>>> Kostas
>>>
>>>> On Jul 9, 2016, at 5:43 PM, Greg Hogan <c...@greghogan.com> wrote:
>>>>
>>>> Hi Do,
>>>>
>>>> DataSet provides a stable @Public interface. DataSetUtils is marked
>>>> @PublicEvolving, which is intended for public use and has stable
>>>> behavior, but method signatures may change. It's also good to limit
>>>> DataSet to common methods, whereas the utility methods tend to be
>>>> used for specific applications.
>>>>
>>>> I don't have the pulse of streaming, but this sounds like a useful
>>>> feature that could be added.
>>>>
>>>> Greg
>>>>
>>>> On Sat, Jul 9, 2016 at 10:47 AM, Le Quoc Do <lequo...@gmail.com> wrote:
>>>>
>>>>> Hi all,
>>>>>
>>>>> I'm working on approximate computing using sampling techniques. I
>>>>> noticed that Flink supports the sample function for DataSet
>>>>> (org/apache/flink/api/java/utils/DataSetUtils.java). I'm just
>>>>> wondering why you didn't merge the function into
>>>>> org/apache/flink/api/java/DataSet.java, since the sample function
>>>>> works as a transformation operator?
>>>>>
>>>>> My second question: are you planning to support the sample
>>>>> function for DataStream (within windows)? I did not see it in the
>>>>> DataStream code.
>>>>>
>>>>> Thank you,
>>>>> Do
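
For readers following the stratified sampling request in this thread, here is a minimal sketch of one possible interpretation: a fixed-size sample per stratum, drawn with reservoir sampling (Algorithm R). It is plain Java over an in-memory Iterable and is not Flink API. The names StratifiedSampler, sample, keyOf, perStratum and seed are hypothetical and chosen only for illustration; a distributed DataSet or windowed DataStream version would need additional work that is not shown here.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Random;
import java.util.function.Function;

// Hypothetical helper, not part of Flink: per-stratum reservoir sampling.
final class StratifiedSampler {

    private StratifiedSampler() {}

    /**
     * Returns, for every stratum key, a uniform random sample of at most
     * perStratum items. Strata smaller than perStratum are returned whole.
     */
    static <T, K> Map<K, List<T>> sample(
            Iterable<T> input, Function<T, K> keyOf, int perStratum, long seed) {
        if (perStratum <= 0) {
            throw new IllegalArgumentException("perStratum must be positive");
        }
        Random rnd = new Random(seed);
        Map<K, List<T>> reservoirs = new HashMap<>();
        Map<K, Integer> seen = new HashMap<>();

        for (T item : input) {
            K key = keyOf.apply(item);
            List<T> reservoir = reservoirs.computeIfAbsent(key, k -> new ArrayList<>());
            // Number of items seen so far in this stratum, including this one.
            int n = seen.merge(key, 1, Integer::sum);

            if (reservoir.size() < perStratum) {
                // Fill phase: keep the first perStratum items of the stratum.
                reservoir.add(item);
            } else {
                // Algorithm R: the new item replaces a random slot with
                // probability perStratum / n.
                int j = rnd.nextInt(n);
                if (j < perStratum) {
                    reservoir.set(j, item);
                }
            }
        }
        return reservoirs;
    }
}
```

Calling StratifiedSampler.sample(records, r -> r.getCategory(), 100, 42L) would, under these assumptions, keep up to 100 records per category (getCategory is likewise a made-up accessor). A fraction-based variant, closer in spirit to the simple random sampling discussed above, would flip a biased coin per item with a per-stratum probability instead of maintaining reservoirs.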