Re: [DISCUSS] FLIP proposal for Model Serving over Flink

2017-07-05 Thread Theodore Vasiloudis
Hello all,

I just wanted to indicate that I would be very willing to help out with
code reviews
for this project and participating in design discussions.

But I should note that I don't think I'll have time to contribute code until
I get back to Stockholm in September.

Regards,
Theodore

On Tue, Jul 4, 2017 at 9:41 AM, Andrea Spina <andrea.sp...@radicalbit.io>
wrote:

> Hi all,
> yes, we did too. We - from Radicalbit - have submitted a talk focused
> on the recently released flink-jpmml library about model serving.
> Lately, it became part of the FlinkML project.
>
> Cheers, Andrea
>
> 2017-07-04 16:14 GMT+02:00 Boris Lublinsky <boris.lublin...@lightbend.com
> >:
> > Yes,
> > I submitted a talk with Stavros on model serving
> >
> >
> > Boris Lublinsky
> > FDP Architect
> > boris.lublin...@lightbend.com
> > https://www.lightbend.com/
> >
> > On Jul 3, 2017, at 1:18 PM, Robert Metzger <rmetz...@apache.org> wrote:
> >
> > Big +1 from my side on getting this effort started.
> >
> > Users have asked for this and I would like to see some progress there.
> > Did anybody submit a talk about the ML efforts to Flink Forward Berlin
> this
> > year?
> >
> > On Fri, Jun 30, 2017 at 6:04 PM, Fabian Hueske <fhue...@gmail.com>
> wrote:
> >>
> >> Yes, I know that Theo is engaged in the ML efforts but wasn't sure how
> >> much
> >> he is involved in the model serving part (thought he was more into the
> >> online learning part).
> >> It would be great if Theo could help here!
> >>
> >> I just wanted to make sure that we find somebody to help bootstrapping.
> >>
> >> Cheers, Fabian
> >>
> >>
> >> 2017-06-30 17:52 GMT+02:00 Stavros Kontopoulos <
> st.kontopou...@gmail.com>:
> >>
> >> > Hi Fabian,
> >> >
> >> > However, we should keep in mind that we need a committer to bootstrap
> >> > the
> >> > > new module.
> >> >
> >> >
> >> > Absolutely. I thought Theodore Vasiloudis could help as an initial
> >> > committer.
> >> > Is this known? He is part of the effort, by the way.
> >> >
> >> > Best,
> >> > Stavros
> >> >
> >> > On Fri, Jun 30, 2017 at 6:42 PM, Fabian Hueske <fhue...@gmail.com>
> >> > wrote:
> >> >
> >> > > Thanks Stavros (and everybody else involved) for starting this
> effort
> >> > > and
> >> > > bringing the discussion back to the mailing list.
> >> > >
> >> > > As I said before, a model serving module/component would be a great
> >> > feature
> >> > > for Flink.
> >> > > I see the biggest advantage for such a module in the integration
> with
> >> > > the
> >> > > other APIs and libraries, such as DataStream, CEP, SQL.
> >> > >
> >> > > A FLIP would be a great way to continue your efforts and work on a
> >> > > design
> >> > > for the component.
> >> > >
> >> > > However, we should keep in mind that we need a committer to
> bootstrap
> >> > > the
> >> > > new module.
> >> > > As people are contributing to the model serving module, the number
> of
> >> > > committers should hopefully grow after some time.
> >> > >
> >> > > Best, Fabian
> >> > >
> >> > > 2017-06-30 10:58 GMT+02:00 Stavros Kontopoulos
> >> > > <st.kontopou...@gmail.com
> >> > >:
> >> > >
> >> > > > Hi all,
> >> > > >
> >> > > > After coordinating with Theodore Vasiloudis and the guys behind
> the
> >> > Flink
> >> > > > Model Serving effort (Eron, Radicalbit people, Boris, Bas (ING)),
> we
> >> > > > propose to start working on the model serving over Flink in a more
> >> > > official
> >> > > > way.
> >> > > > That translates to capturing design details in a FLIP document.
> >> > > >
> >> > > > Please let's discuss and vote whether you think this FLIP would be
> >> > > viable.
> >> > > >
> >> > > > Model Serving as a Flink component might involve a lot of work and
> >> > > > we
> >> > > need
> >> > > > to commit to support it in future Flink releases.
> >> > > >
> >> > > > In the meantime a lot of people have joined the Flink ML Slack channel
> >> > > > (https://flinkml.slack.com, https://flinkml-invites.herokuapp.com/)
> >> > > > and I think it's time to try to get them gradually on board.
> >> > > >
> >> > > > So far we have several efforts hosted here:
> >> > > > https://github.com/FlinkML
> >> > > >
> >> > > > Related documents for what we are doing:
> >> > > >
> >> > > > Flink ML roadmap
> >> > > >
> >> > > > https://docs.google.com/document/d/1afQbvZBTV15qF3vobVWUjxQc49h3U
> >> > > > d06MIRhahtJ6dw/edit
> >> > > >
> >> > > > Flink MS
> >> > > >
> >> > > > https://docs.google.com/document/d/1CjWL9aLxPrKytKxUF5c3ohs0ickp0
> >> > > > fdEXPsPYPEywsE/edit#
> >> > > >
> >> > > > PS. I will work on the last document over the next few days to
> >> > > > consolidate the results of the effort to some extent and break the work down.
> >> > > > Our target is to provide a generic API, based on a plugin architecture,
> >> > > > to serve different popular models/pipelines along with custom ones
> >> > > > over Flink.
> >> > > >
> >> > > > Best,
> >> > > > Stavros
> >> > > >
> >> > >
> >> >
> >
> >
> >
>
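To make the plugin-based serving API sketched in this thread concrete, here is a minimal illustration of one possible shape. It is purely a sketch: the trait and method names (ServedModel, ModelPlugin, serve) are invented for illustration and are not taken from the FLIP or from any FlinkML repository.

import org.apache.flink.api.common.typeinfo.TypeInformation
import org.apache.flink.streaming.api.scala._

// A served model is anything that can score a single input record.
trait ServedModel[IN, OUT] extends Serializable {
  def score(input: IN): OUT
}

// A plugin knows how to load models of one particular format (PMML, TensorFlow, ...).
trait ModelPlugin[IN, OUT] extends Serializable {
  def format: String
  def load(modelBytes: Array[Byte]): ServedModel[IN, OUT]
}

object ModelServingSketch {
  // Scores an event stream with one statically loaded model. A real design would
  // also handle model updates at runtime, e.g. via a connected control stream.
  def serve[IN: TypeInformation, OUT: TypeInformation](
      events: DataStream[IN],
      plugin: ModelPlugin[IN, OUT],
      modelBytes: Array[Byte]): DataStream[OUT] = {
    val model = plugin.load(modelBytes)
    events.map(event => model.score(event))
  }
}

Under such a scheme each model format would ship as its own plugin while the serving operator stays format-agnostic; specifying how models are updated while the job is running is exactly the kind of detail the proposed FLIP document would need to pin down.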


Re: FlinkML on slack

2017-06-22 Thread Theodore Vasiloudis
Hello all,

We've created an app to automate the invite process; now you can just use
the following link
to get an invite to the FlinkML Slack group:

https://flinkml-invites.herokuapp.com/

Regards,
Theodore

On Tue, Jun 20, 2017 at 8:45 AM, Stavros Kontopoulos <
st.kontopou...@gmail.com> wrote:

> Sebastian, Jark, Shaoxuan: done.
> Stavros
>
> On Tue, Jun 20, 2017 at 11:09 AM, Sebastian Schelter <
> ssc.o...@googlemail.com> wrote:
>
> > I'd also like to get an invite to this slack, my email is s...@apache.org
> >
> > Best,
> > Sebastian
> >
> > 2017-06-20 8:37 GMT+02:00 Jark Wu :
> >
> > > Hi, Stavros:
> > > Could you please invite me to the FlinkML slack channel as well? My
> email
> > > is: imj...@gmail.com
> > >
> > > Thanks,
> > > Jark
> > >
> > > 2017-06-20 13:58 GMT+08:00 Shaoxuan Wang :
> > >
> > > > Hi Stavros,
> > > > Can I get an invitation for the Slack channel?
> > > >
> > > > Thanks,
> > > > Shaoxuan
> > > >
> > > >
> > > > On Thu, Jun 8, 2017 at 3:56 AM, Stavros Kontopoulos <
> > > > st.kontopou...@gmail.com> wrote:
> > > >
> > > > > Hi all,
> > > > >
> > > > > We took the initiative to create the organization for FlinkML on
> > slack
> > > > > (thnx Eron).
> > > > > There is now a channel for model-serving
> > > > > <https://docs.google.com/document/d/1CjWL9aLxPrKytKxUF5c3ohs0ickp0fdEXPsPYPEywsE/edit#>.
> > > > > Another is coming for flink-jpmml.
> > > > > You are invited to join the channels and the efforts. @Gabor @Theo
> > > please
> > > > > consider adding channels for the other efforts there as well.
> > > > >
> > > > > FlinkML on Slack (https://flinkml.slack.com/)
> > > > >
> > > > > Details for the efforts here: Flink Roadmap doc
> > > > > <https://docs.google.com/document/d/1afQbvZBTV15qF3vobVWUjxQc49h3Ud06MIRhahtJ6dw/edit#>
> > > > >
> > > > > Github  (https://github.com/FlinkML)
> > > > >
> > > > >
> > > > > Stavros
> > > > >
> > > >
> > >
> >
>


Re: FlinkML on slack

2017-06-09 Thread Theodore Vasiloudis
Thank you for the guidelines, Robert. I've replaced the Flink logo with a
placeholder image.

On Thu, Jun 8, 2017 at 8:24 AM, Robert Metzger  wrote:

> I'm happy to see efforts towards machine learning on Apache Flink within
> the community!
>
> I think it's okay to have a GitHub repository for the ML stuff, but there
> are some rules by the ASF you have to follow: https://www.apache.org/
> foundation/marks/faq/#products
>
> As immediate steps, I would remove the Flink logo from the GitHub orga and
> add a disclaimer to the flink-modelserver repo.
> Also, I think it would be helpful to put a disclaimer (maybe into the orga
> description) that it is not an official Apache project.
>
> One last thing: I haven't completely forgotten the "repo split" discussion
> we had a while back. Maybe we could really start by just putting the ML
> library into a separate repo.
> Now that Theo is a committer, it should be easier to commit code from the
> entire ML crew into the repo.
>
>
>
> On Thu, Jun 8, 2017 at 4:32 PM, Stavros Kontopoulos <
> st.kontopou...@gmail.com> wrote:
>
>> @Ted Yu sure.
>>
>> On Thu, Jun 8, 2017 at 5:18 PM, Ted Yu  wrote:
>>
>> > Hi Stavros,
>> > Can you add me as well ?
>> >
>> > Thanks
>> >
>> > On Wed, Jun 7, 2017 at 12:56 PM, Stavros Kontopoulos <
>> > st.kontopou...@gmail.com> wrote:
>> >
>> > > Hi all,
>> > >
>> > > We took the initiative to create the organization for FlinkML on slack
>> > > (thnx Eron).
>> > > There is now a channel for model-serving
>> > > <https://docs.google.com/document/d/1CjWL9aLxPrKytKxUF5c3ohs0ickp0fdEXPsPYPEywsE/edit#>.
>> > > Another is coming for flink-jpmml.
>> > > You are invited to join the channels and the efforts. @Gabor @Theo
>> please
>> > > consider adding channels for the other efforts there as well.
>> > >
>> > > FlinkML on Slack (https://flinkml.slack.com/)
>> > >
>> > > Details for the efforts here: Flink Roadmap doc
>> > > <https://docs.google.com/document/d/1afQbvZBTV15qF3vobVWUjxQc49h3Ud06MIRhahtJ6dw/edit#>
>> > >
>> > > Github  (https://github.com/FlinkML)
>> > >
>> > >
>> > > Stavros
>> > >
>> >
>>
>
>


Re: Machine Learning on Flink - Next steps

2017-03-20 Thread Theodore Vasiloudis
Hello all,

I've updated the original Gdoc [1] to include a table with coordinators
and people interested in contributing to the specific projects. With
these latest additions we have many people willing to contribute to
the online learning library, and 2 people who have shown interest
in at least one of the other projects. Feel free to reassign yourself
if you feel like it; these are all indications of intention anyway, not
commitments (except for the coordinators).

I don't think I'll have the time to set up the online learning doc this
week; if anyone would like to jump ahead and do that, feel free.
Gabor has already started it for the "fast-batch" project, and Stavros has
started with the model serving project as well :)

@Ventura: I would love to see the design principles and abstractions you
have created for that project; let us know if there is anything you can share
now.

Regards,
Theodore


[1]
https://docs.google.com/document/d/1afQbvZBTV15qF3vobVWUjxQc49h3Ud06MIRhahtJ6dw/edit?usp=sharing

On Mon, Mar 20, 2017 at 3:56 PM, Gábor Hermann <m...@gaborhermann.com>
wrote:

> Hi all,
>
> @Theodore:
> +1 for the CTR use-case. Thanks for the suggestion!
>
> @Katherin:
> +1 for reflecting the choices made here and contributor commitment in Gdoc.
>
> @Tao, @Ventura:
> It's great to hear you have been working on ML on Flink :)
> I hope we can all aggregate our efforts somehow. It would be best if you
> could contribute some of your work.
>
>
> I've started putting together a Gdoc specifically for *Offline/incremental
> learning on Streaming API*:
> https://docs.google.com/document/d/18BqoFTQ0dPkbyO-PWBMMpW5N
> l0pjobSubnWpW0_r8yA/
> Right now you can comment/give suggestions there. I'd like to start a
> separate mailing list discussion as soon as there are enough contributors
> volunteering for this direction. For now, I'm trying to reflect the
> relevant part of the discussion here and the initial Gdoc [1].
>
>
> [1] https://docs.google.com/document/d/1afQbvZBTV15qF3vobVWUjxQc
> 49h3Ud06MIRhahtJ6dw/
>
> Cheers,
> Gabor
>
>
> On 2017-03-20 14:27, Ventura Del Monte wrote:
>
> Hello everyone,
>>
>> Here at DFKI, we are currently working on a project that involves developing
>> open-source Online Machine Learning algorithms on top of Flink.
>> So far, we have simple moments, sampling (e.g., simple reservoir sampling),
>> and sketches (e.g., Frequent Directions) built on top of scikit-like
>> abstractions and Flink's DataStream/KeyedStream.
>> Moreover, we have a few industrial use cases and we are going to validate our
>> implementation using real industrial data.
>> We plan to implement more advanced algorithms in the future as well as to
>> share our results with you and contribute, in case you are interested.
>>
>> Best,
>> Ventura
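As an illustration of the kind of primitive described above, here is a minimal sketch of per-key reservoir sampling on a KeyedStream using Flink's keyed state. It is not DFKI's implementation (which was not shared in this thread); the class name is invented and the update rule is simply the textbook Algorithm R.

import org.apache.flink.api.common.functions.RichFlatMapFunction
import org.apache.flink.api.common.state.{ValueState, ValueStateDescriptor}
import org.apache.flink.configuration.Configuration
import org.apache.flink.util.Collector

import scala.util.Random

// Keeps a fixed-size uniform sample of the values seen so far for each key and
// emits the current reservoir after every update.
class ReservoirSample(reservoirSize: Int)
    extends RichFlatMapFunction[(String, Double), (String, Seq[Double])] {

  @transient private var reservoir: ValueState[(Long, Vector[Double])] = _

  override def open(parameters: Configuration): Unit = {
    reservoir = getRuntimeContext.getState(
      new ValueStateDescriptor("reservoir", classOf[(Long, Vector[Double])]))
  }

  override def flatMap(in: (String, Double), out: Collector[(String, Seq[Double])]): Unit = {
    val (count, sample) = Option(reservoir.value()).getOrElse((0L, Vector.empty[Double]))
    val newCount = count + 1
    val newSample =
      if (sample.size < reservoirSize) {
        sample :+ in._2
      } else {
        // Algorithm R: keep the new value with probability reservoirSize / newCount.
        val slot = (Random.nextDouble() * newCount).toLong
        if (slot < reservoirSize) sample.updated(slot.toInt, in._2) else sample
      }
    reservoir.update((newCount, newSample))
    out.collect((in._1, newSample))
  }
}

// Usage sketch: events.keyBy(_._1).flatMap(new ReservoirSample(100))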
>>
>>
>>
>>
>>
>> On Mon, Mar 20, 2017 at 12:26 PM, Tao Meng <oatg...@gmail.com> wrote:
>>
>> Hi All,
>>>
>>> Sorry for joining this discussion late.
>>> My graduation thesis is about an online learning system. I will build it on
>>> Flink over the next three months.
>>>
>>> I'd like to contribute on:
>>>   - Online learning
>>>
>>>
>>>
>>>
>>> On Mon, Mar 20, 2017 at 6:51 PM Katherin Eri <katherinm...@gmail.com>
>>> wrote:
>>>
>>> Hello, Theodore
>>>
>>> Could you please move the development directions and their prioritized
>>> positions from *## Executive summary* to the Google doc?
>>>
>>>
>>>
>>> Could you please also create a table in the Google doc representing
>>> the selected directions and the people who would like to drive or
>>> participate
>>> in each particular topic, in order to make this process transparent for
>>> the community and to sum up the current state of contributor commitment?
>>>
>>> There we could simply sign ourselves up for a topic.
>>>
>>>
>>>
>>> And +1 for the CTR prediction case.
>>>
>>> Sun, 19 Mar 2017 at 16:49, Theodore Vasiloudis <
>>> theodoros.vasilou...@gmail.com>:
>

Re: Machine Learning on Flink - Next steps

2017-03-19 Thread Theodore Vasiloudis
Hello Stavros,

The way I thought we'd do it is that each shepherd would be responsible for
organizing the project: that includes setting up a Google doc, sending an
email to the dev list to inform the wider community, and if possible,
personally contacting the people who expressed interest in the project.

Would you be willing to lead that effort for the model serving project?

Regards,
Theodore

-- 
Sent from a mobile device. May contain autocorrect errors.

On Mar 19, 2017 3:49 AM, "Stavros Kontopoulos" 
wrote:

> Hi all...
>
> I agree about the TensorFlow integration; it seems to be important from what
> I hear.
> Should we sign up somewhere for the working groups (Gdocs)?
> I would like to start helping with the model serving feature.
>
> Best Regards,
> Stavros
>
> On Fri, Mar 17, 2017 at 10:34 PM, Gábor Hermann 
> wrote:
>
> > Hi Chen,
> >
> > Thanks for the input! :)
> >
> > There is already a project [1] for using TensorFlow models in Flink, and
> > Theodore has suggested
> > contacting the author, Eron Wright, for the model serving direction.
> >
> >
> > [1] http://sf.flink-forward.org/kb_sessions/introducing-flink-
> tensorflow/
> >
> > Cheers,
> > Gabor
> >
> >
> > On 2017-03-17 19:41, Chen Qin wrote:
> >
> >> [1]http://sf.flink-forward.org/kb_sessions/introducing-flink-te
> >> nsorflow/
> >>
> >
> >
>


Re: Machine Learning on Flink - Next steps

2017-03-17 Thread Theodore Vasiloudis
>> - A question I have not yet a good intuition on is whether the "model
>> evaluation" and the training part are so
>>  different (once a good abstraction for model evaluation has been
>> built)
>> that there is little cross-coordination needed,
>>  or whether there is potential in integrating them.
>>
>>
>> *Thoughts on the ML training library (DataSet API or DataStream API)*
>>
>>- I honestly don't quite understand what the big difference will be in
>> targeting the batch or streaming API. You can use the
>>  DataSet API in a quite low-level fashion (missing async iterations).
>>
>>- There seems especially now to be a big trend towards deep learning
>> (is
>> it just temporary or will this be the future?) and in
>>   that space, little works without GPU acceleration.
>>
>>- It is always easier to do something new than to be the n-th version
>> of
>> something existing (sorry for the generic truism).
>>  The latter admittedly gives the "all in one integrated framework"
>> advantage (which can be a very strong argument indeed),
>>  but the former attracts completely new communities and can often make
>> more impact with less effort.
>>
>>- The "new" is not required to be "online learning", where Theo has
>> described some concerns well.
>>  It can also be traditional ML re-imagined for "continuous
>> applications", as "continuous / incremental re-training" or so.
>>  Even on the "model evaluation side", there is a lot of interesting
>> stuff as mentioned already, like ensembles, multi-armed bandits, ...
>>
>>- It may be well worth tapping into the work of an existing library
>> (like
>> tensorflow) for an easy fix to some hard problems (pre-existing
>>  hardware integration, pre-existing optimized linear algebra solvers,
>> etc) and think about how such use cases would look like in
>>  the context of typical Flink applications.
>>
>>
>> *A bit of engine background information that may help in the planning:*
>>
>>- The DataStream API will in the future also support bounded data
>> computations explicitly (I say this not as a fact, but as
>>  a strong believer that this is the right direction).
>>
>>- Batch runtime execution has seen less focus recently, but seems to
>> get
>> a bit more community focus, because some organizations
>>  that contribute a lot want to use the batch side as well. For example
>> the effort on fine-grained recovery will strengthen batch a lot already.
>>
>>
>> Stephan
>>
>>
>>
>> On Tue, Mar 14, 2017 at 1:38 PM, Theodore Vasiloudis <
>> theodoros.vasilou...@gmail.com> wrote:
>>
>> Hello all,
>>>
>>> ## Executive summary:
>>>
>>> - Offline-on-streaming most popular, then online and model serving.
>>> - Need shepherds to lead development/coordination of each task.
>>> - I can shepherd online learning, need shepherds for the other two.
>>>
>>>
>>> so from the people sharing their opinion it seems most people would like
>>> to
>>> try out offline learning with the streaming API.
>>> I also think this is an interesting option, but probably the most risky
>>> of
>>> the bunch.
>>>
>>> After that online learning and model serving seem to have around the same
>>> amount of interest.
>>>
>>> Given that, and the discussions we had in the Gdoc, here's what I
>>> recommend
>>> as next actions:
>>>
>>> -
>>> *Offline on streaming: *Start by creating a design document, with an MVP
>>> specification about what we
>>> imagine such a library to look like and what we think should be
>>> possible
>>> to do.
>>> It should state clear goals and limitations; scoping the amount of
>>> work
>>> is
>>> more important at this point than specific engineering choices.
>>> -
>>> *Online learning: *If someone would like instead to work on online
>>> learning
>>> I can help out there,
>>> I have one student working on such a library right now, and I'm sure
>>> people
>>> at TU Berlin (Felix?) have similar efforts. Ideally we would like to
>>> communicate with
>>> them. Since this is a much more explored space, we could jump
>>> straight
>>> into a te

Re: Machine Learning on Flink - Next steps

2017-03-14 Thread Theodore Vasiloudis
s next week
> on
> > > matrix factorization. Naturally, I've run into issues. E.g. I could
> only
> > > mark the end of input with some hacks, because this is not needed at a
> > > streaming job consuming input forever. AFAIK, this would be resolved by
> > > side inputs [1].
> > >
> > > @Theodore:
> > > +1 for doing the prototype project(s) separately from the main Flink
> > > repository. However, I would strongly suggest following Flink
> > development
> > > guidelines as closely as possible. As another note, there is already a
> > > GitHub organization for Flink related projects [2], but it seems like
> it
> > > has not been used much.
> > >
> > > [1] https://cwiki.apache.org/confluence/display/FLINK/FLIP-17+
> > > Side+Inputs+for+DataStream+API
> > > [2] https://github.com/project-flink
> > >
> > >
> > > On 2017-03-04 08:44, Roberto Bentivoglio wrote:
> > >
> > > Hi All,
> > >>
> > >> I'd like to start working on:
> > >>   - Offline learning with Streaming API
> > >>   - Online learning
> > >>
> > >> I also think that using a new organisation on GitHub, as Theodore
> > proposed,
> > >> to keep some initial independence and to speed up the prototyping and
> > >> development
> > >> phases, is really interesting.
> > >>
> > >> I totally agree with Katherin, we need offline learning, but my
> opinion
> > is
> > >> that it will be more straightforward to fix the streaming issues than
> > >> the batch
> > >> issues, because we will have more support on that from the Flink
> > community.
> > >>
> > >> Thanks and have a nice weekend,
> > >> Roberto
> > >>
> > >> On 3 March 2017 at 20:20, amir bahmanyari <amirto...@yahoo.com.invalid
> >
> > >> wrote:
> > >>
> > >> Great points to start:
> > >>> - Online learning
> > >>> - Offline learning with the streaming API
> > >>>
> > >>> Thanks + have a great weekend.
> > >>>
> > >>>From: Katherin Eri <katherinm...@gmail.com>
> > >>>   To: dev@flink.apache.org
> > >>>   Sent: Friday, March 3, 2017 7:41 AM
> > >>>   Subject: Re: Machine Learning on Flink - Next steps
> > >>>
> > >>> Thank you, Theodore.
> > >>>
> > >>> In short, I vote for:
> > >>> 1) Online learning
> > >>> 2) Low-latency prediction serving -> Offline learning with the batch
> > API
> > >>>
> > >>> In details:
> > >>> 1) If streaming is Flink's strong side, let's use it, and try to
> support
> > >>> some online learning or lightweight in-memory learning algorithms.
> Try
> > to
> > >>> build a pipeline for them.
> > >>>
> > >>> 2) I think that Flink should be part of the production ecosystem, and if
> > now
> > >>> production systems require ML support, deployment of multiple models and so on,
> > we
> > >>> should serve this. But in my opinion we shouldn’t compete with
> > >>> projects like PredictionIO, but rather serve them and be an execution core.
> > But
> > >>> that means a lot:
> > >>>
> > >>> a. Offline training should be supported, because typically most of ML
> > >>> algs
> > >>> are for offline training.
> > >>> b. Model lifecycle should be supported:
> > >>> ETL+transformation+training+scoring+exploitation quality monitoring
> > >>>
> > >>> I understand that the batch world is full of competitors, but for me that
> > >>> doesn’t mean that batch should be ignored. I think that separate
> > >>> streaming/batch applications cause additional deployment and
> > >>> operational overhead, which people typically try to avoid. That means
> > >>> that
> > >>> we should attract the community to this problem, in my opinion.
> > >>>
> > >>>
> > >>> Fri, 3 Mar 2017 at 15:34, Theodore Vasiloudis <
> > >>> theodoros.vasilou...@gmail.com>:
> > >>>
> > >>> Hello all,
> > >>>
> > >>>  From our previous discussion started by Stavros, we decided to
> start a
> > >>> planning document [1]
> > >>> to figure out possible next steps for ML on Flink.
> > >>>
> > >>> Our concerns were mainly ensuring active development while
> satisfying
> > >>> the
> > >>> needs of
> > >>> the community.
> > >>>
> > >>> We have listed a number of proposals for future work in the document.
> > In
> > >>> short they are:
> > >>>
> > >>>- Offline learning with the batch API
> > >>>- Online learning
> > >>>- Offline learning with the streaming API
> > >>>- Low-latency prediction serving
> > >>>
> > >>> I saw there are a number of people willing to work on ML for Flink,
> but
> > >>> the
> > >>> truth is that we cannot
> > >>> cover all of these suggestions without fragmenting the development
> too
> > >>> much.
> > >>>
> > >>> So my recommendation is to pick out 2 of these options, create design
> > >>> documents and build prototypes for each library.
> > >>> We can then assess their viability and together with the community
> > decide
> > >>> if we should try
> > >>> to include one (or both) of them in the main Flink distribution.
> > >>>
> > >>> So I invite people to express their opinion about which task they
> would
> > >>> be
> > >>> willing to contribute
> > >>> and hopefully we can settle on two of these options.
> > >>>
> > >>> Once that is done we can decide how we do the actual work. Since this
> > is
> > >>> highly experimental
> > >>> I would suggest we work on repositories where we have complete
> control.
> > >>>
> > >>> For that purpose I have created an organization [2] on Github which
> we
> > >>> can
> > >>> use to create repositories and teams that work on them in an
> organized
> > >>> manner.
> > >>> Once enough work has accumulated we can start discussing contributing
> > the
> > >>> code
> > >>> to the main distribution.
> > >>>
> > >>> Regards,
> > >>> Theodore
> > >>>
> > >>> [1]
> > >>> https://docs.google.com/document/d/1afQbvZBTV15qF3vobVWUjxQc49h3U
> > >>> d06MIRhahtJ6dw/
> > >>> [2] https://github.com/flinkml
> > >>>
> > >>> --
> > >>>
> > >>> *Yours faithfully, *
> > >>>
> > >>> *Kate Eri.*
> > >>>
> > >>>
> > >>>
> > >>>
> > >>
> > >>
> > >
> >
>


Machine Learning on Flink - Next steps

2017-03-03 Thread Theodore Vasiloudis
Hello all,

From our previous discussion started by Stavros, we decided to start a
planning document [1]
to figure out possible next steps for ML on Flink.

Our concerns were mainly ensuring active development while satisfying the
needs of
the community.

We have listed a number of proposals for future work in the document. In
short they are:

   - Offline learning with the batch API
   - Online learning
   - Offline learning with the streaming API
   - Low-latency prediction serving

I saw there are a number of people willing to work on ML for Flink, but the
truth is that we cannot
cover all of these suggestions without fragmenting the development too much.

So my recommendation is to pick out 2 of these options, create design
documents and build prototypes for each library.
We can then assess their viability and together with the community decide
if we should try
to include one (or both) of them in the main Flink distribution.

So I invite people to express their opinion about which task they would be
willing to contribute to
and hopefully we can settle on two of these options.

Once that is done we can decide how we do the actual work. Since this is
highly experimental
I would suggest we work on repositories where we have complete control.

For that purpose I have created an organization [2] on Github which we can
use to create repositories and teams that work on them in an organized
manner.
Once enough work has accumulated we can start discussing contributing the
code
to the main distribution.

Regards,
Theodore

[1]
https://docs.google.com/document/d/1afQbvZBTV15qF3vobVWUjxQc49h3Ud06MIRhahtJ6dw/
[2] https://github.com/flinkml


Re: [DISCUSS] Flink ML roadmap

2017-03-03 Thread Theodore Vasiloudis
It seems like a relatively new project, backed by Intel.

My impression from the doc Roberto linked is that they might switch to
using Beam instead of Spark (?)

I'm cc'ing Soila, who is a developer of TAP and has worked on FlinkML in the
past; perhaps she has some input on how they plan to work with streaming
and ML in TAP.

Repos:
[1] https://github.com/tapanalyticstoolkit/

On Fri, Mar 3, 2017 at 12:24 PM, Stavros Kontopoulos <
st.kontopou...@gmail.com> wrote:

> Interesting, thanks @Roberto. I see that only the TAP Analytics Toolkit
> supports streaming. I am not aware of its market share; anyone?
>
> Best,
> Stavros
>
> On Fri, Mar 3, 2017 at 11:50 AM, Theodore Vasiloudis <
> theodoros.vasilou...@gmail.com> wrote:
>
> > Thank you for the links, Roberto. I did not know that Beam was working on
> an
> > ML abstraction as well. I'm sure we can learn from that.
> >
> > I'll start another thread today where we can discuss next steps and
> action
> > points now that we have a few different paths to follow listed on the
> > shared doc,
> > since our deadline was today. We welcome further discussions of course.
> >
> > Regards,
> > Theodore
> >
> > On Thu, Mar 2, 2017 at 10:52 AM, Roberto Bentivoglio <
> > roberto.bentivog...@radicalbit.io> wrote:
> >
> > > Hi All,
> > >
> > > First of all I'd like to introduce myself: my name is Roberto
> Bentivoglio
> > > and I'm currently working for Radicalbit, as is Andrea Spina (he already
> > wrote
> > > in this thread).
> > > I haven't had the chance to contribute directly to Flink up to now, but
> > > some colleagues of mine have been doing that for at least a year (they
> > > have also contributed to the machine learning library).
> > >
> > > I hope I'm not jumping into the discussion too late; it's really
> interesting
> > > and the analysis document depicts really well the scenarios
> > currently
> > > available. Many thanks for your effort!
> > >
> > > If I can add my two cents to the discussion I'd like to add the
> > following:
> > >  - it's clear that currently the Flink community is more deeply focused on
> > > streaming features than batch features. For this reason I think that
> > > implementing "Offline learning with Streaming API" is really a great idea.
> > >  - I think that the "Online learning" option is really a good fit for
> > > Flink, but maybe we could initially give a higher priority to
> the
> > > "Offline learning with Streaming API" option. However I think that this
> > > option will be the main goal for the mid/long term.
> > >  - we implemented a library based on jpmml-evaluator[1] and flink
> called
> > > "flink-jpmml". Using this library you can train the models on external
> > > systems and use those models, after exporting them in the PMML standard
> > > format, to run evaluations on top of the DataStream API. We haven't
> > > open-sourced this library yet, but we're planning to do so in the
> next
> > > few weeks. We'd like to complete the documentation and the final code
> reviews
> > > before sharing it. I hope it will be helpful for the community to
> > enhance
> > > the ML support on Flink.
> > >  - I'd also like to mention that the Apache Beam community is thinking
> about
> > an
> > > ML DSL. There is a design document and a couple of Jira tasks for that
> > > [2][3]
> > >
> > > At Radicalbit we're really keen to focus our effort on improving the ML
> > > support in Flink; we will contribute to this effort for sure on a regular
> basis
> > > with our team.
> > >
> > > Looking forward to work with you!
> > >
> > > Many thanks,
> > > Roberto
> > >
> > > [1] - https://github.com/jpmml/jpmml-evaluator
> > > [2] -
> > > https://docs.google.com/document/d/17cRZk_
> yqHm3C0fljivjN66MbLkeKS1yjo4PB
> > > ECHb-xA
> > > [3] - https://issues.apache.org/jira/browse/BEAM-303
> > >
> > > On 28 February 2017 at 19:35, Gábor Hermann <m...@gaborhermann.com>
> > wrote:
> > >
> > > > Hi Philipp,
> > > >
> > > > It's great to hear you are interested in Flink ML!
> > > >
> > > > Based on your description, your prototype seems like an interesting
> > > > approach for combining online+offline learning. If you're interested,
> > we
> > > > might find a way to integrate your work, or at least y
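Since flink-jpmml had not been open-sourced at the time of this thread, here is a rough, hypothetical sketch of the general pattern Roberto describes: train and export a model to PMML elsewhere, load it once per parallel task in open(), and score every event of a DataStream. The PmmlModel wrapper and the stub loading logic are invented placeholders, not the flink-jpmml or jpmml-evaluator API.

import org.apache.flink.api.common.functions.RichMapFunction
import org.apache.flink.configuration.Configuration
import org.apache.flink.streaming.api.scala._

// Placeholder for whatever wraps a parsed PMML document and can score a feature map.
// In practice this would delegate to a PMML evaluator library; the trait is illustrative.
trait PmmlModel extends Serializable {
  def predict(features: Map[String, Double]): Double
}

// Loads the exported model once per parallel task and scores every incoming event.
class PmmlScoringFunction(modelPath: String)
    extends RichMapFunction[Map[String, Double], Double] {

  @transient private var model: PmmlModel = _

  override def open(parameters: Configuration): Unit = {
    // A real implementation would read and unmarshal the PMML file at modelPath here
    // and wrap the resulting evaluator. This stub just sums the feature values so the
    // sketch stays self-contained.
    model = new PmmlModel {
      def predict(features: Map[String, Double]): Double = features.values.sum
    }
  }

  override def map(features: Map[String, Double]): Double = model.predict(features)
}

object PmmlScoringJob {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment

    // Hypothetical feature stream; a real job would read from Kafka, a socket, etc.
    val features = env.fromElements(
      Map("x1" -> 1.0, "x2" -> 2.0),
      Map("x1" -> 0.5, "x2" -> 1.5))

    features.map(new PmmlScoringFunction("/path/to/model.pmml")).print()

    env.execute("PMML scoring sketch")
  }
}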

Re: [DISCUSS] Flink ML roadmap

2017-03-03 Thread Theodore Vasiloudis
ML [2,3,4,5].
> >>>
> >>> [1] http://sf.flink-forward.org/program/sessions/
> >>> [2] http://sf.flink-forward.org/kb_sessions/experiences-with-str
> >>> eaming-vs-micro-batch-for-online-learning/
> >>> [3] http://sf.flink-forward.org/kb_sessions/introducing-flink-te
> >>> nsorflow/
> >>> [4] http://sf.flink-forward.org/kb_sessions/non-flink-machine-le
> >>> arning-on-flink/
> >>> [5] http://sf.flink-forward.org/kb_sessions/streaming-deep-learn
> >>> ing-scenarios-with-flink/
> >>>
> >>> Cheers,
> >>> Gabor
> >>>
> >>>
> >>> On 2017-02-23 15:19, Katherin Eri wrote:
> >>>
> >>>> I have already asked some teams for useful cases, but all of them need
> >>>> time
> >>>> to think.
> >>>> During the analysis something will eventually arise.
> >>>> Maybe we can ask Flink's partners for cases? Data Artisans got
> >>>> results
> >>>> of a customer survey [1]: better ML support is wanted, so we could ask
> >>>> what
> >>>> exactly is necessary.
> >>>>
> >>>> [1] http://data-artisans.com/flink-user-survey-2016-part-2/
> >>>>
> >>>> On Feb 23, 2017 4:32 PM, "Stavros Kontopoulos" <
> >>>> st.kontopou...@gmail.com> wrote:
> >>>>
> >>>> +100 for a design doc.
> >>>>>
> >>>>> Could we also set a roadmap after some time-boxed investigation
> >>>>> captured in
> >>>>> that document? We need action.
> >>>>>
> >>>>> Looking forward to working on this (whatever that might be) ;) Also are
> >>>>> there
> >>>>> any data supporting one direction or the other from a customer
> >>>>> perspective?
> >>>>> It would help to make more informed decisions.
> >>>>>
> >>>>> On Thu, Feb 23, 2017 at 2:23 PM, Katherin Eri <
> katherinm...@gmail.com>
> >>>>> wrote:
> >>>>>
> >>>>> Yes, ok.
> >>>>>> let's start some design document, and write down there already
> >>>>>> mentioned
> >>>>>> ideas about: parameter server, about clipper and others. Would be
> >>>>>> nice if
> >>>>>> we will also map this approaches to cases.
> >>>>>> Will work on it collaboratively on each topic, may be finally we
> will
> >>>>>>
> >>>>> form
> >>>>>
> >>>>>> some picture, that could be agreed with committers.
> >>>>>> @Gabor, could you please start such shared doc, as you have already
> >>>>>>
> >>>>> several
> >>>>>
> >>>>>> ideas proposed?
> >>>>>>
> >>>>>> Thu, 23 Feb 2017, 15:06 Gábor Hermann <m...@gaborhermann.com>:
> >>>>>>
> >>>>>> I agree, that it's better to go in one direction first, but I think
> >>>>>>> online and offline with streaming API can go somewhat parallel
> later.
> >>>>>>>
> >>>>>> We
> >>>>>
> >>>>>> could set a short-term goal, concentrate initially on one direction,
> >>>>>>>
> >>>>>> and
> >>>>>
> >>>>>> showcase that direction (e.g. in a blogpost). But first, we should
> >>>>>>> list
> >>>>>>> the pros/cons in a design doc as a minimum. Then make a decision
> what
> >>>>>>> direction to go. Would that be feasible?
> >>>>>>>
> >>>>>>> On 2017-02-23 12:34, Katherin Eri wrote:
> >>>>>>>
> >>>>>>> I'm not sure that this is feasible, doing all at the same time
> could
> >>>>>>>>
> >>>>>>> mean
> >>>>>>
> >>>>>>> doing nothing
> >>>>>>>> I'm just afraid, that words: we will work on streaming not on
> >>>>>>>>
> >>>>>>> batching,
> >>>>>
> >>>>>> we
> >>>>>>>
> >>>>>>>> have no commiter's time for t

Re: [DISCUSS] Flink ML roadmap

2017-02-23 Thread Theodore Vasiloudis
 to make more informed decisions.
>>>>>
>>>>> On Thu, Feb 23, 2017 at 2:23 PM, Katherin Eri <katherinm...@gmail.com>
>>>>> wrote:
>>>>>
>>>>> Yes, ok.
>>>>>
>>>>>> let's start some design document, and write down there already
>>>>>> mentioned
>>>>>> ideas about: parameter server, about clipper and others. Would be nice
>>>>>> if
>>>>>> we will also map this approaches to cases.
>>>>>> Will work on it collaboratively on each topic, may be finally we will
>>>>>>
>>>>>> form
>>>>>
>>>>> some picture, that could be agreed with committers.
>>>>>> @Gabor, could you please start such shared doc, as you have already
>>>>>>
>>>>>> several
>>>>>
>>>>> ideas proposed?
>>>>>>
>>>>>> Thu, 23 Feb 2017, 15:06 Gábor Hermann <m...@gaborhermann.com>:
>>>>>>
>>>>>> I agree, that it's better to go in one direction first, but I think
>>>>>>
>>>>>>> online and offline with streaming API can go somewhat parallel later.
>>>>>>>
>>>>>>> We
>>>>>> could set a short-term goal, concentrate initially on one direction,
>>>>>> and
>>>>>> showcase that direction (e.g. in a blogpost). But first, we should
>>>>>> list
>>>>>>
>>>>>>> the pros/cons in a design doc as a minimum. Then make a decision what
>>>>>>> direction to go. Would that be feasible?
>>>>>>>
>>>>>>> On 2017-02-23 12:34, Katherin Eri wrote:
>>>>>>>
>>>>>>> I'm not sure that this is feasible, doing all at the same time could
>>>>>>> mean
>>>>>>> doing nothing
>>>>>>>
>>>>>>>> I'm just afraid, that words: we will work on streaming not on
>>>>>>>>
>>>>>>>> batching,
>>>>>>>
>>>>>> we
>>>>>>
>>>>>>> have no commiter's time for this, mean that yes, we started work on
>>>>>>>> FLINK-1730, but nobody will commit this work in the end, as it
>>>>>>>>
>>>>>>>> already
>>>>>>>
>>>>>> was
>>>>>>
>>>>>>> with this ticket.
>>>>>>>>
>>>>>>>> On Feb 23, 2017 14:26, "Gábor Hermann" <
>>>>>>>>
>>>>>>>> m...@gaborhermann.com>
>>>>>>>
>>>>>>> wrote:
>>>>>>>>
>>>>>>>> @Theodore: Great to hear you think the "batch on streaming" approach
>>>>>>>> is
>>>>>>>>
>>>>>>> possible! Of course, we need to pay attention all the pitfalls
>>>>>>>
>>>>>>>> there,
>>>>>>>>
>>>>>>> if we
>>>>>>
>>>>>>> go that way.
>>>>>>>>
>>>>>>>>> +1 for a design doc!
>>>>>>>>>
>>>>>>>>> I would add that it's possible to make efforts in all the three
>>>>>>>>>
>>>>>>>>> directions
>>>>>>>> (i.e. batch, online, batch on streaming) at the same time. Although,
>>>>>>>> it
>>>>>>>>
>>>>>>> might be worth to concentrate on one. E.g. it would not be so useful
>>>>>>>
>>>>>>>> to
>>>>>>>>
>>>>>>> have the same batch algorithms with both the batch API and streaming
>>>>>>>
>>>>>>>> API.
>>>>>>>> We can decide later.
>>>>>>>>
>>>>>>>>> The design doc could be partitioned to these 3 directions, and we
>>>>>>>>>
>>>>>>>>> can
>>>>>>>>
>>>>>>> collect there the pros/cons too. What do you think?
>>>>>>
>>>>>>> Cheers,
>>>>>>>>> Gab

Re: [DISCUSS] Flink ML roadmap

2017-02-23 Thread Theodore Vasiloudis
Hello all,


@Gabor, we have discussed the idea of using the streaming API to write all
of our ML algorithms with a couple of people offline,
and I think it might be possible and is generally worth a shot. The
approach we would take would be close to Vowpal Wabbit, not exactly
"online", but rather "fast-batch".

There will be problems popping up again, even for very simple algorithms like
online linear regression with SGD [1], but hopefully fixing those will be
more aligned with the priorities of the community.
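To make the "fast-batch"/online direction a bit more concrete, here is a minimal sketch of a streaming SGD update for linear regression on the DataStream API. It is illustrative only: the class name is invented, the operator keeps its weights in plain (non-checkpointed) memory, assumes parallelism 1, and uses a fixed learning rate, so it is a sketch of the idea rather than FlinkML code.

import org.apache.flink.api.common.functions.RichMapFunction

// One labeled example: a feature vector plus its target value.
case class LabeledPoint(features: Array[Double], label: Double)

// Applies one SGD step per example (squared loss) and emits the current weights,
// so a downstream operator or sink always sees the latest model.
class StreamingSgdRegression(learningRate: Double, dimensions: Int)
    extends RichMapFunction[LabeledPoint, Array[Double]] {

  private var weights: Array[Double] = _

  override def map(point: LabeledPoint): Array[Double] = {
    if (weights == null) weights = Array.fill(dimensions)(0.0)
    val prediction = weights.zip(point.features).map { case (w, x) => w * x }.sum
    val error = prediction - point.label
    // Gradient of the squared loss for one example is error * x.
    for (i <- weights.indices) {
      weights(i) -= learningRate * error * point.features(i)
    }
    weights.clone()
  }
}

// Usage sketch: examples.map(new StreamingSgdRegression(0.01, 2)).setParallelism(1)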

@Katherin, my understanding is that given the limited resources, there is
no development effort focused on batch processing right now.

So to summarize, it seems like there are people willing to work on ML on
Flink, but nobody is sure how to do it.
There are many directions we could take (batch, online, batch on
streaming), each with its own merits and downsides.

If you want, we can start a design doc and move the conversation there, come
up with a roadmap and start implementing.

Regards,
Theodore

[1]
http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Understanding-connected-streams-use-without-timestamps-td10241.html

On Tue, Feb 21, 2017 at 11:17 PM, Gábor Hermann 
wrote:

> It's great to see so much activity in this discussion :)
> I'll try to add my thoughts.
>
> I think building a developer community (Till's 2. point) can be slightly
> separated from what features we should aim for (1. point) and showcasing
> (3. point). Thanks Till for bringing up the ideas for restructuring, I'm
> sure we'll find a way to make the development process more dynamic. I'll
> try to address the rest here.
>
> It's hard to choose directions between streaming and batch ML. As Theo has
> indicated, not much online ML is used in production, but Flink concentrates
> on streaming, so online ML would be a better fit for Flink. However, as
> most of you argued, there's definite need for batch ML. But batch ML seems
> hard to achieve because there are blocking issues with persisting,
> iteration paths etc. So it's no good either way.
>
> I propose a seemingly crazy solution: what if we developed batch
> algorithms also with the streaming API? The batch API would clearly seem
> more suitable for ML algorithms, but there are a lot of benefits to this
> approach too, so it's clearly worth considering. Flink also has the high
> level vision of "streaming for everything" that would clearly fit this
> case. What do you all think about this? Do you think this solution would be
> feasible? I would be happy to make a more elaborate proposal, but I push my
> main ideas here:
>
> 1) Simplifying by using one system
> It could simplify the work of both the users and the developers. One could
> execute training once, or could execute it periodically e.g. by using
> windows. Low-latency serving and training could be done in the same system.
> We could implement incremental algorithms, without any side inputs for
> combining online learning (or predictions) with batch learning. Of course,
> all the logic describing these must be somehow implemented (e.g.
> synchronizing predictions with training), but it should be easier to do so
> in one system, than by combining e.g. the batch and streaming API.
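To illustrate the point about doing periodic (re)training and low-latency serving inside one streaming job, here is a rough wiring sketch on the DataStream API. The types are made up and the "training" step is a stub that emits a zero model, so it only shows how a windowed model stream could be broadcast and connected to a query stream, not a real learner.

import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.functions.co.CoFlatMapFunction
import org.apache.flink.streaming.api.windowing.time.Time
import org.apache.flink.streaming.api.windowing.windows.TimeWindow
import org.apache.flink.util.Collector

object TrainAndServeSketch {
  type Model = Array[Double]                           // placeholder model representation
  case class Example(features: Array[Double], label: Double)

  def pipeline(examples: DataStream[Example],
               queries: DataStream[Array[Double]]): DataStream[Double] = {
    // "Training": periodically turn a window of examples into a new model.
    // A real job would run an actual learner here instead of this stub.
    val models: DataStream[Model] = examples
      .timeWindowAll(Time.minutes(5))
      .apply { (window: TimeWindow, batch: Iterable[Example], out: Collector[Model]) =>
        out.collect(Array.fill(batch.head.features.length)(0.0))
      }

    // "Serving": broadcast every new model to all prediction tasks and score
    // incoming queries with the latest model seen so far.
    queries
      .connect(models.broadcast)
      .flatMap(new CoFlatMapFunction[Array[Double], Model, Double] {
        private var current: Model = _
        override def flatMap1(query: Array[Double], out: Collector[Double]): Unit =
          if (current != null) out.collect(query.zip(current).map { case (x, w) => x * w }.sum)
        override def flatMap2(model: Model, out: Collector[Double]): Unit =
          current = model
      })
  }
}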
>
> 2) Batch ML with the streaming API is not harder
> Despite these benefits, it could seem harder to implement batch ML with
> the streaming API, but in my opinion it's not. There are more flexible,
> lower-level optimization potentials with the streaming API. Most
> distributed ML algorithms use a lower-level model than the batch API
> anyway, so sometimes it feels like forcing the algorithm logic into the
> training API and tweaking it. Although we could not use the batch
> primitives like join, we would have the E.g. in my experience with
> implementing a distributed matrix factorization algorithm [1], I couldn't
> do a simple optimization because of the limitations of the iteration API
> [2]. Even if we pushed all the development effort to make the batch API
> more suitable for ML there would be things we couldn't do. E.g. there are
> approaches for updating a model iteratively without locks [3,4] (i.e.
> somewhat asynchronously), and I don't see a clear way to implement such
> algorithms with the batch API.
>
> 3) Streaming community (users and devs) benefit
> The Flink streaming community in general would also benefit from this
> direction. There are many features needed in the streaming API for ML to
> work, but this is also true for the batch API. One really important is the
> loops API (a.k.a. iterative DataStreams) [5]. There has been a lot of
> effort (mostly from Paris) for making it mature enough [6]. Kate mentioned
> using GPUs, and I'm sure they have uses in streaming generally [7]. Thus,
> by improving the streaming API to allow ML algorithms, the streaming API
> benefits too (which is important as it has a lot more production users
> than the batch API).
>
> 4) Performance can be at least 

Re: [DISCUSS] Project build time and possible restructuring

2017-02-21 Thread Theodore Vasiloudis
Hello all,

From a library developer POV I think splitting up the project will have
more advantages than disadvantages.
API-breaking changes should become the responsibility of library
developers, and with automated tests
they shouldn't be too hard to catch.

I think I'm more in favor of synced releases, to not confuse users. If we
are going to be presenting the Flink stack
as an integrated product, as a user I would expect everything to be under
one release schedule and not
have to worry about different versions of different parts of the stack.

If we were to split, how does that work under the ASF? Is it possible to
have someone be a committer for
a library but not for the core?

Regards,
Theodore


On Tue, Feb 21, 2017 at 1:44 PM, Till Rohrmann  wrote:

> Hi Flink community,
>
> I'd like to revive a discussion about Flink's build time and project
> structure which we already had in some other mailing thread [1] and which
> we wanted do after the 1.2 release.
>
> Recently, we can see that Flink is exceeding more and more often Travis
> maximum build time of 50 minutes. This leads to failing builds as it can be
> seen here [2]. Almost 50 % of my last builds on Travis failed because of
> the 50 minutes time limit.
>
> The excess of the time limit not only prevents some tests (especially the
> YARN tests) from being executed regularly, but it also undermines people's
> trust in CI. We've seen in the past, when we had some flaky tests,
> that there was an acceptance to merge PRs even though Travis failed because
> the failing tests were "always" unrelated. But how sure can you be about
> that? Having a properly working and reliable CI system is imo crucial for
> guaranteeing Flink's high quality standard.
>
> In the past we've split Flink's tests into two groups which are executed
> separately in order to cope with increasing build times. This could again
> be a solution to the problem.
>
> However, there is also another problem of slowly increasing build times for
> Flink. On my machine building Flink with deactivated tests takes about 10
> minutes. That's mainly because Flink has grown quite big, now containing not
> only the runtime and APIs but also several libraries and a contribution
> module. Stephan proposed to split up the repository into the following set
>
>   - flink-core (core, apis, runtime, clients)
>   - flink-libraries (gelly, ml, cep, table, scala shell, python)
>   - flink-connectors
>   - flink-contrib
>
> in order to make the project more maintainable and to decrease build as
> well as test times. Of course such a split would raise the question of how and
> how often the individual modules are released. Will they follow an
> independent release cycle or will they be synched? Moreover, the problem of
> API stability across module boundaries will arise. Changing things in the
> core repository might break things in a library repository and since they
> are independent this break might go unnoticed for some time. Stephan's
> proposal also includes that the new repositories will be governed by the
> same PMC.
>
> A little bit off-topic but also somewhat related is how we handle the load
> of outside contributions for modules where we don't have many committers
> present. Good examples (actually they are bad examples for community work)
> are the ML and the CEP library. These libraries started promising and
> attracted outside contributions. However, due to a lack of committers who
> could spend time on these libraries, their development stalled and made
> many contributors turn away from it. Maybe such a split makes things easier
> wrt to making more contributors committers. Moreover, an independent
> release cycle for volatile projects might help increasing adoption, because
> bug-fixes can be delivered more frequently.
>
> Recently, I've seen an increased interest in and really good discussions
> about FlinkML's future [3]. I really would not like to repeat the same
> mistakes and let this effort die again by simply being not responsive to
> contributors who would like to get involved. The only way I see this
> happening is to add more committers to the ML library. And maybe we feel
> more comfortable adding new committers faster to repos which are not Flink
> core.
>
> I know we should first discuss the former problem and find a conclusion
> there. But I mentioned the outside contributors problem as well because it
> is an argument for a repo split.
>
> [1]
> http://apache-flink-mailing-list-archive.1008284.n3.
> nabble.com/Travis-CI-tt14478.html
> [2] https://travis-ci.org/tillrohrmann/flink/builds/203479275
> [3]
> http://apache-flink-mailing-list-archive.1008284.n3.
> nabble.com/DISCUSS-Flink-ML-roadmap-tt16040.html
>
> Cheers,
> Till
>


Re: [DISCUSS] Flink ML roadmap

2017-02-21 Thread Theodore Vasiloudis
Thank you all for your thoughts on the matter.

Andrea brought up some further engine considerations that we need to
address in order to have a competitive ML engine on Flink.

I'm happy to see many people willing to contribute to the development of ML
on Flink. The way I see it, there needs to be buy-in from the rest of the
community for such changes to go through.

If you are then interested in helping out, the issues
mentioned in my previous email or the ones mentioned by Andrea are the most
critical ones to tackle, as they require making changes to the core.

If you want to take up one of those issues the best way is to start a
conversation on the list, and gauge the opinion of the community.

Finally, as Stavros mentioned, we need to come up with an updated roadmap
for FlinkML that includes these issues.

@Andrea, the idea of an online learning library for Flink has been broached
before, and this semester I have one Master's student working on exactly
that. From my conversations with people in industry, however, almost
nobody uses online learning in production; at best, models are updated every
5 minutes. So the impact would probably not be very large.

I would like to bring up again the topic of model serving that I think fits
the Flink use-case much better. Developing a system like Clipper [1] on top
of Flink could be one of the best ways to use Flink for ML.

Regards,
Theodore

[1]  Clipper: A Low-Latency Online Prediction Serving System -
https://arxiv.org/abs/1612.03079

On Tue, Feb 21, 2017 at 12:10 AM, Andrea Spina 
wrote:

> Hi all,
>
> Thanks Stavros for pushing forward the discussion which I feel really
> relevant.
>
> Since I'm only now starting to engage actively with the community and I don't
> have much experience and visibility around the Flink community, I'll limit
> myself to sharing an opinion as a Flink user.
>
> I've been using Flink for almost a year across two different experiences, and
> I've bumped into the question "how to handle ML workloads and keep Flink as
> the main engine?" in both cases. The first point that arises in my mind is:
> why
> do I need to adopt an extra system for purely ML purposes? How amazing would it
> be to use the Flink engine as an ML feature provider and to avoid paying
> the effort of maintaining an additional engine? This thought also links to @Timur's
> opinion: I believe that users would much prefer a unified architecture
> in this case. Even if a user wants to use an external tool/library - perhaps
> providing additional language support (e.g. R) - that user should be
> able to run it on top of Flink.
>
> In my work with Flink I needed to implement some ML algorithms on both
> Flink and Spark, and I often struggled with Flink's performance: namely, I
> think (for the sake of the bigger picture) we should first focus the effort
> on solving some well-known Flink limitations, as @theodore pinpointed. I'd
> like to highlight [1] and [2], which I find relevant. If the community
> decides to go ahead with FlinkML, I believe fixing the above-described
> issues may be a good starting point. That would also definitely push
> forward
> some important integrations such as Apache SystemML.
>
> Given all these points, I'm increasingly convinced that Online Machine
> Learning would be the real final objective and the more suitable goal, since
> we're talking about a real-time streaming engine, and - from a very high-level
> point of view - I believe Flink would fit this topic in a more natural way
> than the batch case. We have a connector for Apache SAMOA, but it seems to be at an
> early stage of development IMHO and not really active. If we want to build
> something within Flink instead, we need to speed up the design of some
> features (e.g. side inputs [3]).
>
> I really hope we can define a new roadmap by which we can finally push
> forward the topic. I will do my best to help with this.
>
> Sincerely,
> Andrea
>
> [1] Add a FlinkTools.persist style method to the Data Set
> https://issues.apache.org/jira/browse/FLINK-1730
> [2] Only send data to each taskmanager once for broadcasts
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-
> 5%3A+Only+send+data+to+each+taskmanager+once+for+broadcasts
> [3] Side inputs - Evolving or static Filter/Enriching
> https://docs.google.com/document/d/1hIgxi2Zchww_5fWUHLoYiXwSBXjv-M5eOv-
> MKQYN3m4/edit#
> http://apache-flink-mailing-list-archive.1008284.n3.
> nabble.com/DISCUSS-Add-Side-Input-Broadcast-Set-For-
> Streaming-API-td11529.html
>
>
>
>


Re: [DISCUSS] Flink ML roadmap

2017-02-20 Thread Theodore Vasiloudis
Hello all,

thank you for opening this discussion Stavros, note that it's almost
exactly 1 year since I last opened such a topic (linked by Gabor) and the
comments there are still relevant.

I think Gabor described the current state quite well: development in the
libraries is hard without committers dedicated to each project, and as a
result FlinkML and CEP have stalled.

I think it's important to look at why development has stalled as well. As
people have mentioned there's a multitude of ML libraries out there and my
impression was that not many people are looking to use Flink for ML. Lately
that seems to have changed (with some interest shown in the Flink survey as
well).

Gabor makes some good points about future directions for the library. Our
initial goal [1] was to make a truly scalable, easy to use library, within
the Flink ecosystem, providing a set of "workhorse" algorithms, sampled
from what's actually being used in the industry. We planned for a library
that has few algorithms, but does them properly.

If we decide to go the way of focusing within Flink we face some major
challenges, because these are system limitations that do not necessarily
align with the goals of the community. Some issues relevant to ML on Flink
are:

   - FLINK-2396 - Review the datasets of dynamic path and static path in
   iteration.
   https://issues.apache.org/jira/browse/FLINK-2396
   This has to do with the ability to iterate over one dataset (model) while
   changing another (dataset), which is necessary for many ML algorithms like
   SGD.
   - FLINK-1730 - Add a FlinkTools.persist style method to the Data Set.
   https://issues.apache.org/jira/browse/FLINK-1730
   This is again relevant to many algorithms, to create intermediate
   results etc.; for example, L-BFGS development has been attempted 2-3 times,
   but always abandoned because the need to collect a DataSet kills the
   performance.
   - FLINK-5782 - Support GPU calculations
   https://issues.apache.org/jira/browse/FLINK-5782
   Many algorithms will benefit greatly from GPU-accelerated linear algebra,
   to the point where not supporting it puts a library at a severe
   disadvantage compared to other offerings.


These issues aside, Stephan has mentioned recently the possibility of
re-structuring the Flink project to allow for more flexibility for the
libraries. I think that sounds quite promising and it should allow the
development to pick up in the libraries, if we can get some more people
reviewing and merging PRs.

I would be all for updating our vision and roadmap to match what the
community desires from the library.

[1]
https://cwiki.apache.org/confluence/display/FLINK/FlinkML%3A+Vision+and+Roadmap

On Mon, Feb 20, 2017 at 12:47 PM, Gábor Hermann 
wrote:

> Hi Stavros,
>
> Thanks for bringing this up.
>
> There have been past [1] and recent [2, 3] discussions about the Flink
> libraries, because there are some stalling PRs and overloaded committers.
> (Actually, Till is the only committer shepherd of the both the CEP and ML
> library, and AFAIK he has a ton of other responsibilities and work to do.)
> Thus it's hard to get code reviewed and merged, and without merged code
> it's hard to get a committer status, so there are not many committers who
> can review e.g. ML algorithm implementations, and the cycle goes on. Until
> this is resolved somehow, we should help the committers by reviewing
> each-others PRs.
>
> I think prioritizing features (b) is a good way to start. We could declare
> most blocking features and concentrate on reviewing and merging them before
> moving forward. E.g. the evaluation framework is quite important for an ML
> library in my opinion, and has a PR stalling for long [4].
>
> Regarding c), there are general style guides for contributing to Flink,
> so we should follow those. Is there something more ML-specific you think we
> could follow? We should definitely declare that we follow scikit-learn and make
> sure contributions comply with that.
>
> In terms of features (a, d), I think we should first see the bigger
> picture. That is, it would be nice to discuss a clearer direction for Flink
> ML. I've seen a lot of interest in contributing to Flink ML lately. I
> believe we should rethink our goals, to put the contribution efforts in
> making a usable and useful library. Are we trying to implement as many
> useful algorithms as possible to create a scalable ML library? That would
> seem ambitious, and of course there are a lot of frameworks and libraries
> that already have something like this as a goal (e.g. Spark MLlib, Mahout).
> Should we rather create connectors to existing libraries? Then we cannot
> really do Flink-specific optimizations. Should we go for online machine
> learning (as Flink is concentrating on streaming)? We already have a
> connector to SAMOA. We could go on with questions like this. Maybe I'm
> missing something, but I haven't seen such directions declared.
>
> Cheers,
> Gabor

Re: Using QueryableState inside Flink jobs (and Parameter Server implementation)

2017-02-14 Thread Theodore Vasiloudis
Hello all,

I would also be really interested in how a PS-like architecture would work
in Flink. Note that we are not necessarily talking about a PS, but generally
about how QueryableState can be used for ML tasks, with, I guess, a focus on
model-parallel training.

One suggestion I would make is to take a look at TensorFlow, which is also a
dataflow model that has support for distributed computation, both data and
model parallel.

I don't know too much about the internal workings of the system, but I
would point out this from the TF whitepaper [1], Section 11 related work:

> It also permits a significant simplification by
> allowing the expression of stateful parameter nodes as
> variables, and variable update operations that are just
> additional nodes in the graph; in contrast, DistBelief,
> Project Adam and the Parameter Server systems all have
> whole separate parameter server subsystems devoted to
> communicating and updating parameter values.
>

I think the Related work section is the most relevant to this discussion as
it discusses the differences between the programming models in Spark, Naiad
etc. to the TF model.

Also re. fault tolerance:

> When a failure is detected, the entire graph execution
> is aborted and restarted from scratch. Recall however
> that Variable nodes refer to tensors that persist across
> executions of the graph. We support consistent checkpointing
> and recovery of this state on a restart. In particular,
> each Variable node is connected to a Save node. These
> Save nodes are executed periodically, say once every N
> iterations, or once every N seconds. When they execute,
> the contents of the variables are written to persistent
> storage, e.g., a distributed file system. Similarly each
> Variable is connected to a Restore node that is only enabled
> in the first iteration after a restart.
>

[1] http://download.tensorflow.org/paper/whitepaper2015.pdf

On Tue, Feb 14, 2017 at 3:18 PM, Gábor Hermann 
wrote:

> Hey Ufuk,
>
> I'm happy to contribute. At least I'll get a bit more understanding of the
> details.
>
> Breaking the assumption that only a single thread updates state would
> bring us from strong isolation guarantees (i.e. serializability at the
> updates and read committed at the external queries) to no isolation
> guarantees. That's not something to be taken lightly. I think that these
> guarantees would be more easily provided for inside queries that modify
> (setKvState), but that's still not trivial.
>
> Indeed, the iteration approach works better for the use-cases I mentioned,
> at least for now.
>
> Cheers,
> Gabor
>
>
> On 2017-02-14 14:43, Ufuk Celebi wrote:
>
> Hey Gabor,
>>
>> great ideas here. It's only slightly related, but I'm currently working
>> on a proposal to improve the queryable state APIs for lookups (partly along
>> the lines of what you suggested with higher level accessors). Maybe you are
>> interested in contributing there?
>>
>> I really like your ideas for the use cases you describe, but I'm unsure
>> about the write path (setKvState), because of the discussed implications to
>> the state backends. I think that this needs more discussion and
>> coordination with the contributors working on the backends. For example,
>> one assumption so far was that only a single thread updates state and we
>> don't scope state per checkpoint (to provide "isolation levels" for the
>> queries; read committed vs. read uncommitted) and probably more.
>>
>> Because of this I would actually lean towards the iteration approach in a
>> first version. Would that be a feasible starting point for you?
>>
>> – Ufuk
>>
>> On 14 February 2017 at 14:01:21, Gábor Hermann (m...@gaborhermann.com)
>> wrote:
>>
>>> Hi Gyula, Jinkui Shi,
>>>   Thanks for your thoughts!
>>>   @Gyula: I'll try and explain a bit more detail.
>>>   The API could be almost like the QueryableState's. It could be
>>> higher-level though: returning Java objects instead of serialized data
>>> (because there would not be issues with class loading). Also, it could
>>> support setKvState (see my point 5). This could lead to both potential
>>> performance improvements and easier usage (see my points 2 and 3).
>>> A use case could be anything where we'd use an external KV-store.
>>> For instance, we are updating user states based on another user's state,
>>> so in the map function we do a query (in pseudo-ish Scala code):
>>> users.keyBy(0).flatMapWithState { (userEvent, collector) =>
>>>   val thisUser: UserState = state.get()
>>>   val otherUser: Future[UserState] =
>>>     qsClient.getKvState("users", userEvent.otherUserId)
>>>   otherUser.onSuccess { case otherUserState =>
>>>     state.update(someFunc1(thisUser, otherUserState))
>>>     collector.collect(someFunc2(thisUser, otherUserState))
>>>   }
>>> }
>>>   Another example could be (online) distributed matrix factorization,
>>> where the two factor matrices are represented by distributed states. One
>>> is 

Re: New Flink team member - Kate Eri.

2017-02-06 Thread Theodore Vasiloudis
ion of current PR
> **independently **from
> integration to DL4J.*
>
>
>
> Could you please provide your opinion regarding my questions and points,
> what do you think about them?
>
>
>
> On Mon, Feb 6, 2017 at 12:51, Katherin Eri <katherinm...@gmail.com>:
>
> > Sorry, guys I need to finish this letter first.
> >   Full version of it will come shortly.
> >
> > On Mon, Feb 6, 2017 at 12:49, Katherin Eri <katherinm...@gmail.com>:
> >
> > Hello, guys.
> > Theodore, last week I started the review of the PR:
> > https://github.com/apache/flink/pull/2735 related to *word2Vec for Flink*.
> >
> > During this review I have asked myself: why do we need to implement such a
> > very popular algorithm like *word2vec one more time*, when there is
> > already an available Java implementation provided by the deeplearning4j.org
> > <https://deeplearning4j.org/word2vec> library (DL4J -> Apache 2 licence)?
> > This library tries to promote itself, there is hype around it in the ML
> > sphere, and it was integrated with Apache Spark to provide scalable
> > deep learning calculations.
> > That's why I thought: could we also integrate Flink with this library?
> > 1) Personally I think providing support for and deployment of deep learning
> > algorithms/models in Flink is a promising and attractive feature, because:
> > a) during the last two years deep learning has proven its efficiency and
> > these algorithms are used in many applications. For example, *Spotify* uses
> > DL-based algorithms for music content extraction: Recommending music on
> > Spotify with deep learning, AUGUST 05, 2014
> > <http://benanne.github.io/2014/08/05/spotify-cnns.html>, for their music
> > recommendations. Doing this in a natively scalable way is very attractive.
> >
> >
> > I have investigated the implementation of the DL4J integration with Apache
> > Spark, and have several points:
> >
> > 1) It seems that the idea of building our own implementation of word2vec is
> > not such a bad solution, because the integration of DL4J with Spark is too
> > strongly coupled with the Spark API and it would take time on the DL4J side
> > to adapt this integration to Flink. Also, I had expected that we would be
> > able to just call some API, but that is not the case.
> > 2)
> >
> > https://deeplearning4j.org/use_cases
> > https://www.analyticsvidhya.com/blog/2017/01/t-sne-implementation-r-python/
> >
> >
> > On Thu, Jan 19, 2017 at 13:29, Till Rohrmann <trohrm...@apache.org>:
> >
> > Hi Katherin,
> >
> > welcome to the Flink community. Always great to see new people joining
> > the community :-)
> >
> > Cheers,
> > Till
> >
> > On Tue, Jan 17, 2017 at 1:02 PM, Katherin Sotenko <
> katherinm...@gmail.com>
> > wrote:
> >
> > > ok, I've got it.
> > > I will take a look at  https://github.com/apache/flink/pull/2735.
> > >
> > > On Tue, Jan 17, 2017 at 14:36, Theodore Vasiloudis <
> > > theodoros.vasilou...@gmail.com>:
> > >
> > > > Hello Katherin,
> > > >
> > > > Welcome to the Flink community!
> > > >
> > > > You are correct, the ML component definitely needs a lot of work; we
> > > > are facing similar problems to CEP, which we'll hopefully resolve with
> > > > the restructuring Stephan has mentioned in that thread.
> > > >
> > > > If you'd like to help out with PRs, we have many open; one I have
> > > > started reviewing but got side-tracked on is the Word2Vec one [1].
> > > >
> > > > Best,
> > > > Theodore
> > > >
> > > > [1] https://github.com/apache/flink/pull/2735
> > > >
> > > > On Tue, Jan 17, 2017 at 12:17 PM, Fabian Hueske <fhue...@gmail.com>
> > > wrote:
> > > >
> > > > > Hi Katherin,
> > > > >
> > > > > welcome to the Flink community!
> > > > > Help with reviewing PRs is always very welcome and a great way to
> > > > > contribute.
> > > > >
> > > > > Best, Fabian
> > > > >
> > > > >
> > > > >
> > > > > 2017-01-17 11:17 GMT+01:00 Katherin Sotenko <
> katherinm...@gmail.com
> > >:
> > > > >
> > > > > > Thank you, Timo.
> > > > > > I have started the analysis of the topic.
> > > > > > And if it necessary, 

Re: flink-ml test

2017-01-25 Thread Theodore Vasiloudis
Hello Anton,

I usually run specific local tests through IDEA, or test the whole ML
module (run mvn test in the flink-ml root dir). It should be possible to
run specific tests through Maven [1], but I haven't been able to make this
work.

Which test is failing for you?

[1]
http://maven.apache.org/surefire/maven-surefire-plugin/examples/single-test.html

On Wed, Jan 25, 2017 at 2:24 PM, Anton Solovev 
wrote:

> Hi folks,
>
> I have a failing integration test on Travis in the flink-ml package.
> How can I run a specific IT class on my local machine with console
> log output?
>
> Thanks,
> Anton Solovev
>


Re: [DISCUSS] (Not) tagging reviewers

2017-01-24 Thread Theodore Vasiloudis
I was wondering how this relates to the shepherding of PRs we have
discussed in the past. If I make a PR for an issue reported by a specific
committer, doesn't tagging them make sense?

Has the shepherding of PRs been tried out?

On Tue, Jan 24, 2017 at 12:17 PM, Aljoscha Krettek 
wrote:

> It seems I'm in a bit of a minority here but I like the @R tags. There are
> simply too many pull requests for someone to keep track of all of them, and
> if someone thinks that a certain person would be good for reviewing a change
> then tagging them helps them notice the PR.
>
> I think the tag should not mean that only that person can/should review the
> PR, it should serve as a proposal.
>
> I'm happy to not use it anymore if everyone else doesn't like them.
>
> On Sat, 21 Jan 2017 at 00:53 Fabian Hueske  wrote:
>
> > Hi Haohui,
> >
> > reviewing pull requests is a great way of contributing to the community!
> >
> > I am not aware of specific instructions for the review process. There are
> > some dos and don'ts on our "contribute code" page [1] that should be
> > considered. Apart from that, I think the best way to start is to become
> > familiar with a certain part of the code base (reading code,
> contributing)
> > and then to look out for pull requests that address the part you are
> > familiar with.
> >
> > The review does not have to cover all aspects of a PR (a committer will
> > have a look as well), but from my personal experience the effort to
> review
> > a PR is often much lower if some other person has had a look at it
> already
> > and gave feedback.
> > I think this can help a lot to reduce the review "load" on the
> committers.
> > Maybe you find some contributors who are interested in the same
> components
> > as you and you can start reviewing each others code.
> >
> > Thanks,
> > Fabian
> >
> > [1] http://flink.apache.org/contribute-code.html#coding-guidelines
> >
> >
> > 2017-01-20 23:02 GMT+01:00 jincheng sun :
> >
> > > > I totally agree with all of your ideas.
> > > >
> > > > Best wishes,
> > > >
> > > > SunJincheng.
> > > >
> > > > Stephan Ewen wrote on Mon, Jan 16, 2017 at 19:42:
> > > >
> > > > > Hi!
> > > > >
> > > > > I have seen that recently many pull requests designate reviews by
> > > > > writing "@personA review please" or so.
> > > > >
> > > > > I am personally quite strongly against that, I think it hurts the
> > > > > community work:
> > > > >
> > > > >   - The same few people usually get "designated" and will typically
> > > > > get overloaded and often not do the review.
> > > > >
> > > > >   - At the same time, this discourages other community members from
> > > > > looking at the pull request, which is totally undesirable.
> > > > >
> > > > >   - In general, review participation should be "pull based" (a person
> > > > > decides what they want to work on), not "push based" (a random person
> > > > > pushes work to another person). Push-based just creates the wrong
> > > > > feeling in a community of volunteers.
> > > > >
> > > > >   - In many cases the designated reviewers are not the ones most
> > > > > knowledgeable in the code, which is understandable, because how should
> > > > > contributors know whom to tag?
> > > > >
> > > > > Long story short, why don't we just drop that habit?
> > > > >
> > > > > Greetings,
> > > > > Stephan
> > > > >
> > >
> >
>


Re: New Flink team member - Kate Eri.

2017-01-17 Thread Theodore Vasiloudis
Hello Katherin,

Welcome to the Flink community!

You are correct, the ML component definitely needs a lot of work; we are
facing similar problems to CEP, which we'll hopefully resolve with the
restructuring Stephan has mentioned in that thread.

If you'd like to help out with PRs, we have many open; one I have started
reviewing but got side-tracked on is the Word2Vec one [1].

Best,
Theodore

[1] https://github.com/apache/flink/pull/2735

On Tue, Jan 17, 2017 at 12:17 PM, Fabian Hueske  wrote:

> Hi Katherin,
>
> welcome to the Flink community!
> Help with reviewing PRs is always very welcome and a great way to
> contribute.
>
> Best, Fabian
>
>
>
> 2017-01-17 11:17 GMT+01:00 Katherin Sotenko :
>
> > Thank you, Timo.
> > I have started the analysis of the topic.
> > And if necessary, I will try to perform the review of other pull requests)
> >
> >
> > On Tue, Jan 17, 2017 at 13:09, Timo Walther :
> >
> > > Hi Katherin,
> > >
> > > great to hear that you would like to contribute! Welcome!
> > >
> > > I gave you contributor permissions. You can now assign issues to
> > > yourself. I assigned FLINK-1750 to you.
> > > Right now there are many open ML pull requests, you are very welcome to
> > > review the code of others, too.
> > >
> > > Timo
> > >
> > >
> > > Am 17/01/17 um 10:39 schrieb Katherin Sotenko:
> > > > Hello, All!
> > > >
> > > >
> > > >
> > > > I'm Kate Eri, a Java developer with 6 years of enterprise experience;
> > > > I also have some expertise with Scala (half a year).
> > > >
> > > > For the last 2 years I have participated in several big data projects
> > > > related to Machine Learning (time series analysis, recommender systems,
> > > > social networking) and ETL. I have experience with Hadoop, Apache Spark
> > > > and Hive.
> > > >
> > > > I’m fond of the ML topic, and I see that the Flink project requires some
> > > > work in this area, so I would like to join Flink and ask to be assigned
> > > > the ticket https://issues.apache.org/jira/browse/FLINK-1750.
> > > >
> > >
> > >
> >
>


[jira] [Created] (FLINK-5087) Additional steps needed for the Java quickstart guide

2016-11-17 Thread Theodore Vasiloudis (JIRA)
Theodore Vasiloudis created FLINK-5087:
--

 Summary: Additional steps needed for the Java quickstart guide
 Key: FLINK-5087
 URL: https://issues.apache.org/jira/browse/FLINK-5087
 Project: Flink
  Issue Type: Improvement
  Components: Documentation
Reporter: Theodore Vasiloudis
Priority: Minor


While the quickstart guide indicates that you should be able to just run the
examples from the Maven archetype, that was not the case for me; what I got
instead were ClassNotFound exceptions, because the default run configuration
does not pull in the dependencies as it should.

What I needed to do to get the examples to run from within the IDE was:

1) In the project structure, add a new module, say "mainRunner".
2) In mainRunner's dependencies add the main module (say "quickstart") as a
module dependency.
3) In mainRunner's dependencies add the rest of the project library
dependencies.
4) In the run configuration for the example change the "Use classpath of
module" setting from quickstart to mainRunner.

I think we should either include the instructions in the docs, or if possible 
change the Java quickstart archetype to include the extra module.
The Scala quickstart has the mainRunner module already, and the appropriate 
instructions
on how to run from within IDEA are in the docs.






Re: Flink ML recommender system API

2016-11-10 Thread Theodore Vasiloudis
Hello Gabor,

for this type of issue (design decisions) what we've done in the past with
FlinkML is to open a PR marked with the WIP tag and take the discussion
there, making it easier for people to check out the code and get a feel for
the advantages/disadvantages of different approaches.

Could you do that for this issue?

Regards,
Theodore

On Thu, Nov 10, 2016 at 12:46 PM, Gábor Hermann 
wrote:

> Hi all,
>
> We have managed to fit the ranking recommendation evaluation into the
> evaluation framework proposed by Theodore (FLINK-2157). There's one main
> problem that remains: we have two different predictor traits (Predictor,
> RankingPredictor) without a common superclass, and that might be
> problematic.
>
> Please see the details at the issue:
> https://issues.apache.org/jira/browse/FLINK-4713
>
> Could you give feedback on whether we are moving in the right direction or
> not? Thanks!
>
> Cheers,
> Gabor
>
>


Re: SVMITSuite Testing

2016-10-26 Thread Theodore Vasiloudis
Hello Jesse,

Could you tell us how you try to run the tests?

As Gabor said if you are using IDEA the easiest way to run a specific test
is to open the test file, right click somewhere in the code and select "Run
SVMITSuite"

Regards,
Theodore

On Oct 25, 2016 9:54 PM, "Jesse Bannon" 
wrote:

> Hello,
>
> I am trying to run the SVMITSuite in the Flink-ML library. When I build the
> package it seems to skip all tests in SVMITSuite - I'm assuming it's
> because there's no ExecutionEnvironment set up to use the DataSet API.
>
> I can't seem to find any documentation on how to run this either. Any help
> would be appreciated.
>
> Thanks in advance,
> ~ Jesse
>


[jira] [Created] (FLINK-4908) Add docs about evaluate operation to all predictors

2016-10-25 Thread Theodore Vasiloudis (JIRA)
Theodore Vasiloudis created FLINK-4908:
--

 Summary: Add docs about evaluate operation to all predictors
 Key: FLINK-4908
 URL: https://issues.apache.org/jira/browse/FLINK-4908
 Project: Flink
  Issue Type: Improvement
  Components: Machine Learning Library
Reporter: Theodore Vasiloudis
Priority: Minor


We have added an {{evaluate}} operation to the {{Predictor}} class, making it 
available to all Predictors (SVM, MLR), however their documentation only 
mentions the {{predict}} and {{fit}} operations.

We should properly document all the available operations for each Predictor.





Re: Implicit class RichExecutionEnvironment - Can't use MlUtils.readLibSVM(path) in QUickStart guide

2016-10-21 Thread Theodore Vasiloudis
I've copy-pasted your code into an example and it compiles fine. Are you sure
your project imports are done correctly?

Here's the sbt file I'm using:

resolvers in ThisBuild ++= Seq(
  "Apache Development Snapshot Repository"
    at "https://repository.apache.org/content/repositories/snapshots/",
  Resolver.mavenLocal)

name := "Flink Project"

version := "0.1-SNAPSHOT"

organization := "org.example"

scalaVersion in ThisBuild := "2.11.7"

val flinkVersion = "1.1.0"

val flinkDependencies = Seq(
  "org.apache.flink" %% "flink-scala" % flinkVersion % "provided",
  "org.apache.flink" %% "flink-streaming-scala" % flinkVersion % "provided",
  "org.apache.flink" %% "flink-ml" % flinkVersion % "provided")

lazy val root = (project in file(".")).
  settings(
    libraryDependencies ++= flinkDependencies
  )

It comes from Till's Flink Quickstart project
<https://github.com/tillrohrmann/flink-project>.

The RichExecutionEnvironment comes from the org.apache.flink.ml._ wildcard
import.
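A minimal, self-contained sketch of using the pimped environment once both
wildcard imports are in scope (illustration only, with the data path assumed
from the quickstart guide; not necessarily the exact fix that went into the docs):

import org.apache.flink.api.scala._
import org.apache.flink.ml._
import org.apache.flink.ml.common.LabeledVector

object ReadLibSVMExample {
  def main(args: Array[String]): Unit = {
    val env = ExecutionEnvironment.getExecutionEnvironment
    // readLibSVM on the environment comes from the implicit
    // RichExecutionEnvironment in the org.apache.flink.ml package object
    val train: DataSet[LabeledVector] = env.readLibSVM("src/main/resources/svmguide1")
    train.first(5).print()
  }
}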


On Thu, Oct 20, 2016 at 7:07 PM, Thomas FOURNIER <
thomasfournier...@gmail.com> wrote:

> Yep I've done it: import org.apache.flink.api.scala._
>
> I had reported this issue but still have the same problem.
>
> My code is the following (with imports)
>
> import org.apache.flink.api.scala._
> import org.apache.flink.ml._
>
> import org.apache.flink.ml.classification.SVM
> import org.apache.flink.ml.common.LabeledVector
> import org.apache.flink.ml.math.DenseVector
> import org.apache.flink.ml.math.Vector
>
> object App {
>
>   def main(args: Array[String]) {
>
> val env = ExecutionEnvironment.getExecutionEnvironment
> val survival = env.readCsvFile[(String, String, String,
> String)]("src/main/resources/haberman.data", ",")
>
>
> val survivalLV = survival
>   .map { tuple =>
> val list = tuple.productIterator.toList
> val numList = list.map(_.asInstanceOf[String].toDouble)
> LabeledVector(numList(3), DenseVector(numList.take(3).toArray))
>   }
>
>
>
> val astroTrain: DataSet[LabeledVector] =
> MLUtils.readLibSVM(env,"src/main/resources/svmguide1")
>
> val astroTest: DataSet[(Vector, Double)] = MLUtils
>   .readLibSVM(env, "src/main/resources/svmguide1.t")
>   .map(l => (l.vector, l.label))
>
> val svm = SVM()
>   .setBlocks(env.getParallelism)
>   .setIterations(100)
>   .setRegularization(0.001)
>   .setStepsize(0.1)
>   .setSeed(42)
>
> svm.fit(astroTrain)
> println(svm.toString)
>
>
> val predictionPairs = svm.evaluate(astroTest)
> predictionPairs.print()
>
>   }
> }
>
>
>
> And I can't write:
>
> MLUtils.readLibSVM("src/main/resources/svmguide1")
>
>
>
>
>
>
>
> 2016-10-20 16:26 GMT+02:00 Theodore Vasiloudis <
> theodoros.vasilou...@gmail.com>:
>
> > This has to do with not doing a wildcard import of the Scala api, it was
> > reported and already fixed on master [1]
> >
> > [1]
> > http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/jira-Created-FLINK-4792-Update-documentation-QuickStart-FlinkML-td13936.html
> >
> > --
> > Sent from a mobile device. May contain autocorrect errors.
> >
> > On Oct 20, 2016 2:06 PM, "Thomas FOURNIER" <thomasfournier...@gmail.com>
> > wrote:
> >
> > > Hello,
> > >
> > > Following QuickStart guide in FlinkML, I have to do the following:
> > >
> > > val astroTrain:DataSet[LabeledVector] = MLUtils.readLibSVM(env,
> > > "src/main/resources/svmguide1")
> > >
> > > Instead of:
> > >
> > > val astroTrain:DataSet[LabeledVector] = MLUtils.readLibSVM(
> > > "src/main/resources/svmguide1")
> > >
> > >
> > > Nonetheless, this implicit class in ml/packages
> > >
> > > implicit class RichExecutionEnvironment(executionEnvironment:
> > > ExecutionEnvironment) {
> > >   def readLibSVM(path: String): DataSet[LabeledVector] = {
> > > MLUtils.readLibSVM(executionEnvironment, path)
> > >   }
> > > }
> > >
> > >
> > > is supposed to pimp MLUtils in the way we want.
> > >
> > > Does it mean that RichExecutionEnvironment is not imported in the
> scope ?
> > > What can be done to solve this ?
> > >
> > >
> > > Thanks
> > >
> > > Regards
> > > Thomas
> > >
> >
>


Re: Implicit class RichExecutionEnvironment - Can't use MlUtils.readLibSVM(path) in QUickStart guide

2016-10-20 Thread Theodore Vasiloudis
This has to do with not doing a wildcard import of the Scala API; it was
reported and already fixed on master [1].

[1]
http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/jira-Created-FLINK-4792-Update-documentation-QuickStart-FlinkML-td13936.html

-- 
Sent from a mobile device. May contain autocorrect errors.

On Oct 20, 2016 2:06 PM, "Thomas FOURNIER" 
wrote:

> Hello,
>
> Following QuickStart guide in FlinkML, I have to do the following:
>
> val astroTrain:DataSet[LabeledVector] = MLUtils.readLibSVM(env,
> "src/main/resources/svmguide1")
>
> Instead of:
>
> val astroTrain:DataSet[LabeledVector] = MLUtils.readLibSVM(
> "src/main/resources/svmguide1")
>
>
> Nonetheless, this implicit class in ml/packages
>
> implicit class RichExecutionEnvironment(executionEnvironment:
> ExecutionEnvironment) {
>   def readLibSVM(path: String): DataSet[LabeledVector] = {
> MLUtils.readLibSVM(executionEnvironment, path)
>   }
> }
>
>
> is supposed to pimp MLUtils in the way we want.
>
> Does it mean that RichExecutionEnvironment is not imported in the scope ?
> What can be done to solve this ?
>
>
> Thanks
>
> Regards
> Thomas
>


Re: FlinkML - Evaluate function should manage LabeledVector

2016-10-20 Thread Theodore Vasiloudis
I think this might be problematic with the current way we define the
predict operations because they require that both the Testing and
PredictionValue types are available.

Here's what I had to do to get it to work (in ml/pipeline/Predictor.scala):

import org.apache.flink.ml.math.{Vector => FlinkVector}
implicit def labeledVectorEvaluateDataSetOperation[
Instance <: Estimator[Instance],
Model,
FlinkVector,
Double](
implicit predictOperation: PredictOperation[Instance, Model,
FlinkVector, Double],
  testingTypeInformation: TypeInformation[FlinkVector],
  predictionValueTypeInformation: TypeInformation[Double])
: EvaluateDataSetOperation[Instance, LabeledVector, Double] = {
  new EvaluateDataSetOperation[Instance, LabeledVector, Double] {
override def evaluateDataSet(
  instance: Instance,
  evaluateParameters: ParameterMap,
  testing: DataSet[LabeledVector])
: DataSet[(Double,  Double)] = {
  val resultingParameters = instance.parameters ++ evaluateParameters
  val model = predictOperation.getModel(instance, resultingParameters)

  implicit val resultTypeInformation =
createTypeInformation[(FlinkVector, Double)]

  testing.mapWithBcVariable(model){
(element, model) => {
  (element.label.asInstanceOf[Double],
predictOperation.predict(element.vector.asInstanceOf[FlinkVector],
model))
}
  }
}
  }
}

I'm not a fan of casting objects, but the compiler complains here otherwise.

Maybe someone has some input as to why the casting is necessary here, given
that the underlying types are correct? Probably has to do with some type
erasure I'm not seeing here.
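A likely explanation, though not confirmed in this thread: the type parameters
named FlinkVector and Double in the signature above shadow the concrete types,
so inside the method they are unrelated abstract types and only the casts
satisfy the compiler. A sketch of the same operation without the shadowing,
keeping only Instance and Model generic (assuming the pipeline traits behave
as they are used above):

import org.apache.flink.api.scala._
import org.apache.flink.ml._
import org.apache.flink.ml.common.{LabeledVector, ParameterMap}
import org.apache.flink.ml.math.{Vector => FlinkVector}
import org.apache.flink.ml.pipeline.{Estimator, EvaluateDataSetOperation, PredictOperation}

object LabeledVectorEvaluation {
  // The testing type is pinned to the concrete FlinkVector and the prediction
  // type to the concrete Double, so no casts are needed in the map function.
  implicit def labeledVectorEvaluateDataSetOperation[
      Instance <: Estimator[Instance],
      Model](
      implicit predictOperation: PredictOperation[Instance, Model, FlinkVector, Double])
    : EvaluateDataSetOperation[Instance, LabeledVector, Double] = {
    new EvaluateDataSetOperation[Instance, LabeledVector, Double] {
      override def evaluateDataSet(
          instance: Instance,
          evaluateParameters: ParameterMap,
          testing: DataSet[LabeledVector]): DataSet[(Double, Double)] = {
        val resultingParameters = instance.parameters ++ evaluateParameters
        val model = predictOperation.getModel(instance, resultingParameters)

        testing.mapWithBcVariable(model) { (element, model) =>
          (element.label, predictOperation.predict(element.vector, model))
        }
      }
    }
  }
}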

--Theo

On Wed, Oct 19, 2016 at 10:30 PM, Thomas FOURNIER <
thomasfournier...@gmail.com> wrote:

> Hi,
>
> Two questions:
>
> 1- I was thinking of doing this:
>
> implicit def evaluateLabeledVector[T <: LabeledVector] = {
>
>   new EvaluateDataSetOperation[SVM,T,Double]() {
>
> override def evaluateDataSet(instance: SVM, evaluateParameters:
> ParameterMap, testing: DataSet[T]): DataSet[(Double, Double)] = {
>   val predictor = ...
>   testing.map(l => (l.label, predictor.predict(l.vector)))
>
> }
>   }
> }
>
> How can I access to my predictor object (predictor has type
> PredictOperation[SVM, DenseVector, T, Double]) ?
>
> 2- My first idea was to develop a predictOperation[T <: LabeledVector]
> so that I could use implicit def defaultEvaluateDatasetOperation
>
> to get an EvaluateDataSetOperationObject. Is it also valid or not ?
>
> Thanks
> Regards
>
> Thomas
>
>
>
>
>
>
>
> 2016-10-19 16:26 GMT+02:00 Theodore Vasiloudis <
> theodoros.vasilou...@gmail.com>:
>
> > Hello Thomas,
> >
> > since you are calling evaluate here, you should be creating an
> > EvaluateDataSet operation that works with LabeledVector, I see you are
> > creating a new PredictOperation.
> >
> > On Wed, Oct 19, 2016 at 3:05 PM, Thomas FOURNIER <
> > thomasfournier...@gmail.com> wrote:
> >
> > > Hi,
> > >
> > > I'd like to improve SVM evaluate function so that it can use
> > LabeledVector
> > > (and not only Vector).
> > > Indeed, what is done in test is the following (data is a
> > > DataSet[LabeledVector]):
> > >
> > > val test = data.map(l => (l.vector, l.label))
> > > svm.evaluate(test)
> > >
> > > We would like to do:
> > > svm.evaluate(data)
> > >
> > >
> > > Adding this "new" code:
> > >
> > > implicit def predictLabeledPoint[T <: LabeledVector] = {
> > >  new PredictOperation  ...
> > > }
> > >
> > > gives me a predictOperation that should be used with
> > > defaultEvaluateDataSetOperation
> > > with the correct signature (ie with T <: LabeledVector and not T<:
> > Vector).
> > >
> > > Nonetheless, tests are failing:
> > >
> > >
> > > it should "predict with LabeledDataPoint" in {
> > >
> > >   val env = ExecutionEnvironment.getExecutionEnvironment
> > >
> > >   val svm = SVM().
> > > setBlocks(env.getParallelism).
> > > setIterations(100).
> > > setLocalIterations(100).
> > > setRegularization(0.002).
> > > setStepsize(0.1).
> > > setSeed(0)
> > >
> > >   val trainingDS = env.fromCollection(Classification.trainingData)
> > >   svm.fit(trainingDS)
> > >   val predictionPairs = svm.evaluate(trainingDS)
> > >
> > >   
> > > }
> > >
> > > There is no PredictOperation defined for
> > > org.apache.flink.ml.classification.SVM which takes a
> > > DataSet[org.apache.flink.ml.common.LabeledVector] as input.
> > > java.lang.RuntimeException: There is no PredictOperation defined for
> > > org.apache.flink.ml.classification.SVM which takes a
> > > DataSet[org.apache.flink.ml.common.LabeledVector] as input.
> > >
> > >
> > >
> > > Thanks
> > >
> > > Regards
> > > Thomas
> > >
> >
>


Re: FlinkML - Evaluate function should manage LabeledVector

2016-10-19 Thread Theodore Vasiloudis
Hello Thomas,

since you are calling evaluate here, you should be creating an
EvaluateDataSetOperation that works with LabeledVector; I see you are
creating a new PredictOperation instead.

On Wed, Oct 19, 2016 at 3:05 PM, Thomas FOURNIER <
thomasfournier...@gmail.com> wrote:

> Hi,
>
> I'd like to improve SVM evaluate function so that it can use LabeledVector
> (and not only Vector).
> Indeed, what is done in test is the following (data is a
> DataSet[LabeledVector]):
>
> val test = data.map(l => (l.vector, l.label))
> svm.evaluate(test)
>
> We would like to do:
> svm.evaluate(data)
>
>
> Adding this "new" code:
>
> implicit def predictLabeledPoint[T <: LabeledVector] = {
>  new PredictOperation  ...
> }
>
> gives me a predictOperation that should be used with
> defaultEvaluateDataSetOperation
> with the correct signature (ie with T <: LabeledVector and not T<: Vector).
>
> Nonetheless, tests are failing:
>
>
> it should "predict with LabeledDataPoint" in {
>
>   val env = ExecutionEnvironment.getExecutionEnvironment
>
>   val svm = SVM().
> setBlocks(env.getParallelism).
> setIterations(100).
> setLocalIterations(100).
> setRegularization(0.002).
> setStepsize(0.1).
> setSeed(0)
>
>   val trainingDS = env.fromCollection(Classification.trainingData)
>   svm.fit(trainingDS)
>   val predictionPairs = svm.evaluate(trainingDS)
>
>   
> }
>
> There is no PredictOperation defined for
> org.apache.flink.ml.classification.SVM which takes a
> DataSet[org.apache.flink.ml.common.LabeledVector] as input.
> java.lang.RuntimeException: There is no PredictOperation defined for
> org.apache.flink.ml.classification.SVM which takes a
> DataSet[org.apache.flink.ml.common.LabeledVector] as input.
>
>
>
> Thanks
>
> Regards
> Thomas
>


Re: Flink ML recommender system API

2016-10-04 Thread Theodore Vasiloudis
Hello all,

Thanks for starting this discussion Gabor, you bring up a lot of interesting
points.

In terms of the evaluation framework I would also favor reworking it in
order to support recommendation models. We can either merge the current
PR and use it as a basis, or open a new one.

For the form of the evaluation I'm also in favor of the flat form that
allows us to have a scalable system.

For the development of DSGD I would recommend taking into consideration the
current limitations of SGD in FlinkML in terms of sampling within
iterations. Due to the nature of iterations at the time we developed SGD, it
was not possible to select samples from the dataset while iterating over
the weights DataSet. That means that the SGD implementation in FlinkML is
actually simple GD. I can recommend reviewing the discussion in [2] (the
issue is at [3]) and perhaps opening a new thread, as that is a major issue
on its own and a blocker for many ML algorithms which modify more than one
DataSet per iteration.
Would that be an issue for DSGD as well?

Finally regarding the decision of having a model as a separate object from
the algorithm, just to add another perspective to what Till said, there are
also implications in terms of model export/import (esp. with distributed
models), in which case it's probably made easier by decoupling training
from the object.

This is  essentially a design decision with pros and cons on both sides.
Our decision was mostly influenced from the considerations that Till
already mentioned about pipelines. The same approach was taken in
scikit-learn, and I encourage you to read their thinking in page 5 of [1],
a document we generally kept in mind when developing FlinkML.

That being said I'm open to re-evaluating this decision, the issue at hand
is, as Till mentioned, that the pipelining mechanism would probably have to
be re-worked in order to allow for this change.

[1] "API design for machine learning software: experiences from the
scikit-learn project" https://arxiv.org/abs/1309.0238
[2] https://issues.apache.org/jira/browse/FLINK-1807
[3] https://issues.apache.org/jira/browse/FLINK-2396

On Tue, Oct 4, 2016 at 2:04 PM, Till Rohrmann  wrote:

> Hi Gabor,
>
> thanks for getting involved in Flink's ML library. Always good to have
> people working on it :-)
>
> Some thoughts concerning the points you've raised inline:
>
> On Tue, Oct 4, 2016 at 12:47 PM, Gábor Hermann 
> wrote:
>
>> Hey all,
>>
>> We've been working on improvements for the recommendation in Flink ML,
>> and some API design questions have come up. Our plans in short:
>>
>> - Extend ALS to work on implicit feedback datasets [1]
>> - DSGD implementation for matrix factorization [2]
>>
> Have you looked at GradientDescent? Maybe this is already doing what you want
> to implement or can be adapted to do it.
>
>
>> - Ranking prediction based on a matrix factorization model [3]
>
> - Evaluations for recommenders (precision, recall, nDCG) [4]
>>
>>
>> First, we've seen that an evaluation framework has been implemented (in a
>> not yet merged PR [5]), but evaluations of recommenders would not fit into
>> this framework. This is basically because recommender evaluations, instead
>> of comparing real numbers or fixed-size vectors, compare top lists of
>> possibly different, arbitrarily large sizes. The details are described in
>> FLINK-4713 [4]. I see three possible solutions for this:
>>
>> - we either rework the evaluation framework proposed in [5] to allow
>> inputs suitable for recommender evaluations
>> - or fit the recommender evaluations in the framework in a kind of
>> unnatural form with possible bad performance implications
>> - or do not fit recommender evaluations in the framework at all
>>
>> I would prefer reworking the evaluation framework, but it's up to
>> discussion. It also depends on whether the PR will be merged soon or not.
>> Theodore, what are your thoughts on this as the author of the eval
>> framework?
>>
> It would be great if the evaluation framework in the PR could be adapted
> to also support evaluating recommenders, if there is no fundamental reason
> not to do it. But my gut feeling is that it should not be impossible.
>
>
>>
>> Second, picking the form of evaluation also affects how we should give
>> the ranking prediction. We could choose a flat form (i.e.
>> DataSet[(Int,Int,Int)]) or represent the rankings in an array (i.e.
>> DataSet[(Int,Array[Int])]). See details in [4]. The flat form would allow
>> the system to work distributedly, so I'd go with that representation, but
>> it's also up to discussion.
>>
> It would be great to keep scalability in mind. Thus, I would go with the
> more scalable version.
>
>
>>
>>
>> Last, ALS and DSGD are two different algorithms for training the same
>> matrix factorization model, but in the current API this is not really
>> visible to the user. Training an ALS model modifies the ALS object and puts
>> a matrix factorization model in it. We 

Re: ML contributions

2016-09-15 Thread Theodore Vasiloudis
That's great to hear Gabor, I'll definitely help out with the review
process, and I hope we can get some committer to look into these and other
outstanding PRs for FlinkML.

On Thu, Sep 15, 2016 at 11:59 AM, Till Rohrmann 
wrote:

> Great to hear Gabor :-) I hope that the community will help out with
> reviewing of the algorithms so that we can give quick feedback. Looking
> forward to your contributions.
>
> Cheers,
> Till
>
> On Mon, Sep 12, 2016 at 10:34 AM, Gábor Hermann 
> wrote:
>
> > Hey all,
> >
> > We are planning to contribute some algorithms and improvements to Flink ML
> > at SZTAKI.
> > I have already opened a JIRA <https://issues.apache.org/jira/browse/FLINK-4613>
> > for an implicit feedback ALS, but probably more will come soon.
> >
> > We are implementing algorithms anyway, and it would be nice if we could
> > incorporate them into Flink,
> > so I just wanted to let you all know about our efforts.
> >
> > Cheers,
> > Gabor
> >
>


Re: N-ary stream operators - status

2016-08-10 Thread Theodore Vasiloudis
Hello Aljoscha,

Do you think the side inputs might make it to 1.2?

On Aug 10, 2016 2:37 AM, "Aljoscha Krettek"  wrote:

> Hi,
> I thought about this while thinking about how to add side inputs to Flink,
> as mentioned in the doc. Right now we're focusing on getting a bunch of
> other features ready for Flink 1.2 so it will probably still take a while
> until someone can start actively working on this again.
>
> There is this branch where I have an initial version of the n-ary operator
> and side inputs running:
> https://github.com/aljoscha/flink/tree/operator-ng-side-input-wrapper
>
> Cheers,
> Aljoscha
>
> On Tue, 9 Aug 2016 at 17:54 Gábor Gévay  wrote:
>
> > Hello,
> >
> > There is this Google Doc about adding n-ary stream operators to Flink:
> >
> > https://docs.google.com/document/d/1ZFzL_0xGuUEnBsFyEiHwWcmCcjhd9ArWsmhrgnt05RI/edit
> >
> > I would like to ask what are the plans for when will this feature be
> > available?
> >
> > Best,
> > Gábor
> >
>


Re: Conceptual difference Windows and DataSet

2016-08-06 Thread Theodore Vasiloudis
Hello Kevin,

I'm not very familiar with the stream API, but I think you can achieve what
you want by mapping over your elements to turn the strings into one-item
lists, so that you get a key-value pair of the form (K: String, V:
(List[String], Int)), and then apply the window reduce function, which
produces a data stream out of a windowed stream; there you combine your
lists and sum the value. Again, it's not a great way to use reduce, since
you are growing the list with each reduction.
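A rough sketch of that approach, assuming a 5-second window and the toy tuples
from the quoted DataSet example (a flat 3-tuple stands in for the
(key, (list, count)) form described above):

import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.windowing.time.Time

object GroupAndCollectSketch {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment

    // Same toy records as the quoted DataSet example: (group key, message, count)
    val input = env.fromElements(("a-b", "data1", 1), ("a-c", "data2", 1), ("a-b", "data3", 1))

    val result = input
      .map(t => (t._1, List(t._2), t._3))                  // turn each message into a one-item list
      .keyBy(0)                                            // 1. group by the key field
      .timeWindow(Time.seconds(5))                         // 2. window the elements
      .reduce((a, b) => (a._1, a._2 ++ b._2, a._3 + b._3)) // 3. concatenate lists, sum the counts

    result.print()
    env.execute("group-and-collect sketch")
  }
}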

Regards,
Theodore

On Thu, Aug 4, 2016 at 1:36 AM, Kevin Jacobs  wrote:

> Hi,
>
> I have the following use case:
>
> 1. Group by a specific field.
>
> 2. Get a list of all messages belonging to the group.
>
> 3. Count the number of records in the group.
>
> With the use of DataSets, it is fairly easy to do this (see
> http://stackoverflow.com/questions/38745446/apache-flink-sum-and-keep-grouped/38747685#38747685):
>
> fromElements(("a-b", "data1", 1), ("a-c", "data2", 1), ("a-b", "data3", 1))
>   .groupBy(0)
>   .reduceGroup { (it: Iterator[(String, String, Int)], out: Collector[(String, List[String], Int)]) =>
>     val group = it.toList
>     if (group.length > 0) out.collect((group(0)._1, group.map(_._2), group.map(_._3).sum))
>   }
>
> So, now I am moving to DataStreams (since the input is really a
> DataStream). From my perspective, a Window should provide the same
> functionality as a DataSet. This would simplify the process a lot:
>
> 1. Window the elements.
>
> 2. Apply the same operations as before.
>
> Is there a way in Flink to do so? Otherwise, I would like to think of a
> solution to this problem.
>
> Regards,
> Kevin
>


Re: [DISCUSS] Move JIRA creation emails to separate list?

2016-07-15 Thread Theodore Vasiloudis
@Stephan, I'm not subscribed to issues@. The dev list itself includes JIRA
creation emails (check out the archives
<https://mail-archives.apache.org/mod_mbox/flink-dev/201607.mbox/browser>)

@Aljoscha. It was mostly a concern for new users, who will get inundated
with JIRA emails until they set up a filter. I don't see how removing a set
of emails (and moving them to issues@) would be a problem for any filters
people have set up. Anything to do with JIRA issues should go to the issues
list, makes sense no?

On Fri, Jul 15, 2016 at 8:30 AM, Stephan Ewen <se...@apache.org> wrote:

> Most of the JIRA noise should be on the "iss...@flink.apache.org" list.
> Make sure you are not subscribed there if you do not want that noise.
>
>
>
>
> On Fri, Jul 15, 2016 at 9:22 AM, Aljoscha Krettek <aljos...@apache.org>
> wrote:
>
> > I'm afraid most people are ignoring this because they already have
> filters
> > set up and don't see it as a problem. If we now change the way the lists
> > are set up this could potentially break the personal mail setup of some
> > contributors.
> >
> > On Thu, 14 Jul 2016 at 19:56 Theodore Vasiloudis <
> > theodoros.vasilou...@gmail.com> wrote:
> >
> > > Hello all,
> > >
> > > I'm not sure if this has been discussed before (or if this is an option
> > > when joining the list, in which case ignore this),
> > > but I would like to suggest moving all the issue creation emails to a
> > > list separate from dev.
> > >
> > > Right now if I come back to the list after say a week most of the
> actual
> > > dev
> > > discussions are pushed out from all the created JIRA emails.
> > >
> > > I know I can filter these out myself, but I do feel like they belong
> to a
> > > different list, this way we can focus on the discussion about
> > > improvements etc. without all the JIRA noise.
> > >
> > > Regards,
> > > Theodore
> > >
> >
>


[DISCUSS] Move JIRA creation emails to separate list?

2016-07-14 Thread Theodore Vasiloudis
Hello all,

I'm not sure if this has been discussed before (or if this is an option
when joining the list, in which case ignore this),
but I would like to suggest moving all the issue creation emails to a list
separate from dev.

Right now, if I come back to the list after say a week, most of the actual dev
discussions are pushed out by all the created JIRA emails.

I know I can filter these out myself, but I do feel like they belong to a
different list; this way we can focus on the discussion about
improvements etc. without all the JIRA noise.

Regards,
Theodore


Re: [PROPOSAL] Structure the Flink Open Source Development

2016-05-16 Thread Theodore Vasiloudis
I like the idea of having maintainers as well, hopefully we can streamline
the reviewing process.

I of course can volunteer for the FlinkML component.
As I've mentioned before I'd love to get one more committer willing to
review PRs in FlinkML; by my last count we were up to ~20 open ML-related
PRs.

Regards,
Theodore

On Mon, May 16, 2016 at 2:17 AM, Henry Saputra 
wrote:

> The maintainers concept is a good idea to make sure PRs are moved smoothly.
>
> But, we need to make sure that this is not additional hierarchy on top of
> Flink PMCs.
> This will keep us in spirit of ASF community over code.
>
> Please do add me as cluster management maintainer member.
>
> - Henry
>
> On Tuesday, May 10, 2016, Stephan Ewen  wrote:
>
> > Hi everyone!
> >
> > We propose to establish some lightweight structures in the Flink open
> > source community and development process,
> > to help us better handle the increased interest in Flink (mailing list
> and
> > pull requests), while not overwhelming the
> > committers, and giving users and contributors a good experience.
> >
> > This proposal is triggered by the observation that we are reaching the
> > limits of where the current community can support
> > users and guide new contributors. The below proposal is based on
> > observations and ideas from Till, Robert, and me.
> >
> > 
> > Goals
> > 
> >
> > We try to achieve the following
> >
> >   - Pull requests get handled in a timely fashion
> >   - New contributors are better integrated into the community
> >   - The community feels empowered on the mailing list.
> > But questions that need the attention of someone that has deep
> > knowledge of a certain part of Flink get their attention.
> >   - At the same time, the committers that are knowledgeable about many
> core
> > parts do not get completely overwhelmed.
> >   - We don't overlook threads that report critical issues.
> >   - We always have a pretty good overview of what the status of certain
> > parts of the system are.
> >   -> What are often encountered known issues
> >   -> What are the most frequently requested features
> >
> >
> > 
> > Problems
> > 
> >
> > Looking into the process, there are two big issues:
> >
> > (1) Up to now, we have been relying on the fact that everything just
> > "organizes itself", driven by best effort. That assumes
> > that everyone feels equally responsible for every part, question, and
> > contribution. At the current state, this is impossible
> > to maintain, it overwhelms the committers and contributors.
> >
> > Example: Pull requests are picked up by whoever wants to pick them up.
> Pull
> > requests that are a lot of work, have little
> > chance of getting in, or relate to less active components are sometimes
> not
> > picked up. When contributors are pretty
> > loaded already, it may happen that no one eventually feels responsible to
> > pick up a pull request, and it falls through the cracks.
> >
> > (2) There is no good overview of what are known shortcomings, efforts,
> and
> > requested features for different parts of the system.
> > This information exists in various peoples' heads, but is not easily
> > accessible for new people. The Flink JIRA is not well
> > maintained, it is not easy to draw insights from that.
> >
> >
> > ===
> > The Proposal
> > ===
> >
> > Since we are building a parallel system, the natural solution seems to
> be:
> > partition the workload ;-)
> >
> > We propose to define a set of components for Flink. Each component is
> > maintained or tracked by one or more
> > people - let's call them maintainers. It is important to note that we
> don't
> > suggest the maintainers as an authoritative role, but
> > simply as committers or contributors that visibly step up for a certain
> > component, and mainly track and drive the efforts
> > pertaining to that component.
> >
> > It is also important to realize that we do not want to suggest that
> people
> > get less involved with certain parts and components, because
> > they are not the maintainers. We simply want to make sure that each pull
> > request or question or contribution has in the end
> > one person (or a small set of people) responsible for catching and
> tracking
> > it, if it was not worked on by the pro-active
> > community.
> >
> > For some components, having multiple maintainers will be helpful. In that
> > case, one maintainer should be the "chair" or "lead"
> > and make sure that no issue of that component gets lost between the
> > multiple maintainers.
> >
> >
> > A maintainers' role is:
> > -
> >
> >   - Have an overview of which of the open pull requests relate to their
> > component
> >   - Drive the pull requests relating to the component to resolution
> >   => Moderate the decision whether the feature should be merged
> >   => Make sure the pull request gets a shepherd.
> >In many 

Re: A whole bag of ML issues

2016-03-29 Thread Theodore Vasiloudis
> Adding a setOptimizer to IterativeSolver.

Do you mean MLR here? IterativeSolver is implemented by different solvers,
I don't think adding a method like this makes sense there.

In the case of MLR a better alternative that includes a bit more work is to
create a Generalized Linear Model framework that provides
implementations for the most common linear models (ridge, lasso, etc.). I had
already started work on this here
<https://github.com/thvasilo/flink/commits/glm>, but never got around
to opening a PR. The relevant JIRA is here
<https://issues.apache.org/jira/browse/FLINK-2013>. Having a setOptimizer
method in GeneralizedLinearModel (with some restrictions/warnings
regarding choice of optimizer and regularization) would be the preferred
option for me at least.
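A minimal sketch of what such a parameter could look like with FlinkML's
existing parameter machinery; the GeneralizedLinearModel class and the
commented-out setter are assumptions, not existing API:

import org.apache.flink.ml.common.Parameter
import org.apache.flink.ml.optimization.Solver

// Hypothetical parameter holding the solver to use; the default is left empty
// here, but in a real implementation it would be the current SGD-based solver
// to keep backwards compatibility.
case object Optimizer extends Parameter[Solver] {
  val defaultValue: Option[Solver] = None
}

// On the (hypothetical) GeneralizedLinearModel predictor:
// def setOptimizer(solver: Solver): this.type = {
//   parameters.add(Optimizer, solver)
//   this
// }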

Other than that the list looks fine :)

On Tue, Mar 29, 2016 at 9:32 PM, Trevor Grant <trevor.d.gr...@gmail.com>
wrote:

> OK, I'm trying to respond to you and Till in one thread so someone call me
> out if I missed a point but here goes:
>
> SGD Predicting Vectors :  There was discussion in the past regarding this-
> at the time it was decided to go with only Doubles for simplicity. I feel
> strongly that there is cause now for predicting vectors.  This should be a
> separate PR.  I'll open an issue, we can refer to earlier mailing list and
> reopen discussion on best way to proceed
>
> Warm Starts: Basically all that needs to be done here is for the iterative
> solver to keep track of what iteration it is on, and start from that
> iteration if WarmStart == True, then go another N iterations.  I don't
> think savepoints solves this because of the way stepsizes are calculated in
> SGD, though I don't know enough about savepoints to say for sure.  As Till
> said, and I agree, very simple fix.  Use cases: Testing how new features
> (e.g. stepsizes) increase / decrease convergence, e.g. fit a model in 1000
> data point bursts and measure the error, see how it decreases as time goes
> on. Also, model updates. E.g. I have a huge model that gets trained on a
> year of data and takes a day or two to do so, but after that I just want to
> update it nightly with the data from the last 24 hours, or at the extreme-
> online learning, e.g. every new data point updates the model.
>
> Model Grading Metrics:  I'll chime in on the PR you mentioned.
>
> Weight Arrays vs. Weight Vectors: Winding/unwinding arrays of matrices
> into vectors is best done inside the methods that need such functionality,
> which seems to be the consensus. I'm ok with that, as I have such things working
> rather elegantly, but wanted to throw it out there anyway.
>
> BLAS ops for matrices:  I'll take care of this in my code.
>
> adding a 'setOptimizer' parameter to IterativeSolver: Theodore deferred to
> Till, Till said open a PR.  I'll make the default SimpleSGD to maintain
> backwards compatibility
>
> New issues to create:
> [  ] Optimizer to predict vectors or Doubles and maintain backwards
> compatibility.
> [  ] Warm Start Functionality
> [  ] setOptimizer to Iterative Solver, with default to SimpleSGD.
> [  ] Add neuralnets package to FlinkML (Multilayer perceptron is first
> iteration, other flavors to follow).
>
> Let me know if I missed anything.  I'm guessing you guys are done for the
> day so I'll wait until tomorrow night my time (Chicago) before a I move
> ahead on anything, to give you a chance to respond.
>
> Thanks!
> tg
>
>
> Trevor Grant
> Data Scientist
> https://github.com/rawkintrevo
> http://stackexchange.com/users/3002022/rawkintrevo
> http://trevorgrant.org
>
> *"Fortunate is he, who is able to know the causes of things."  -Virgil*
>
>
> On Tue, Mar 29, 2016 at 4:11 AM, Theodore Vasiloudis <
> theodoros.vasilou...@gmail.com> wrote:
>
> > Hello Trevor,
> >
> > These are indeed a lot of issues, let's see if we can fit the discussion
> > for all of them
> > in one thread.
> >
> > I'll add some comments inline.
> >
> > - Expand SGD to allow for predicting vectors instead of just Doubles.
> >
> >
> > We have discussed this in the past and at that point decided that it
> didn't
> > make
> > sense to change the base SGD implementation to accommodate vectors.
> > The alternatives that were presented at the time were to abstract away
> > the type of the input/output in the Optimizer (allowing for both Vectors
> > and Doubles),
> > or to create specialized classes for each case. That also gives us
> greater
> > flexibility
> > in terms of optimizing performance.
> >
> > In terms of the ANN, I think you can hide away the Vectors in the
> > implementation of the ANN
> > model, and use the Optimizer interfa

Re: a typical ML algorithm flow

2016-03-29 Thread Theodore Vasiloudis
> https://ci.apache.org/projects/flink/flink-docs-release-1.0/api/java/org/apache/flink/api/java/operators/IterativeDataSet.html#closeWith%28org.apache.flink.api.java.DataSet,%20org.apache.flink.api.java.DataSet%29
>>>
>>>> [2]: iterateWithTermination method in
>>>>
>>>>
>>> https://ci.apache.org/projects/flink/flink-docs-release-1.0/api/scala/index.html#org.apache.flink.api.scala.DataSet
>>>
>>>> [3]:
>>>>
>>>>
>>> https://ci.apache.org/projects/flink/flink-docs-release-1.0/api/java/org/apache/flink/api/java/operators/IterativeDataSet.html#registerAggregationConvergenceCriterion%28java.lang.String,%20org.apache.flink.api.common.aggregators.Aggregator,%20org.apache.flink.api.common.aggregators.ConvergenceCriterion%29
>>>
>>>> On Mar 26, 2016, at 2:51 AM, Dmitriy Lyubimov <dlie...@gmail.com>
>>>>>
>>>> wrote:
>>>
>>>>> Thank you, all :)
>>>>>
>>>>> yes, that's my question. How do we construct such a loop with a concrete
>>>>> example?
>>>>>
>>>>> Let's take something nonsensical yet specific.
>>>>>
>>>>> Say, in samsara terms we do something like that:
>>>>>
>>>>> var avg = Double.PositiveInfinity
>>>>> var drmA = ... (construct elsewhere)
>>>>>
>>>>> do {
>>>>>   avg = drmA.colMeans.mean // average of col-wise means
>>>>>   drmA = drmA - avg // elementwise subtract of average
>>>>> } while (avg > 1e-10)
>>>>>
>>>>> (which probably does not converge in reality).
>>>>>
>>>>> How would we implement that with native iterations in flink?
>>>>>
>>>>>
>>>>>
>>>>> On Wed, Mar 23, 2016 at 2:50 AM, Till Rohrmann <trohrm...@apache.org> wrote:
>>>>>
>>>>>> Hi Dmitriy,
>>>>>>
>>>>>> I’m not sure whether I’ve understood your question correctly, so please
>>>>>> correct me if I’m wrong.
>>>>>>
>>>>>> So you’re asking whether it is a problem that
>>>>>>
>>>>>> stat1 = A.map.reduce
>>>>>> A = A.update.map(stat1)
>>>>>>
>>>>>> are executed on the same input data set A and whether we have to cache A
>>>>>> for that, right? I assume you’re worried that A is calculated twice.
>>>>>>
>>>>>> Since you don’t have an API call which triggers eager execution of the
>>>>>> data flow, the map.reduce and map(stat1) calls will only construct the
>>>>>> data flow of your program. Both operators will depend on the result of A,
>>>>>> which is only calculated once (when execute, collect or count is called)
>>>>>> and then sent to the map.reduce and map(stat1) operators.
>>>>>>
>>>>>> However, it is not recommended to use an explicit loop to do iterative
>>>>>> computations with Flink. The problem here is that you will basically
>>>>>> unroll the loop and construct a long pipeline with the operations of each
>>>>>> iteration. Once you execute this long pipeline you will face considerable
>>>>>> memory fragmentation, because every operator will get a proportional
>>>>>> fraction of the available memory assigned. Even worse, if you trigger the
>>>>>> execution of your data flow to evaluate the convergence criterion, you
>>>>>> will execute for each iteration the complete pipeline which has been
>>>>>> built up so far. Thus, you’ll end up with a quadratic complexity in the
>>>>>> number of iterations. Therefore, I would highly recommend using Fl
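A minimal sketch of how the toy loop above could be expressed with the DataSet
iterateWithTermination API referenced in [2], using a DataSet[Double] as a
simple stand-in for drmA (illustrative only):

import org.apache.flink.api.scala._

object IterateWithTerminationSketch {
  def main(args: Array[String]): Unit = {
    val env = ExecutionEnvironment.getExecutionEnvironment

    // Toy stand-in for drmA: one Double per "row"
    val initial: DataSet[Double] = env.fromElements(1.0, 2.0, 3.0, 4.0)

    val result = initial.iterateWithTermination(1000) { current =>
      // avg = mean of all elements in the current iteration
      val avg = current
        .map(x => (x, 1L))
        .reduce((a, b) => (a._1 + b._1, a._2 + b._2))
        .map(t => t._1 / t._2)

      // next = current - avg; avg is a single-element DataSet, so crossWithTiny
      // effectively broadcasts it to every element
      val next = current
        .crossWithTiny(avg)
        .map { case (x, m) => x - m }

      // The iteration stops when the termination DataSet becomes empty,
      // i.e. once the average no longer exceeds the threshold
      val termination = avg.filter(m => math.abs(m) > 1e-10)

      (next, termination)
    }

    result.print()
  }
}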

Re: A whole bag of ML issues

2016-03-29 Thread Theodore Vasiloudis
Hello Trevor,

These are indeed a lot of issues, let's see if we can fit the discussion
for all of them
in one thread.

I'll add some comments inline.

- Expand SGD to allow for predicting vectors instead of just Doubles.


We have discussed this in the past and at that point decided that it didn't
make sense to change the base SGD implementation to accommodate vectors.
The alternatives that were presented at the time were to abstract away
the type of the input/output in the Optimizer (allowing for both Vectors
and Doubles), or to create specialized classes for each case. That also
gives us greater flexibility in terms of optimizing performance.

In terms of the ANN, I think you can hide away the Vectors in the
implementation of the ANN model, and use the Optimizer interface as is,
like A. Ulanov did with the Spark ANN implementation.

- Allow for 'warm starts'


I like the idea of having a partiFit-like function, could you present a
couple
of use cases where we might use it? I'm wondering if savepoints already
cover
this functionality.

- A library of model grading metrics.
>

We have a (perpetually) open PR 
for an evaluation framework. Could you
expand on "Having 'calculate RSquare' as a built in method for every
regressor
doesn't seem like an efficient way to do this long term."

-BLAS for matrix ops (this was talked about earlier)


This will be a good addition. If they are specific to the ANN implementation
however I would hide them away from the rest of the code (and include in
that PR
only) until another usecase comes up.

- A neural net has Arrays of matrices of weights (instead of just a vector).
>

Yes this is probably not the most efficient way to do this, but it's the
"least
API breaking" I'm afraid.

- The linear regression implementation currently presumes it will be using
> SGD but I think that should be 'settable' as a parameter
>

The original Optimizer was written the way you described, but we changed it
later IIRC to make it more accessible (e.g. for users that don't know that
you can't match L1 regularization with L-BFGS). Maybe Till can say more
about the other reasons this was changed.


On Mon, Mar 28, 2016 at 8:01 PM, Trevor Grant 
wrote:

> Hey,
>
> I have a working prototype of a multi layer perceptron implementation
> working in Flink.
>
> I made every possible effort to utilize existing code when possible.
>
> In the process of doing this there were some hacks I want/need, and I think
> this should be broken up into multiple PRs and possibly abstracted out as a
> whole, because the MLP implementation I came up with is itself designed to
> be extendable to Long Short Term Memory Networks.
>
> Top level here are some of the sub PRs
>
> - Expand SGD to allow for predicting vectors instead of just Doubles. This
> allows the same NN code (and other algos) to be used for classification,
> transformations, and regressions.
>
> - Allow for 'warm starts' -> this requires adding a parameter to
> IterativeSolver that basically starts on iteration N.  This is somewhat
> akin to the idea of partial fits in sklearn OR making the iterative solver
> have some sort of internal counter and then when you call 'fit' it just
> runs another N iterations (which is set by SetIterations) instead of
> assuming it is back to zero.  This might seem trivial but has significant
> impact on step size calculations.
>
> - A library of model grading metrics. Having 'calculate RSquare' as a built
> in method for every regressor doesn't seem like an efficient way to do this
> long term.
>
> -BLAS for matrix ops (this was talked about earlier)
>
> - A neural net has Arrays of matrices of weights (instead of just a
> vector).  Currently I flatten the array of matrices out into a weight
> vector and reassemble it into an array of matrices, though this is probably
> not super efficient.
>
> - The linear regression implementation currently presumes it will be using
> SGD but I think that should be 'settable' as a parameter, because if not-
> why do we have all of those other nice SGD methods just hanging out?
> Similarly the loss function / partial loss is hard coded.  I recommend
> making the current setup the 'defaults' of a 'setOptimizer' method.  I.e.
> if you want to just run a MLR you can do it based on the examples, but if
> you want to use a fancy optimizer you can create it from existing methods,
> or make your own, then call something like `mlr.setOptimizer( myOptimizer
> )`
>
> - and more
>
> At any rate- if some people could weigh in / direct me how to proceed that
> would be swell.
>
> Thanks!
> tg
>
>
>
>
> Trevor Grant
> Data Scientist
> https://github.com/rawkintrevo
> http://stackexchange.com/users/3002022/rawkintrevo
> http://trevorgrant.org
>
> *"Fortunate is he, who is able to know the causes of things."  -Virgil*
>


Re: a typical ML algorithm flow

2016-03-23 Thread Theodore Vasiloudis
Just realized what I wrote is wrong and probably doesn't apply here.

The problem I described relates to modifying a *secondary* dataset
as you iterate over a primary one.

Taking SGD as an example, you would iterate over a weights dataset,
modifying it using the native Flink iterations that Till talked about.
The problem comes from the fact that we need at every iteration to take
a different sample from *another* dataset (which is our training data),
in a sense modifying it as well at every iteration; *that* is not currently
possible AFAIK.

On Wed, Mar 23, 2016 at 10:50 AM, Till Rohrmann <trohrm...@apache.org>
wrote:

> Hi Dmitriy,
>
> I’m not sure whether I’ve understood your question correctly, so please
> correct me if I’m wrong.
>
> So you’re asking whether it is a problem that
>
> stat1 = A.map.reduce
> A = A.update.map(stat1)
>
> are executed on the same input data set A and whether we have to cache A
> for that, right? I assume you’re worried that A is calculated twice.
>
> Since you don’t have a API call which triggers eager execution of the data
> flow, the map.reduce and map(stat1) call will only construct the data flow
> of your program. Both operators will depend on the result of A which is
> only once calculated (when execute, collect or count is called) and then
> sent to the map.reduce and map(stat1) operator.
>
> However, it is not recommended using an explicit loop to do iterative
> computations with Flink. The problem here is that you will basically unroll
> the loop and construct a long pipeline with the operations of each
> iterations. Once you execute this long pipeline you will face considerable
> memory fragmentation, because every operator will get a proportional
> fraction of the available memory assigned. Even worse, if you trigger the
> execution of your data flow to evaluate the convergence criterion, you will
> execute for each iteration the complete pipeline which has been built up so
> far. Thus, you’ll end up with a quadratic complexity in the number of
> iterations. Therefore, I would highly recommend using Flink’s built in
> support for native iterations which won’t suffer from this problem or to
> materialize at least for every n iterations the intermediate result. At the
> moment this would mean to write the data to some sink and then reading it
> from there again.
>
> I hope this answers your question. If not, then don’t hesitate to ask me
> again.
>
> Cheers,
> Till
> ​
>
> On Wed, Mar 23, 2016 at 10:19 AM, Theodore Vasiloudis <
> theodoros.vasilou...@gmail.com> wrote:
>
> > Hello Dmitriy,
> >
> > If I understood correctly what you are basically talking about modifying
> a
> > DataSet as you iterate over it.
> >
> > AFAIK this is currently not possible in Flink, and indeed it's a real
> > bottleneck for ML algorithms. This is the reason our current
> > SGD implementation does a pass over the whole dataset at each iteration,
> > since we cannot take a sample from the dataset
> > and iterate only over that (so it's not really stochastic).
> >
> > The relevant JIRA is here:
> > https://issues.apache.org/jira/browse/FLINK-2396
> >
> > I would love to start a discussion on how we can proceed to fix this.
> >
> > Regards,
> > Theodore
> >
> > On Tue, Mar 22, 2016 at 9:56 PM, Dmitriy Lyubimov <dlie...@gmail.com>
> > wrote:
> >
> > > Hi,
> > >
> > > probably more of a question for Till:
> > >
> > > Imagine a common ML algorithm flow that runs until convergence.
> > >
> > > typical distributed flow would be something like that (e.g. GMM EM
> would
> > be
> > > exactly like that):
> > >
> > > A: input
> > >
> > > do {
> > >
> > >stat1 = A.map.reduce
> > >A = A.update-map(stat1)
> > >conv = A.map.reduce
> > > } until conv > convThreshold
> > >
> > > There probably could be 1 map-reduce step originating on A to compute
> > both
> > > convergence criteria statistics and udpate statistics in one step. not
> > the
> > > point.
> > >
> > > The point is that update and map.reduce originate on the same dataset
> > > intermittently.
> > >
> > > In spark we would normally commit A to a object tree cache so that data
> > is
> > > available to subsequent map passes without any I/O or serialization
> > > operations, thus insuring high rate of iterations.
> > >
> > > We observe the same pattern pretty much everywhere. clustering,
> > > probabilistic algorithms, even batch gradient descent of quasi newton
> > > algorithms fitting.
> > >
> > > How do we do something like that, for example, in FlinkML?
> > >
> > > Thoughts?
> > >
> > > thanks.
> > >
> > > -Dmitriy
> > >
> >
>
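
To make the quadratic blow-up concrete, here is a toy illustration of the
driver-side loop Till warns about; the update and convergence statistics are
placeholders, and the point is only that the collect() at the end of every
pass re-executes the whole pipeline built up so far:

import org.apache.flink.api.scala._

object ExplicitLoopAntiPattern {
  def main(args: Array[String]): Unit = {
    val env = ExecutionEnvironment.getExecutionEnvironment
    var a: DataSet[Double] = env.fromElements(1.0, 2.0, 3.0, 10.0)

    var conv = Double.PositiveInfinity
    var iteration = 0
    while (conv > 1e-3 && iteration < 10) {
      // stat1 = A.map.reduce (here: the mean of A)
      val stat = a
        .map(x => (x, 1L))
        .reduce((l, r) => (l._1 + r._1, l._2 + r._2))
        .map(t => t._1 / t._2)

      // A = A.update-map(stat1) (here: subtract the mean)
      a = a.crossWithTiny(stat).map { case (x, m) => x - m }

      // conv = A.map.reduce, but collect() triggers execution of the *entire*
      // unrolled pipeline again, so iteration i re-runs all previous steps.
      conv = a.map(x => math.abs(x)).reduce((l, r) => math.max(l, r)).collect().head
      iteration += 1
    }
  }
}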


Re: a typical ML algorithm flow

2016-03-23 Thread Theodore Vasiloudis
Hello Dmitriy,

If I understood correctly what you are basically talking about modifying a
DataSet as you iterate over it.

AFAIK this is currently not possible in Flink, and indeed it's a real
bottleneck for ML algorithms. This is the reason our current
SGD implementation does a pass over the whole dataset at each iteration,
since we cannot take a sample from the dataset
and iterate only over that (so it's not really stochastic).

The relevant JIRA is here: https://issues.apache.org/jira/browse/FLINK-2396

I would love to start a discussion on how we can proceed to fix this.

Regards,
Theodore

On Tue, Mar 22, 2016 at 9:56 PM, Dmitriy Lyubimov  wrote:

> Hi,
>
> probably more of a question for Till:
>
> Imagine a common ML algorithm flow that runs until convergence.
>
> typical distributed flow would be something like that (e.g. GMM EM would be
> exactly like that):
>
> A: input
>
> do {
>
>stat1 = A.map.reduce
>A = A.update-map(stat1)
>conv = A.map.reduce
> } until conv > convThreshold
>
> There probably could be 1 map-reduce step originating on A to compute both
> convergence criteria statistics and udpate statistics in one step. not the
> point.
>
> The point is that update and map.reduce originate on the same dataset
> intermittently.
>
> In spark we would normally commit A to a object tree cache so that data is
> available to subsequent map passes without any I/O or serialization
> operations, thus insuring high rate of iterations.
>
> We observe the same pattern pretty much everywhere. clustering,
> probabilistic algorithms, even batch gradient descent of quasi newton
> algorithms fitting.
>
> How do we do something like that, for example, in FlinkML?
>
> Thoughts?
>
> thanks.
>
> -Dmitriy
>


Re: XGBoost on DataFlow and Flink

2016-03-12 Thread Theodore Vasiloudis
Hello Tianqi,

Yes, that definitely sounds interesting for us and we are looking forward to
helping out with the implementation.

Regards,
Theodore
-- 
Sent from a mobile device. May contain autocorrect errors.
On Mar 12, 2016 11:29 AM, "Simone Robutti" 
wrote:

> This is a really interesting approach. The idea of a ML library over
> DataFlow is probably a winning move and I hope it will stop the
> proliferation of worthless reimplementation that is taking place in the big
> data world. Do you think that DataFlow posed specific problems to your
> work? Is it missing something that you had to fill in with your work?
>
> Here at RadicalBit we are interested both in DataFlow/Apache Beam and in
> distributed ML, and your approach looks the best to us, and I hope more and
> more teams follow your example, maybe integrating existing libraries like
> H2O with DataFlow.
>
> Keep us updated if you plan to develop other algorithms.
>
> 2016-03-11 21:32 GMT+01:00 Tianqi Chen :
>
> > Hi Flink Developers
> > I am sending this email to let you know about XGBoost4J, a package
> that
> > we are planning to announce next week . Here is the draft version of the
> > post
> > https://github.com/dmlc/xgboost/blob/master/doc/jvm/xgboost4j-intro.md
> >
> > In short, XGBoost is a machine learning package that is used by more
> > than half of the machine challenge winning solutions and is already
> widely
> > used in industry. The distributed version scale to billion examples(10x
> > faster than spark.mllib in the experiment) with fewer resources (see
> > http://arxiv.org/abs/1603.02754).
> >
> > We are interested in putting distributed XGBoost into all Dataflow
> > platforms include Flink. This does not mean we re-implement it on Flink.
> > But instead we build a portable API that has a communication library, and
> > being able to run on different DataFlow programs.
> >
> > We hope this can benefit the Flink users, to enable them to get
> access
> > to one of the state-of-art machine learning algorithm. I am sending this
> > email to the mail-list to let you know about it, and hoping to get some
> > contributors to help improving  the XGBoost Flink API to be more
> compatible
> > with current FlinkML stack.  We also hope to get some support from the
> > system side, to enable some abstraction needed in XGBoost for using
> > multiple threads within even one slot for maximum performance.
> >
> >
> > Let us know about your thoughts.
> >
> > Cheers
> >
> > Tianqi
> >
>


Congrats on 1000 stars on Github

2016-02-26 Thread Theodore Vasiloudis
I'm sure others noticed this as well yesterday, but the project has passed
1000 stars on Github,
just in time for the 1.0 release ;)

Here's to the next 1000!

--Theo


Re: Dense matrices in FlinkML

2016-02-19 Thread Theodore Vasiloudis
Yes, what I meant by BLAS calls is to have bindings to the relevant Breeze
or netlib-java implementation.

-- 
Sent from a mobile device. May contain autocorrect errors.
On Feb 19, 2016 3:06 PM, "Trevor Grant" <trevor.d.gr...@gmail.com> wrote:

> That makes sense, a more accurate question would be, why does Vector.scala
> provide BLAS methods (functions, routines, whatever you call them), and
> Matrix doesn't?
>
> I assume they are there for speed (?) so does Matrix need them?
>
>
>
> Trevor Grant
> Data Scientist
> https://github.com/rawkintrevo
> http://stackexchange.com/users/3002022/rawkintrevo
> http://trevorgrant.org
>
> *"Fortunate is he, who is able to know the causes of things."  -Virgil*
>
>
> On Fri, Feb 19, 2016 at 6:29 AM, Till Rohrmann <trohrm...@apache.org>
> wrote:
>
> > The idea was actually to leverage existing linear algebra libraries such
> as
> > breeze instead of building another blas implementation which will never
> be
> > as good as the ones out there.
> >
> > Cheers,
> > Till
> >
> > On Fri, Feb 19, 2016 at 9:48 AM, Theodore Vasiloudis <
> > theodoros.vasilou...@gmail.com> wrote:
> >
> > > Just to note: This should be a separate PR if you plan on contributing
> > > this.
> > >
> > > On Thu, Feb 18, 2016 at 7:54 PM, Márton Balassi <
> > balassi.mar...@gmail.com>
> > > wrote:
> > >
> > > > Hi guys,
> > > >
> > > > They are at least already registered for serialization [1], so there
> > > should
> > > > be no intentional conflict as Theo has suggested.
> > > >
> > > > [1]
> > > >
> > > >
> > >
> >
> https://github.com/apache/flink/blob/master/flink-libraries/flink-ml/src/main/scala/org/apache/flink/ml/common/FlinkMLTools.scala#L67-L73
> > > >
> > > > Best,
> > > >
> > > > Marton
> > > >
> > > > On Thu, Feb 18, 2016 at 7:47 PM, Theodore Vasiloudis <
> > > > theodoros.vasilou...@gmail.com> wrote:
> > > >
> > > > > Hello Trevor,
> > > > >
> > > > > IIRC it was mostly that they weren't needed at the time. Feel free
> to
> > > > add,
> > > > > along with BLAS ops.
> > > > >
> > > > > Cheers,
> > > > > Theo
> > > > >
> > > > > On Thu, Feb 18, 2016 at 5:14 PM, Trevor Grant <
> > > trevor.d.gr...@gmail.com>
> > > > > wrote:
> > > > >
> > > > > > Is there a specific reason vectors are imported from Breeze and
> > > > matrices
> > > > > > aren't?
> > > > > >
> > > > > > Specifically I need to take the dot product of a matrix and a
> > vector
> > > > and
> > > > > a
> > > > > > matrix and a matrix. Was wondering if there is a theoretical
> reason
> > > or
> > > > it
> > > > > > just wasn't needed at the time.  Looks like some of this got
> lifted
> > > > from
> > > > > > Spark (per the comments in the code, not trying to troll).
> > > > > >
> > > > > > I need dot products of matrices and vectors which isn't
> > implemented.
> > > > > >
> > > > > > Thoughts?
> > > > > >
> > > > > > tg
> > > > > >
> > > > > >
> > > > > >
> > > > > > Trevor Grant
> > > > > > Data Scientist
> > > > > > https://github.com/rawkintrevo
> > > > > > http://stackexchange.com/users/3002022/rawkintrevo
> > > > > > http://trevorgrant.org
> > > > > >
> > > > > > *"Fortunate is he, who is able to know the causes of things."
> > > -Virgil*
> > > > > >
> > > > >
> > > >
> > >
> >
>
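
For what it's worth, a minimal sketch of the matrix-vector and matrix-matrix
products done through breeze.linalg directly, assuming Breeze is on the
classpath (flink-ml already depends on it); conversions to and from the
FlinkML types are left out:

import breeze.linalg.{DenseMatrix, DenseVector}

object BreezeDotProducts {
  def main(args: Array[String]): Unit = {
    val m = DenseMatrix((1.0, 2.0), (3.0, 4.0))   // 2x2 matrix
    val v = DenseVector(1.0, 1.0)

    val mv: DenseVector[Double] = m * v           // matrix-vector product
    val mm: DenseMatrix[Double] = m * m           // matrix-matrix product

    println(mv)
    println(mm)
  }
}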


Re: Dense matrices in FlinkML

2016-02-19 Thread Theodore Vasiloudis
Just to note: This should be a separate PR if you plan on contributing this.

On Thu, Feb 18, 2016 at 7:54 PM, Márton Balassi <balassi.mar...@gmail.com>
wrote:

> Hi guys,
>
> They are at least already registered for serialization [1], so there should
> be no intentional conflict as Theo has suggested.
>
> [1]
>
> https://github.com/apache/flink/blob/master/flink-libraries/flink-ml/src/main/scala/org/apache/flink/ml/common/FlinkMLTools.scala#L67-L73
>
> Best,
>
> Marton
>
> On Thu, Feb 18, 2016 at 7:47 PM, Theodore Vasiloudis <
> theodoros.vasilou...@gmail.com> wrote:
>
> > Hello Trevor,
> >
> > IIRC it was mostly that they weren't needed at the time. Feel free to
> add,
> > along with BLAS ops.
> >
> > Cheers,
> > Theo
> >
> > On Thu, Feb 18, 2016 at 5:14 PM, Trevor Grant <trevor.d.gr...@gmail.com>
> > wrote:
> >
> > > Is there a specific reason vectors are imported from Breeze and
> matrices
> > > aren't?
> > >
> > > Specifically I need to take the dot product of a matrix and a vector
> and
> > a
> > > matrix and a matrix. Was wondering if there is a theoretical reason or
> it
> > > just wasn't needed at the time.  Looks like some of this got lifted
> from
> > > Spark (per the comments in the code, not trying to troll).
> > >
> > > I need dot products of matrices and vectors which isn't implemented.
> > >
> > > Thoughts?
> > >
> > > tg
> > >
> > >
> > >
> > > Trevor Grant
> > > Data Scientist
> > > https://github.com/rawkintrevo
> > > http://stackexchange.com/users/3002022/rawkintrevo
> > > http://trevorgrant.org
> > >
> > > *"Fortunate is he, who is able to know the causes of things."  -Virgil*
> > >
> >
>


Re: Dense matrices in FlinkML

2016-02-18 Thread Theodore Vasiloudis
Hello Trevor,

IIRC it was mostly that they weren't needed at the time. Feel free to add,
along with BLAS ops.

Cheers,
Theo

On Thu, Feb 18, 2016 at 5:14 PM, Trevor Grant 
wrote:

> Is there a specific reason vectors are imported from Breeze and matrices
> aren't?
>
> Specifically I need to take the dot product of a matrix and a vector and a
> matrix and a matrix. Was wondering if there is a theoretical reason or it
> just wasn't needed at the time.  Looks like some of this got lifted from
> Spark (per the comments in the code, not trying to troll).
>
> I need dot products of matrices and vectors which isn't implemented.
>
> Thoughts?
>
> tg
>
>
>
> Trevor Grant
> Data Scientist
> https://github.com/rawkintrevo
> http://stackexchange.com/users/3002022/rawkintrevo
> http://trevorgrant.org
>
> *"Fortunate is he, who is able to know the causes of things."  -Virgil*
>


Opening a discussion on FlinkML

2016-02-12 Thread Theodore Vasiloudis
Hello all,

I would like to get a conversation started on how we plan to move forward
with FlinkML.

Development on the library has been mostly dormant for the past 6 months,
mainly, I believe, because of the lack of available committers to review PRs.

Last month we got together with Till and Marton and talked about how we could
try to solve this and ensure continued development of the library.

We see 3 possible paths we could take:

   1. Externalize the library, creating a new repository under the Apache
   Flink project. This decouples the development of FlinkML from the Flink
   release cycle, allowing us to move faster and incorporate new features as
   they become available. As FlinkML is a library under development, tying it
   to specific versions does not make much sense anyway. The library would
   depend on the latest snapshot version of Flink. It would then be possible
   for the Flink distribution to cherry-pick parts of the library to be
   included with the core distribution.

   2. Keep the development under the main Flink project but bring in new
   committers. This would mean that the development remains as is and is tied
   to core Flink releases, but new work should get merged at much more
   regular intervals through the help of committers other than Till. Marton
   Balassi has volunteered for that role and I hope that more might take up
   that role.

   3. A third option is to fork FlinkML on a repository on which we are
   able to commit freely (again through PRs and reviews of course) and merge
   good parts back into the main repo once in a while. This allows for faster
   progress and more experimental work but obviously creates fragmentation.

I would like to hear your thoughts on these three options, as well as discuss
other alternatives that could help move FlinkML forward.

Cheers,
Theodore


Re: Case style anonymous functions not supported by Scala API

2016-02-09 Thread Theodore Vasiloudis
Thanks for bringing this up Stefano, it would be a very welcome addition
indeed.

I like the approach of having extensions through implicits as well. IMHO,
though, this should be the default behavior, without the need to add another
import.

On Tue, Feb 9, 2016 at 1:29 PM, Stefano Baghino <
stefano.bagh...@radicalbit.io> wrote:

> I see, thanks for the tip! I'll work on it; meanwhile, I've added some
> functions and Scaladoc:
>
> https://github.com/radicalbit/flink/blob/1159-implicit/flink-scala/src/main/scala/org/apache/flink/api/scala/extensions/package.scala
>
> On Tue, Feb 9, 2016 at 12:01 PM, Till Rohrmann 
> wrote:
>
> > Not the DataSet but the JoinDataSet and the CoGroupDataSet do in the form
> > of an apply function.
> > ​
> >
> > On Tue, Feb 9, 2016 at 11:09 AM, Stefano Baghino <
> > stefano.bagh...@radicalbit.io> wrote:
> >
> > > Sure, it was just a draft. I agree that filter and mapPartition make
> > sense,
> > > but coGroup and join don't look like they take a function.
> > >
> > > On Tue, Feb 9, 2016 at 10:08 AM, Till Rohrmann 
> > > wrote:
> > >
> > > > This looks like a good design to me :-) The only thing is that it is
> > not
> > > > complete. For example, the filter, mapPartition, coGroup and join
> > > functions
> > > > are missing.
> > > >
> > > > Cheers,
> > > > Till
> > > > ​
> > > >
> > > > On Tue, Feb 9, 2016 at 1:18 AM, Stefano Baghino <
> > > > stefano.bagh...@radicalbit.io> wrote:
> > > >
> > > > > What do you think of something like this?
> > > > >
> > > > >
> > > > >
> > > >
> > >
> >
> https://github.com/radicalbit/flink/commit/21a889a437875c88921c93e87d88a378c6b4299e
> > > > >
> > > > > In this way, several extensions can be collected in this package
> > object
> > > > and
> > > > > picked altogether or a-là-carte (e.g. import
> > > > > org.apache.flink.api.scala.extensions.AcceptPartialFunctions).
> > > > >
> > > > > On Mon, Feb 8, 2016 at 2:51 PM, Till Rohrmann <
> trohrm...@apache.org>
> > > > > wrote:
> > > > >
> > > > > > I like the idea to support partial functions with Flink’s Scala
> > API.
> > > > > > However, I think that breaking the API and making it inconsistent
> > > with
> > > > > > respect to the Java API is not the best option. I would rather be
> > in
> > > > > favour
> > > > > > of the first proposal where we add a new method xxxWith via
> > implicit
> > > > > > conversions.
> > > > > >
> > > > > > Cheers,
> > > > > > Till
> > > > > > ​
> > > > > >
> > > > > > On Sun, Feb 7, 2016 at 12:44 PM, Stefano Baghino <
> > > > > > stefano.bagh...@radicalbit.io> wrote:
> > > > > >
> > > > > > > It took me a little time but I was able to put together some
> > code.
> > > > > > >
> > > > > > > In this commit I just added a few methods renamed to prevent
> > > > > overloading,
> > > > > > > thus usable with PartialFunction instead of functions:
> > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> https://github.com/radicalbit/flink/commit/aacd59e0ce98cccb66d48a30d07990ac8f345748
> > > > > > >
> > > > > > > In this other commit I coded the original proposal, renaming
> the
> > > > > methods
> > > > > > to
> > > > > > > obtain the same effect as before, but with lower friction for
> > Scala
> > > > > > > developers (and provided some usage examples):
> > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> https://github.com/radicalbit/flink/commit/33403878eebba70def42f73a1cb671d13b1521b5
> > > > > > >
> > > > > > > On Thu, Jan 28, 2016 at 6:16 PM, Stefano Baghino <
> > > > > > > stefano.bagh...@radicalbit.io> wrote:
> > > > > > >
> > > > > > > > Hi Stephan,
> > > > > > > >
> > > > > > > > thank you for the quick reply and for your feedback; I agree
> > with
> > > > you
> > > > > > > that
> > > > > > > > breaking changes have to taken very seriously.
> > > > > > > >
> > > > > > > > The rationale behind my proposal is that Scala users are
> > already
> > > > > > > > accustomed to higher-order functions that manipulate
> > collections
> > > > and
> > > > > it
> > > > > > > > would beneficial for them to have an API that tries to adhere
> > as
> > > > much
> > > > > > as
> > > > > > > > possible to the interface provided by the Scala Collections
> > API.
> > > > IMHO
> > > > > > > being
> > > > > > > > able to manipulate a DataSet or DataStream like a Scala
> > > collection
> > > > > > > > idiomatically would appeal to developers and reduce the
> > friction
> > > > for
> > > > > > them
> > > > > > > > to learn Flink.
> > > > > > > >
> > > > > > > > If we want to pursue the renaming path, I think these changes
> > > (and
> > > > > > > porting
> > > > > > > > the rest of the codebase, like `flink-ml` and
> `flink-contrib`,
> > to
> > > > the
> > > > > > new
> > > > > > > > method names) can be done in relatively little time. Since
> > Flink
> > > is
> > > > > > > > approaching a major release, I think it's a good time to
> > consider
> > > > > this
> > > > > > > > change, if the community deems it relevant.

[jira] [Created] (FLINK-3316) Links to Gelly and FlinkML libraries on main site broken

2016-02-02 Thread Theodore Vasiloudis (JIRA)
Theodore Vasiloudis created FLINK-3316:
--

 Summary: Links to Gelly and FlinkML libraries on main site broken
 Key: FLINK-3316
 URL: https://issues.apache.org/jira/browse/FLINK-3316
 Project: Flink
  Issue Type: Bug
  Components: Project Website
Reporter: Theodore Vasiloudis
Priority: Trivial


The landing page of flink.apache.org includes links to the Gelly and FlinkML 
libraries under the text:

{quote}
Flink also bundles libraries for domain-specific use cases:

1. Machine Learning library, and
2. Gelly, a graph processing API and library.
{quote}

These point to anchor links in the Features page that seem to no longer exist.

I guess linking to the docs instead could be a solution.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


Re: Flink ML Vector and DenseVector

2016-01-18 Thread Theodore Vasiloudis
I agree with Till, the data types are different here so you need a custom
string vector.

The Vector abstraction in FlinkML is designed with numerical vectors in
mind.

On Mon, Jan 18, 2016 at 2:33 PM, Till Rohrmann  wrote:

> Hi Hilmi,
>
> I think in your case it makes sense to define a custom vector of strings.
> The easiest implementation could be an Array[String] or List[String].
>
> The reason why it does not make so much sense to make Vector and
> DenseVector
> generic is that these types are algebraic data types. How would you define
> algebraic operations such as scalar product, outer product, multiplication,
> etc. on a vector of strings? Then you would have to provide different
> implementations for the different type parameters.
>
> Cheers,
> Till
> ​
>
> On Mon, Jan 18, 2016 at 1:40 PM, Hilmi Yildirim 
> wrote:
>
> > Hi,
> > how I explained it in a previous E-Mail, I need a LabeledVector where the
> > label is also a vector. After we discussed this issue, I created a new
> > class named LabeledSequenceVector with the labels as a Vector. In my use
> > case, I want to train a POS-Tagger system, so the "vector" is a vector of
> > strings and the "labels" is also a vector of strings. If I use the Flink
> > Vector/DenseVector implementation then the vector does only have double
> > values but I need String values.
> >
> > Best Regards,
> > Hilmi
> >
> >
> > Am 18.01.2016 um 13:33 schrieb Chiwan Park:
> >
> >> Hi Hilmi,
> >>
> >> In NLP, which types are used for vector values? I think we can cover
> >> typical case using double values.
> >>
> >> On Jan 18, 2016, at 9:19 PM, Hilmi Yildirim 
> >>> wrote:
> >>>
> >>> Hi,
> >>> the Vector and DenseVector implementations of Flink ML only allow
> Double
> >>> values. But there are cases where the values are not Doubles, e.g. in
> NLP.
> >>> Does it make sense to make the implementations generic, i.e. Vector[T]
> and
> >>> DenseVector[T]?
> >>>
> >>> Best Regards,
> >>> Hilmi
> >>>
> >>> --
> >>> ==
> >>> Hilmi Yildirim, M.Sc.
> >>> Researcher
> >>>
> >>> DFKI GmbH
> >>> Intelligente Analytik für Massendaten
> >>> DFKI Projektbüro Berlin
> >>> Alt-Moabit 91c
> >>> D-10559 Berlin
> >>> Phone: +49 30 23895 1814
> >>>
> >>> E-Mail: hilmi.yildi...@dfki.de
> >>>
> >>> -
> >>> Deutsches Forschungszentrum fuer Kuenstliche Intelligenz GmbH
> >>> Firmensitz: Trippstadter Strasse 122, D-67663 Kaiserslautern
> >>>
> >>> Geschaeftsfuehrung:
> >>> Prof. Dr. Dr. h.c. mult. Wolfgang Wahlster (Vorsitzender)
> >>> Dr. Walter Olthoff
> >>>
> >>> Vorsitzender des Aufsichtsrats:
> >>> Prof. Dr. h.c. Hans A. Aukes
> >>>
> >>> Amtsgericht Kaiserslautern, HRB 2313
> >>> -
> >>>
> >>> Regards,
> >> Chiwan Park
> >>
> >>
>
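
A minimal sketch of the kind of custom type described above for the
POS-tagging case, kept separate from the numerical Vector abstraction; the
name and fields are illustrative:

object LabeledSequenceExample {
  // Plain container for a token sequence and its (string) label sequence.
  final case class LabeledSequence(tokens: Array[String], labels: Array[String]) {
    require(tokens.length == labels.length, "each token needs exactly one label")
  }

  def main(args: Array[String]): Unit = {
    // A three-token sentence with its POS tags.
    val example = LabeledSequence(Array("Flink", "is", "fast"), Array("NNP", "VBZ", "JJ"))
    println(example.tokens.zip(example.labels).mkString(" "))
  }
}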


Re: LabeledVector with label vector

2016-01-05 Thread Theodore Vasiloudis
Generalizing the type of the label for the label vector is an idea we
played with when designing the current optimization framework.

We ended up deciding against it as the double type allows us to do
regressions and (multiclass) classification which should be the majority of
the use cases out there, while keeping the code simple.

Generalizing this to [T <: Serializable] is too broad I think. [T <:
Vector] is I think more reasonable; I cannot think of many cases where the
label in an optimization problem is something other than a vector/double.

Any change would require a number of changes in the optimization of course,
as optimizing for vector and double labels requires different handling of
error calculation etc but it should be doable.
Note however that since LabeledVector is such a core part of the library
any changes would involve a number of adjustments downstream.

Perhaps having different optimizers etc. for Vectors and double labels
makes sense, but I haven't put much thought into this.


On Tue, Jan 5, 2016 at 12:17 PM, Chiwan Park  wrote:

> Hi Hilmi,
>
> Thanks for suggestion about type of labeled vector. Basically, I agree
> that your suggestion is reasonable. But, I would like to generialize
> `LabeledVector` like following example:
>
> ```
> case class LabeledVector[T <: Serializable](label: T, vector: Vector)
> extends Serializable {
>   // some implementations for LabeledVector
> }
> ```
>
> How about this implementation? If there are any other opinions, please
> send a email to mailing list.
>
> > On Jan 5, 2016, at 7:36 PM, Hilmi Yildirim 
> wrote:
> >
> > Hi,
> > in the ML-Pipeline of Flink we have the "LabeledVector" class. It
> consists of a vector and a label as a double value. Unfortunately, it is
> not applicable for sequence learning where the label is also a vector. For
> example, in NLP we have a vector of words and the label is a vector of the
> corresponding labels.
> >
> > The optimize function of the "Solver" class has a DateSet[LabeledVector]
> as input and, therefore, it is not applicable for sequence learning. I
> think the LabeledVector should be adapted that the label is a vector
> instead of a single Double value. What do you think?
> >
> > Best Regards,
> >
> > --
> > ==
> > Hilmi Yildirim, M.Sc.
> > Researcher
> >
> > DFKI GmbH
> > Intelligente Analytik für Massendaten
> > DFKI Projektbüro Berlin
> > Alt-Moabit 91c
> > D-10559 Berlin
> > Phone: +49 30 23895 1814
> >
> > E-Mail: hilmi.yildi...@dfki.de
> >
> > -
> > Deutsches Forschungszentrum fuer Kuenstliche Intelligenz GmbH
> > Firmensitz: Trippstadter Strasse 122, D-67663 Kaiserslautern
> >
> > Geschaeftsfuehrung:
> > Prof. Dr. Dr. h.c. mult. Wolfgang Wahlster (Vorsitzender)
> > Dr. Walter Olthoff
> >
> > Vorsitzender des Aufsichtsrats:
> > Prof. Dr. h.c. Hans A. Aukes
> >
> > Amtsgericht Kaiserslautern, HRB 2313
> > -
> >
>
> Regards,
> Chiwan Park
>
>
>
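
A minimal sketch of the narrower bound suggested above, constraining the label
to FlinkML's Vector instead of any Serializable; the class name is
illustrative and none of this is the actual FlinkML code:

import org.apache.flink.ml.math.{DenseVector, Vector}

// Label and features are both vectors, e.g. one label entry per sequence element.
case class SequenceLabeledVector[L <: Vector](label: L, vector: Vector) extends Serializable

object SequenceLabeledVectorExample {
  def main(args: Array[String]): Unit = {
    val example = SequenceLabeledVector(
      DenseVector(Array(1.0, 0.0, 2.0)),   // label vector
      DenseVector(Array(0.5, 1.5, 2.5)))   // feature vector
    println(example)
  }
}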


Re: Scala 2.10/2.11 Maven dependencies

2015-10-26 Thread Theodore Vasiloudis
+1 for having binaries, I'm working on a Spark application currently with
Scala 2.11 and having to rebuild everything when deploying e.g. to EC2 is a
pain.

On Mon, Oct 26, 2015 at 4:22 PM, Ufuk Celebi  wrote:

> I agree with Till, but is this something you want to address in this
> release already?
>
> I would postpone it to 1.0.0.
>
> – Ufuk
>
> > On 26 Oct 2015, at 16:17, Till Rohrmann  wrote:
> >
> > I would be in favor of deploying also Scala 2.11 artifacts to Maven since
> > more and more people will try out Flink with Scala 2.11. Having the
> > dependencies in the Maven repository makes it considerably easier for
> > people to get their Flink jobs running.
> >
> > Furthermore, I observed that people are not aware that our deployed
> > artifacts, e.g. flink-runtime, are built with Scala 2.10. As a
> consequence,
> > they mix flink dependencies with other dependencies pulling in Scala 2.11
> > and then they wonder that the program crashes. It would be, imho, clearer
> > if all our dependencies which depend on a specific Scala version would
> have
> > the corresponding Scala suffix appended.
> >
> > Adding the 2.10 suffix would also spare us the hassle of upgrading to a
> > newer Scala version in the future, because then the artifacts wouldn't
> > share the same artifact name.
> >
> > Cheers,
> > Till
> >
> > On Mon, Oct 26, 2015 at 4:04 PM, Maximilian Michels 
> wrote:
> >
> >> Hi Flinksters,
> >>
> >> We have recently committed an easy way to change Flink's Scala version.
> The
> >> question arises now whether we should ship Scala 2.11 as binaries and
> via
> >> Maven. For the rc0, I created all binaries twice, for Scala 2.10 and
> 2.11.
> >> However, I didn't create Maven artifacts. This follows our current
> shipping
> >> strategy where we only ship Hadoop1 and Hadoop 2.3.0 Maven dependencies
> but
> >> additionally Hadoop 2.4, 2.6, 2.7 as binaries.
> >>
> >> Should we also upload Maven dependencies for Scala 2.11?
> >>
> >> If so, the next question arises: What version pattern should we have for
> >> the Flink Scala 2.11 dependencies? For Hadoop, we append -hadoop1 to the
> >> VERSION, e.g. artifactID=flink-core, version=0.9.1-hadoop1.
> >>
> >> However, it is common practice to append the suffix to the artifactID of
> >> the Maven dependency, e.g. artifactID=flink-core_2.11, version=0.9.1.
> This
> >> has mostly historic reasons but is widely used.
> >>
> >> Whatever naming pattern we choose, it should be consistent. I would be
> in
> >> favor of changing our artifact names to contain the Hadoop and Scala
> >> version. This would also imply that all Scala dependent Maven modules
> >> receive a Scala suffix (also the default Scala 2.10 modules).
> >>
> >> Cheers,
> >> Max
> >>
>
>
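
For illustration, this is what the artifact-suffix convention looks like from
the build side; with sbt the %% operator appends the Scala binary version to
the artifact name (the version number below is just a placeholder):

// build.sbt (sketch)
scalaVersion := "2.11.7"

// %% resolves to the artifact "flink-scala_2.11" for the Scala version above.
libraryDependencies += "org.apache.flink" %% "flink-scala" % "1.0.0"

// The same dependency written out explicitly with the suffix in the artifactId:
libraryDependencies += "org.apache.flink" % "flink-scala_2.11" % "1.0.0"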


[jira] [Created] (FLINK-2860) The mlr object from the FlinkML Getting Started code example uses an undefined argument

2015-10-16 Thread Theodore Vasiloudis (JIRA)
Theodore Vasiloudis created FLINK-2860:
--

 Summary: The mlr object from the FlinkML Getting Started code 
example uses an undefined argument
 Key: FLINK-2860
 URL: https://issues.apache.org/jira/browse/FLINK-2860
 Project: Flink
  Issue Type: Improvement
  Components: Machine Learning Library
Affects Versions: 0.9.1
Reporter: Theodore Vasiloudis
Priority: Trivial


The getting started guide code example uses the following code:
{code}
val trainingData: DataSet[LabeledVector] = ...
val testingData: DataSet[Vector] = ...

val mlr = MultipleLinearRegression()
  .setStepsize(1.0)
  .setIterations(100)
  .setConvergenceThreshold(0.001)

mlr.fit(trainingData, parameters)
{code}

The call to {{mlr.fit()}} uses a {{parameters}} argument that is unnecessary, 
we should remove that.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


Re: [DISCUSS] Introducing a review process for pull requests

2015-10-07 Thread Theodore Vasiloudis
>
> Could we maybe do a "PR overall status assessment" once per week or so,
> where we find those problematic PRs and try to assign them / close them?


I like this idea, as it would raise  awareness about lingering PRs. Does
anybody know if there is
some way to integrate this into JIRA, so we can easily see (and
filter/sort) lingering PRs?

Cheers,
Theo

On Wed, Oct 7, 2015 at 12:04 PM, Vasiliki Kalavri <vasilikikala...@gmail.com
> wrote:

> Hey,
>
> I agree that we need to organize the PR process. A PR management tool would
> be great.
>
> However, it seems to me that the shepherding process described is -more or
> less- what we've already been doing. There is usually a person that reviews
> the PR and kind-of drives the process. Maybe making this explicit will make
> things go faster.
>
> There is a problem I see here, quite related to what Theo also mentioned.
> For how long do we let a PR lingering around without a shepherd? What do we
> do if nobody volunteers?
> Could we maybe do a "PR overall status assessment" once per week or so,
> where we find those problematic PRs and try to assign them / close them?
>
> Finally, I think shepherding one's own PR is a bad idea (the review part)
> unless it's something trivial.
>
> Cheers and see you very soon,
> -Vasia.
> On Oct 7, 2015 11:24 AM, "Matthias J. Sax" <mj...@apache.org> wrote:
>
> > Ok. That makes sense. So most people will need to change behavior and
> > start discussions in JIRA and not over dev list. Furthermore, issues
> > list must be monitored more carefully... (I personally, watch dev
> > carefully and only skip over issues list right now)
> >
> > -Matthias
> >
> > On 10/07/2015 10:22 AM, Fabian Hueske wrote:
> > > @Matthias: That's a good point. Each PR should be backed by a JIRA
> issue.
> > > If that's not the case, we have to make the contributor aware of that
> > rule.
> > > I'll update the process to keep all discussions in JIRA (will be
> mirrored
> > > to issues ML), OK?
> > >
> > > @Theo: You are right. Adding this process won't be the silver bullet to
> > fix
> > > all PR related issues.
> > > But I hope it will help to improve the overall situation.
> > >
> > > Are there any other comment? Otherwise I would start to prepare add
> this
> > to
> > > our website.
> > >
> > > Thanks, Fabian
> > >
> > > 2015-10-06 9:46 GMT+02:00 Theodore Vasiloudis <
> > > theodoros.vasilou...@gmail.com>:
> > >
> > >> One problem that we are seeing with FlinkML PRs is that there are
> simply
> > >> not enough commiters to "shepherd" all of them.
> > >>
> > >> While I think this process would help generally, I don't think it
> would
> > >> solve this kind of problem.
> > >>
> > >> Regards,
> > >> Theodore
> > >>
> > >> On Mon, Oct 5, 2015 at 3:28 PM, Matthias J. Sax <mj...@apache.org>
> > wrote:
> > >>
> > >>> One comment:
> > >>> We should ensure that contributors follow discussions on the dev
> > mailing
> > >>> list. Otherwise, they might miss important discussions regarding
> their
> > >>> PR (what happened already). Thus, the contributor was waiting for
> > >>> feedback on the PR, while the reviewer(s) waited for the PR to be
> > >>> updated according to the discussion consensus, resulting in
> unnecessary
> > >>> delays.
> > >>>
> > >>> -Matthias
> > >>>
> > >>>
> > >>>
> > >>> On 10/05/2015 02:18 PM, Fabian Hueske wrote:
> > >>>> Hi everybody,
> > >>>>
> > >>>> Along with our efforts to improve the “How to contribute” guide, I
> > >> would
> > >>>> like to start a discussion about a setting up a review process for
> > pull
> > >>>> requests.
> > >>>>
> > >>>> Right now, I feel that our PR review efforts are often a bit
> > >> unorganized.
> > >>>> This leads to situation such as:
> > >>>>
> > >>>> - PRs which are lingering around without review or feedback
> > >>>> - PRs which got a +1 for merging but which did not get merged
> > >>>> - PRs which have been rejected after a long time
> > >>>> - PRs which became irrelevant because some component was rewritten
> > >>&g

Re: [DISCUSS] Introducing a review process for pull requests

2015-10-06 Thread Theodore Vasiloudis
One problem that we are seeing with FlinkML PRs is that there are simply
not enough commiters to "shepherd" all of them.

While I think this process would help generally, I don't think it would
solve this kind of problem.

Regards,
Theodore

On Mon, Oct 5, 2015 at 3:28 PM, Matthias J. Sax  wrote:

> One comment:
> We should ensure that contributors follow discussions on the dev mailing
> list. Otherwise, they might miss important discussions regarding their
> PR (what happened already). Thus, the contributor was waiting for
> feedback on the PR, while the reviewer(s) waited for the PR to be
> updated according to the discussion consensus, resulting in unnecessary
> delays.
>
> -Matthias
>
>
>
> On 10/05/2015 02:18 PM, Fabian Hueske wrote:
> > Hi everybody,
> >
> > Along with our efforts to improve the “How to contribute” guide, I would
> > like to start a discussion about a setting up a review process for pull
> > requests.
> >
> > Right now, I feel that our PR review efforts are often a bit unorganized.
> > This leads to situation such as:
> >
> > - PRs which are lingering around without review or feedback
> > - PRs which got a +1 for merging but which did not get merged
> > - PRs which have been rejected after a long time
> > - PRs which became irrelevant because some component was rewritten
> > - PRs which are lingering around and have been abandoned by their
> > contributors
> >
> > To address these issues I propose to define a pull request review process
> > as follows:
> >
> > 1. [Get a Shepherd] Each pull request is taken care of by a shepherd.
> > Shepherds are committers that voluntarily sign up and *feel responsible*
> > for helping the PR through the process until it is merged (or discarded).
> > The shepherd is also the person to contact for the author of the PR. A
> > committer becomes the shepherd of a PR by dropping a comment on Github
> like
> > “I would like to shepherd this PR”. A PR can be reassigned with lazy
> > consensus.
> >
> > 2. [Accept Decision] The first decision for a PR should be whether it is
> > accepted or not. This depends on a) whether it is a desired feature or
> > improvement for Flink and b) whether the higher-level solution design is
> > appropriate. In many cases such as bug fixes or discussed features or
> > improvements, this should be an easy decision. In case of more a complex
> > feature, the discussion should have been started when the mandatory JIRA
> > was created. If it is still not clear whether the PR should be accepted
> or
> > not, a discussion should be started in JIRA (a JIRA issue needs to be
> > created if none exists). The acceptance decision should be recorded by a
> > “+1 to accept” message in Github. If the PR is not accepted, it should be
> > closed in a timely manner.
> >
> > 3. [Merge Decision] Once a PR has been “accepted”, it should be brought
> > into a mergeable state. This means the community should quickly react on
> > contributor questions or PR updates. Everybody is encouraged to review
> pull
> > requests and give feedback. Ideally, the PR author does not have to wait
> > for a long time to get feedback. The shepherd of a PR should feel
> > responsible to drive this process forward. Once the PR is considered to
> be
> > mergeable, this should be recorded by a “+1 to merge” message in Github.
> If
> > the pull request becomes abandoned at some point in time, it should be
> > either taken over by somebody else or be closed after a reasonable time.
> > Again, this can be done by anybody, but the shepherd should feel
> > responsible to resolve the issue.
> >
> > 4. Once, the PR is in a mergeable state, it should be merged. This can be
> > done by anybody, however the shepherd should make sure that it happens
> in a
> > timely manner.
> >
> > IMPORTANT: Everybody is encouraged to discuss pull requests, give
> feedback,
> > and merge pull requests which are in a good shape. However, the shepherd
> > should feel responsible to drive a PR forward if nothing happens.
> >
> > By defining a review process for pull requests, I hope we can
> >
> > - Improve the satisfaction of and interaction with contributors
> > - Improve and speed-up the review process of pull requests.
> > - Close irrelevant and stale PRs more timely
> > - Reduce the effort for code reviewing
> >
> > The review process can be supported by some tooling:
> >
> > - A QA bot that checks the quality of pull requests such as increase of
> > compiler warnings, code style, API changes, etc.
> > - A PR management tool such as Sparks PR dashboard (see:
> > https://spark-prs.appspot.com/)
> >
> > I would like to start discussions about using such tools later as
> separate
> > discussions.
> >
> > Looking forward to your feedback,
> > Fabian
> >
>
>


Re: Flink ML linear regression issue

2015-09-18 Thread Theodore Vasiloudis
+1, having the convenient creation of pipelines for Java is more of a long
term project, but we should make it possible to manually create pipelines
in Java.

On Fri, Sep 18, 2015 at 11:15 AM, Till Rohrmann 
wrote:

> Hi Alexey and Hanan,
>
> one of FlinkML’s feature is the flexible pipelining mechanism. It allows
> you to chain multiple transformers with a trailing predictor to form a data
> analysis pipeline. In order to support multiple input types, the actual
> program logic (matching for the type) is assembled at compile time by the
> Scala compiler using implicits. That is also the reason why you see in Java
> the fourth parameter fitOperation when calling
> multipleLinearRegression.fit() which in Scala is an implicit parameter. In
> theory, it is possible to construct the pipelines yourself in Java by
> assembling explicitly the respective implicit operations. However, I would
> refrain from doing so, because it is error prone and laborious.
>
> At the moment, I don’t really see an easy solution how to port the
> pipelining mechanism to Java (8), because of the missing feature of
> implicits. However, what we could do is to provide fit, predict and
> transform method which can be used without the chaining support. Then you
> lose the pipelining, but you can do it manually by calling the methods
> (e.g. fit and transform) for each stage. We could add a thin Java layer
> which calls the Scala methods with the correctly instantiated operations.
>
> Cheers,
> Till
> ​
>
> On Thu, Sep 17, 2015 at 7:05 PM, Alexey Sapozhnikov 
> wrote:
>
> > Hello everyone.
> >
> > Do you have a sample in Java how to implement Flink
> > MultipleLinearRegression example?
> > Scala is great, however we would like to see the exact example we could
> > invoke it from Java if it is possible.
> > Thanks and sorry for the interrupt.
> >
> >
> >
> > On Thu, Sep 17, 2015 at 4:27 PM, Hanan Meyer  wrote:
> >
> > > Hi
> > >
> > > I'm using Flink ML 9.2.1 in order to perform a multiple linear
> regression
> > > with a csv data file.
> > >
> > > The Scala sample code for it is pretty straightforward:
> > > val mlr = MultipleLinearRegression()
> > >
> > > val parameters = ParameterMap()
> > >
> > > parameters.add(MultipleLinearRegression.Stepsize, 2.0)
> > > parameters.add(MultipleLinearRegression.Iterations, 10)
> > > parameters.add(MultipleLinearRegression.ConvergenceThreshold, 0.001)
> > > val inputDS = env.fromCollection(data)
> > >
> > > mlr.fit(inputDS, parameters)
> > >
> > > When I'm using Java(8) the fit method includes 3 parameters
> > > 1. dataset
> > > 2.parameters
> > > 3. object which implements -fitOperation interface
> > >
> > > multipleLinearRegression.fit(regressionDS, parameters,fitOperation);
> > >
> > > Is there a need to  implement the fitOperation interface which have
> been
> > > already
> > > implemented in Flinks ml source code.
> > >
> > > Another option is using MultipleLinearRegression.fitMLR() method ,but I
> > > haven't found a way to pass the train dataset to it as a parameter or
> by
> > > setter.
> > >
> > > I'll be more than happy if you could guide me how to implement it in
> Java
> > >
> > > Thanks
> > >
> > > Hanan Meyer
> > >
> > >
> > >
> > >
> > >
> >
> >
> > --
> >
> > *Regards*
> >
> > *Alexey Sapozhnikov*
> > CTO& Co-Founder
> > Scalabillit Inc
> > Aba Even 10-C, Herzelia, Israel
> > M : +972-52-2363823
> > E : ale...@scalabill.it
> > W : http://www.scalabill.it
> > YT - https://youtu.be/9Rj309PTOFA
> > Map:http://mapta.gs/Scalabillit
> > Revolutionizing Proof-of-Concept
> >
>


Re: SGD Effective Learning Rate

2015-09-01 Thread Theodore Vasiloudis
I would also vote for option 1, implemented through a new (string?)
Parameter for SGD.

Also, see a previous discussion here about adaptive learning rates.

On Mon, Aug 31, 2015 at 5:42 PM, Trevor Grant 
wrote:

> https://issues.apache.org/jira/browse/FLINK-1994
>
> There are two ways to set the effective learning rate:
>
>
> Method 1) Several pre-baked ways to calculate the effective learning rate,
> set as a switch. E.g.:
> val effectiveLearningRate = optimizationMethod match
> {
>   // original effective learning rate method for backward compatability
>  case 0 => learningRate/Math.sqrt(iteration)
>   // These come straight from sklearn
>   case 1 => learningRate
>   case 2 => 1 / (regularizationConstant * iteration)
>   case 3 => learningRate / Math.pow(iteration, 0.5) ...
> }
>
> Method2) Make the calculation definable by the user. E.g. introduce a
> function to the class which maybe overridden.
>
> This is a classic trade-off between ease of use and functionality. Method 1
> is easier for novice users/users who are migrating from sklearn. Method2
> will be more extensible- letting users write any old effective learning
> rate calculation they want.
>
> I am leaning toward method 1: how many people really are writing out their
> own custom effective learning rate, as long as there is a fairly good number
> of 'prebaked' calculators available? And if someone really wants to add a
> method, it simply requires adding another case.
>
> I want to open this up in case anyone has an opinion, just in case.
>
> Best,
> tg
>
> Trevor Grant
> Data Scientist
> https://github.com/rawkintrevo
> http://stackexchange.com/users/3002022/rawkintrevo
>
> *"Fortunate is he, who is able to know the causes of things."  -Virgil*
>
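
A minimal sketch of Method 1 with the schedule chosen by a string parameter;
the schedule names mirror the pseudocode above and are illustrative, not the
actual FlinkML parameters:

object EffectiveLearningRate {
  // Pick the effective learning rate for the current iteration by schedule name;
  // adding a new schedule means adding one more case.
  def apply(method: String,
            learningRate: Double,
            regularizationConstant: Double,
            iteration: Int): Double = method match {
    case "default"    => learningRate / math.sqrt(iteration)           // original behaviour
    case "constant"   => learningRate
    case "invscaling" => 1.0 / (regularizationConstant * iteration)
    case "sqrt"       => learningRate / math.pow(iteration, 0.5)
    case other        => throw new IllegalArgumentException(s"Unknown schedule: $other")
  }

  def main(args: Array[String]): Unit = {
    // Step sizes for the first three iterations under the default schedule.
    (1 to 3).foreach(i => println(EffectiveLearningRate("default", 0.1, 0.01, i)))
  }
}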


Re: FYI: Blog Post on Flink's Streaming Performance and Fault Tolerance

2015-08-05 Thread Theodore Vasiloudis
Great post Stephan! A small note: the code for Google Dataflow does not
display correctly for me, I'm getting "lt" and "gt" instead of "<" and ">".

On Wed, Aug 5, 2015 at 4:11 PM, Stephan Ewen se...@apache.org wrote:

 Hi all!

 We just published a blog post about how streaming fault tolerance
 mechanisms evolved, and what kind of performance Flink gets with its
 checkpointing mechanism.

 I think it is a pretty interesting read for people that are interested in
 Flink or data streaming in general.

 The blog post talks about:

   - Fault tolerance techniques, starting from acknowledgements, over micro
 batches, to transactional updates and distributed snapshots.

   - Performance of Flink, throughput, latency, and tradeoffs.

   - A chaos monkey experiment where computation continues strongly
 consistent even when periodically killing workers.


 Comments welcome!

 Greetings,
 Stephan



[jira] [Created] (FLINK-2342) Add new fit operation and more tests for StandardScaler

2015-07-10 Thread Theodore Vasiloudis (JIRA)
Theodore Vasiloudis created FLINK-2342:
--

 Summary: Add new fit operation and more tests for StandardScaler
 Key: FLINK-2342
 URL: https://issues.apache.org/jira/browse/FLINK-2342
 Project: Flink
  Issue Type: Improvement
  Components: Machine Learning Library
Reporter: Theodore Vasiloudis
Assignee: Theodore Vasiloudis
Priority: Minor
 Fix For: 0.10


StandardScaler currently has a transform operation for types (Vector, Double), 
but no  corresponding fit operation.

The test cases also do not cover all the possible types that we can call fit 
and transform on.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (FLINK-2321) The seed for the SVM classifier is currently static

2015-07-06 Thread Theodore Vasiloudis (JIRA)
Theodore Vasiloudis created FLINK-2321:
--

 Summary: The seed for the SVM classifier is currently static
 Key: FLINK-2321
 URL: https://issues.apache.org/jira/browse/FLINK-2321
 Project: Flink
  Issue Type: Bug
  Components: Machine Learning Library
Reporter: Theodore Vasiloudis
Assignee: Theodore Vasiloudis
 Fix For: 0.10


The seed for the SVM algorithm in FlinkML has a default value of 0, meaning 
that if it's not set, we always have the same seed when running the algorithm.

What should happen instead is that if the seed is not set, it should be a 
random number.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (FLINK-2269) Add Receiver operating characteristic (ROC) curve evaluation

2015-06-24 Thread Theodore Vasiloudis (JIRA)
Theodore Vasiloudis created FLINK-2269:
--

 Summary: Add Receiver operating characteristic (ROC) curve 
evaluation
 Key: FLINK-2269
 URL: https://issues.apache.org/jira/browse/FLINK-2269
 Project: Flink
  Issue Type: New Feature
  Components: Machine Learning Library
Reporter: Theodore Vasiloudis
Priority: Minor


[ROC curves|https://en.wikipedia.org/wiki/Receiver_operating_characteristic] 
are a popular way to evaluate the performance of binary classifiers.

We should should support this in our evaluation framework.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (FLINK-2272) Move vision and roadmap for FlinkML from docs to the wiki

2015-06-24 Thread Theodore Vasiloudis (JIRA)
Theodore Vasiloudis created FLINK-2272:
--

 Summary: Move vision and roadmap for FlinkML from docs to the wiki
 Key: FLINK-2272
 URL: https://issues.apache.org/jira/browse/FLINK-2272
 Project: Flink
  Issue Type: Improvement
  Components: Machine Learning Library
Reporter: Theodore Vasiloudis
Priority: Trivial
 Fix For: 0.10


The vision and roadmap for FlinkML should be placed on the wiki instead of 
the docs, to allow easier updating.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (FLINK-2274) Add a histogram method for DataSet[Double] in DataSetUtils

2015-06-24 Thread Theodore Vasiloudis (JIRA)
Theodore Vasiloudis created FLINK-2274:
--

 Summary: Add a histogram method for DataSet[Double] in DataSetUtils
 Key: FLINK-2274
 URL: https://issues.apache.org/jira/browse/FLINK-2274
 Project: Flink
  Issue Type: New Feature
  Components: Machine Learning Library
Reporter: Theodore Vasiloudis
Priority: Minor


We should provide descriptive statistics about DataSets containing real numbers.

A histogram can be used to provide an approximation of the distribution of the 
items in the dataset.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
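
A minimal sketch of what such a method could look like with a fixed number of
equi-width buckets, assuming the caller supplies the value range; the function
name and signature are illustrative, not a proposal for the final DataSetUtils
API:

import org.apache.flink.api.scala._

object HistogramExample {
  // Returns (bucket index, count) pairs; empty buckets simply do not appear.
  def histogram(data: DataSet[Double], min: Double, max: Double, buckets: Int): DataSet[(Int, Long)] = {
    val width = (max - min) / buckets
    data
      .map { x =>
        // clamp values into [0, buckets - 1] so min and max both land in a bucket
        val b = math.max(0, math.min(((x - min) / width).toInt, buckets - 1))
        (b, 1L)
      }
      .groupBy(0)
      .sum(1)
  }

  def main(args: Array[String]): Unit = {
    val env = ExecutionEnvironment.getExecutionEnvironment
    val data = env.fromElements(0.1, 0.4, 0.5, 0.9, 0.95)
    histogram(data, min = 0.0, max = 1.0, buckets = 4).print()
  }
}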


[jira] [Created] (FLINK-2258) Add hyperparameter optimization to FlinkML

2015-06-22 Thread Theodore Vasiloudis (JIRA)
Theodore Vasiloudis created FLINK-2258:
--

 Summary: Add hyperparameter optimization to FlinkML
 Key: FLINK-2258
 URL: https://issues.apache.org/jira/browse/FLINK-2258
 Project: Flink
  Issue Type: New Feature
  Components: Machine Learning Library
Reporter: Theodore Vasiloudis


Hyperparameter optimization is a suite of techniques that are used to find the 
best hyperparameters for a machine learning model, in respect to the 
performance on an independent (test) dataset.

The most common way it is implemented is by using cross-validation to estimate 
the model performance on the test set, and using grid search as the strategy to 
try out different parameters.

In the future we would like to support random search and Bayesian optimisation.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (FLINK-2260) Have a complete model evaluation and selection framework for FlinkML

2015-06-22 Thread Theodore Vasiloudis (JIRA)
Theodore Vasiloudis created FLINK-2260:
--

 Summary: Have a complete model evaluation and selection framework 
for FlinkML
 Key: FLINK-2260
 URL: https://issues.apache.org/jira/browse/FLINK-2260
 Project: Flink
  Issue Type: New Feature
  Components: Machine Learning Library
Reporter: Theodore Vasiloudis


This is an umbrella ticket to keep track of the work on model evaluation and 
selection for FlinkML.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (FLINK-2245) Programs that contain collect() reported as multiple jobs in the Web frontend

2015-06-19 Thread Theodore Vasiloudis (JIRA)
Theodore Vasiloudis created FLINK-2245:
--

 Summary: Programs that contain collect() reported as multiple jobs 
in the Web frontend
 Key: FLINK-2245
 URL: https://issues.apache.org/jira/browse/FLINK-2245
 Project: Flink
  Issue Type: Bug
  Components: Webfrontend
Reporter: Theodore Vasiloudis
Priority: Minor


Currently, if we submit a program (job) that contains calls to collect, we get 
a new job reported in the web frontend for each call to collect.

The expected behaviour when submitting a job is for all the steps in the 
program to be grouped and reported under one job, regardless of the actions 
inside the job.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (FLINK-2248) Allow disabling of stdout logging output

2015-06-19 Thread Theodore Vasiloudis (JIRA)
Theodore Vasiloudis created FLINK-2248:
--

 Summary: Allow disabling of stdout logging output
 Key: FLINK-2248
 URL: https://issues.apache.org/jira/browse/FLINK-2248
 Project: Flink
  Issue Type: Improvement
Reporter: Theodore Vasiloudis
Priority: Minor


Currently, when a job is submitted through the CLI, all the log output about 
each stage of the job is printed to stdout.

It would be useful to have an easy way to disable this output when submitting 
the job, as most of the time we are only interested in the log output if 
something goes wrong.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (FLINK-2244) Add ability to start and stop persistent IaaS cluster

2015-06-19 Thread Theodore Vasiloudis (JIRA)
Theodore Vasiloudis created FLINK-2244:
--

 Summary: Add ability to start and stop persistent IaaS cluster
 Key: FLINK-2244
 URL: https://issues.apache.org/jira/browse/FLINK-2244
 Project: Flink
  Issue Type: New Feature
Reporter: Theodore Vasiloudis
Priority: Minor


Being able to launch a cluster on GCE/EC2, run some jobs, stop it and restart 
it later, while having persistent storage attached is very useful for 
people/companies that need to run jobs only occasionally.

Currently we can launch a cluster, but it's not possible to stop the instances 
when we don't need them, and then restart them to pick up where we left off.




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (FLINK-2247) Improve the way memory is reported in the web frontend

2015-06-19 Thread Theodore Vasiloudis (JIRA)
Theodore Vasiloudis created FLINK-2247:
--

 Summary: Improve the way memory is reported in the web frontend
 Key: FLINK-2247
 URL: https://issues.apache.org/jira/browse/FLINK-2247
 Project: Flink
  Issue Type: Improvement
  Components: Webfrontend
Reporter: Theodore Vasiloudis
Priority: Trivial


Currently in the taskmanager view of the web frontend, the memory available to 
Flink is reported in a slightly confusing manner.

In the worker summary, we get a report of the Flink Managed memory available, 
and we get warnings when that is set too low.

The warnings, however, do not seem to take into account the amount of memory 
actually given to the JVM when they are issued.

For example, on a machine with 7.5GB of memory it is normal to assign ~6GB to 
the JVM, which under the default settings gives ~4GB to Flink managed memory.

In this case, however, we get a warning that 7500MB of memory is available but 
only ~4000MB is usable by Flink, disregarding the fact that only ~6GB was given 
to the JVM in the first place.

The reporting could be improved by showing the total amount available to the 
JVM and the amount available for Flink's managed memory, and only issuing 
warnings when the settings are actually low compared to the available memory.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (FLINK-2228) Web frontend uses two different timezones when reporting the time for a job

2015-06-15 Thread Theodore Vasiloudis (JIRA)
Theodore Vasiloudis created FLINK-2228:
--

 Summary: Web frontend uses two different timezones when reporting 
the time for a job
 Key: FLINK-2228
 URL: https://issues.apache.org/jira/browse/FLINK-2228
 Project: Flink
  Issue Type: Bug
  Components: Webfrontend
Reporter: Theodore Vasiloudis
Priority: Trivial


When reporting the time of jobs, the time is shown in UTC in the title, but in 
local time in other places.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


Re: Problem with ML pipeline

2015-06-08 Thread Theodore Vasiloudis
I agree with Mikio; ids would be useful overall, and feature selection
should not be a part of learning algorithms: all features in a LabeledVector
should be assumed to be relevant by the learners.

On Mon, Jun 8, 2015 at 12:00 PM, Mikio Braun mikiobr...@googlemail.com
wrote:

 Hi all,

 I think there are a number of issues here:

 - whether or not we generally need ids for our examples. For
 time-series, this is a must, but I think it would also help us with
 many other things (like partitioning the data, or picking a consistent
 subset), so I would think adding (numeric) ids in general to
 LabeledVector would be ok.
 - Some machinery to select features. My biggest concern here for
 putting that as a parameter to the learning algorithm is that this is
 something independent of the learning algorithm, so every algorithm
 would need to duplicate the code for that. I think it's better if the
 learning algorithm can assume that the LabeledVector already contains
 all the relevant features, and then there should be other operations
 to project or extract a subset of examples.

 -M

 On Mon, Jun 8, 2015 at 10:01 AM, Till Rohrmann till.rohrm...@gmail.com
 wrote:
  You're right Felix. You need to provide the `FitOperation` and
  `PredictOperation` for the `Predictor` you want to use and the
  `FitOperation` and `TransformOperation` for all `Transformer`s you want
 to
  chain in front of the `Predictor`.
 
  Specifying which features to take could be a solution. However, then
 you're
  always carrying data along which is not needed. Especially for large
 scale
  data, this might be prohibitively expensive. I guess the more efficient
  solution would be to assign an ID and later join with the removed feature
  elements.
 
  Cheers,
  Till
 
  On Mon, Jun 8, 2015 at 7:11 AM Sachin Goel sachingoel0...@gmail.com
 wrote:
 
  A more general approach would be to take as input which indices of the
  vector to consider as features. After that, the vector can be returned
 as
  such and user can do what they  wish with the non-feature values. This
  wouldn't need extending the predict operation, instead this can be
  specified in the model itself using a set parameter function. Or
 perhaps a
  better approach is to just take this input in the predict operation.
 
  Cheers!
  Sachin
  On Jun 8, 2015 10:17 AM, Felix Neutatz neut...@googlemail.com
 wrote:
 
   Probably we also need it for the other classes of the pipeline as
 well,
  in
   order to be able to pass the ID through the whole pipeline.
  
   Best regards,
   Felix
 On 06.06.2015 9:46 AM, Till Rohrmann 
  trohrm...@apache.org
   wrote:
  
Then you only have to provide an implicit PredictOperation[SVM, (T, Int),
(LabeledVector, Int)] value with T <: Vector in the scope where you call
the predict operation.
On Jun 6, 2015 8:14 AM, Felix Neutatz neut...@googlemail.com
  wrote:
   
 That would be great. I like the special predict operation better
   because
it
 is only in some cases necessary to return the id. The special
 predict
 Operation would save this overhead.

 Best regards,
 Felix
  On 04.06.2015 7:56 PM, Till Rohrmann 
 till.rohrm...@gmail.com
  wrote:

  I see your problem. One way to solve the problem is to
 implement a
 special
  PredictOperation which takes a tuple (id, vector) and returns a
  tuple
 (id,
  labeledVector). You can take a look at the implementation for
 the
vector
  prediction operation.
 
  But we can also discuss about adding an ID field to the Vector
  type.
 
  Cheers,
  Till
  On Jun 4, 2015 7:30 PM, Felix Neutatz neut...@googlemail.com
 
wrote:
 
   Hi,
  
    I have the following use case: I want to do regression for a
timeseries
   dataset like:
  
   id, x1, x2, ..., xn, y
  
   id = point in time
   x = features
   y = target value
  
   In the Flink frame work I would map this to a LabeledVector
 (y,
   DenseVector(x)). (I don't want to use the id as a feature)
  
   When I apply finally the predict() method I get a
 LabeledVector
   (y_predicted, DenseVector(x)).
  
   Now my problem is that I would like to plot the predicted
 target
value
   according to its time.
  
   What I have to do now is:
  
    a = predictedDataSet.map( LabeledVector => Tuple2(x, y_p) )
    b = originalDataSet.map( (id, x1, x2, ..., xn, y) => Tuple2(x, id) )

    a.join(b).where(x).equalTo(x) { (a, b) => (id, y_p) }
  
    This is really a cumbersome process for such a simple thing. Is there
    any approach which makes this simpler? If not, can we extend the ML API
    to allow ids?
  
   Best regards,
   Felix
  
 

   
  
 



 --
 Mikio Braun - http://blog.mikiobraun.de, http://twitter.com/mikiobraun



[jira] [Created] (FLINK-2186) Rework CSV import to support very wide files

2015-06-08 Thread Theodore Vasiloudis (JIRA)
Theodore Vasiloudis created FLINK-2186:
--

 Summary: Rework CSV import to support very wide files
 Key: FLINK-2186
 URL: https://issues.apache.org/jira/browse/FLINK-2186
 Project: Flink
  Issue Type: Improvement
  Components: Machine Learning Library, Scala API
Reporter: Theodore Vasiloudis


In the current readCsvFile implementation, importing CSV files with many 
columns ranges from cumbersome to impossible.

For example, to import an 11-column file we need to write:

{code}
val cancer = env.readCsvFile[(String, String, String, String, String, String, 
String, String, String, String, 
String)]("/path/to/breast-cancer-wisconsin.data")
{code}

For many use cases in Machine Learning we might have CSV files with thousands 
or millions of columns that we want to import as vectors.
In that case using the current readCsvFile method becomes impossible.

We therefore need to rework the current function, or create a new one that will 
allow us to import CSV files with an arbitrary number of columns.
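
As a stopgap, a minimal sketch of a manual workaround (assumes purely numeric 
columns with the label in the last column; not a proposal for the final API):

{code}
import org.apache.flink.api.scala._

val env = ExecutionEnvironment.getExecutionEnvironment

// Read the file as text and parse each line into (label, features), instead
// of using readCsvFile with a fixed-arity tuple type.
val cancer: DataSet[(Double, Array[Double])] =
  env.readTextFile("/path/to/breast-cancer-wisconsin.data")
    .filter(_.nonEmpty)
    .map { line =>
      val fields = line.split(',').map(_.trim.toDouble)
      (fields.last, fields.dropRight(1))
    }
{code}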



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (FLINK-2108) Add score function for Predictors

2015-05-28 Thread Theodore Vasiloudis (JIRA)
Theodore Vasiloudis created FLINK-2108:
--

 Summary: Add score function for Predictors
 Key: FLINK-2108
 URL: https://issues.apache.org/jira/browse/FLINK-2108
 Project: Flink
  Issue Type: Improvement
  Components: Machine Learning Library
Reporter: Theodore Vasiloudis
Priority: Minor


A score function for Predictor implementations should take a DataSet[(I, O)] 
and an (optional) scoring measure and return a score.

The DataSet[(I, O)] would probably be the output of the predict function.

For example in MultipleLinearRegression, we can call predict on a labeled 
dataset, get back predictions for each item in the data, and then call score 
with the resulting dataset as an argument and we should get back a score for 
the prediction quality, such as the R^2 score.
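
For illustration, a minimal sketch of an R^2 computation over (truth, prediction) 
pairs; it collects to the client for brevity, whereas the proposed score function 
would of course stay distributed:

{code}
import org.apache.flink.api.scala._

// R^2 = 1 - SS_res / SS_tot over a DataSet of (true value, predicted value).
def rSquared(truthAndPrediction: DataSet[(Double, Double)]): Double = {
  val values = truthAndPrediction.collect()
  val mean = values.map(_._1).sum / values.size
  val ssRes = values.map { case (y, yHat) => (y - yHat) * (y - yHat) }.sum
  val ssTot = values.map { case (y, _) => (y - mean) * (y - mean) }.sum
  1.0 - ssRes / ssTot
}
{code}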



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


Re: Some feedback on the Gradient Descent Code

2015-05-28 Thread Theodore Vasiloudis
+1

This separation was the idea from the start; there is a trade-off between
having highly configurable optimizers and ensuring that the right types of
regularization can only be applied to optimization algorithms that support
them.

It comes down to viewing the optimization framework mostly as a basis to
build learners upon. We want to give users the freedom to choose their
optimization algorithm when creating, for example, a multiple linear
regression learner, but we have to ensure that the parameters they set for
the optimizers are valid (e.g. it should not be possible to set L1
regularization when using L-BFGS).
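
A rough sketch of the separation being discussed, with hypothetical trait names
(not the actual FlinkML optimization API):

{code}
// The loss is defined by the model and loss function alone; the regularization
// penalty is a separate pluggable piece, so a solver such as L-BFGS can reject
// penalties it cannot handle (e.g. L1).
trait LossFunction {
  def loss(prediction: Double, label: Double): Double
  def gradient(prediction: Double, label: Double): Double
}

trait RegularizationPenalty {
  def penalty(weights: Array[Double]): Double
}

case object NoRegularization extends RegularizationPenalty {
  def penalty(weights: Array[Double]): Double = 0.0
}

case class L2Regularization(lambda: Double) extends RegularizationPenalty {
  def penalty(weights: Array[Double]): Double =
    0.5 * lambda * weights.map(w => w * w).sum
}
{code}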

On Thu, May 28, 2015 at 5:37 PM, Till Rohrmann till.rohrm...@gmail.com
wrote:

 I think so too. Ok, I'll try to update the PR accordingly.

 On Thu, May 28, 2015 at 5:36 PM, Mikio Braun mikiobr...@googlemail.com
 wrote:

  Ah yeah, I see.. .
 
  Yes, it's right that many algorithms perform quite differently
  depending on the kind of regularization... . Same holds for cutting
  plane algorithms which either reduce to linear or quadratic programs
  depending on L1 or L2. Generally speaking, I think this is also not
  surprising as L1 is not differentiable everywhere and you'd have to
  use different regularizations... .
 
  So it probably makes sense to separate the loss from the cost function
  (which is then only defined by the model and the loss function), and
  have the regularization extra.
 
  -M
 
  --
  Mikio Braun - http://blog.mikiobraun.de, http://twitter.com/mikiobraun
 



[jira] [Created] (FLINK-2102) Add predict operation for LabeledVector

2015-05-27 Thread Theodore Vasiloudis (JIRA)
Theodore Vasiloudis created FLINK-2102:
--

 Summary: Add predict operation for LabeledVector
 Key: FLINK-2102
 URL: https://issues.apache.org/jira/browse/FLINK-2102
 Project: Flink
  Issue Type: Improvement
  Components: Machine Learning Library
Reporter: Theodore Vasiloudis
Priority: Minor
 Fix For: 0.9


Currently we can only call predict on DataSet[V <: Vector].

A lot of times though we have a DataSet[LabeledVector] that we split into a 
train and test set.

We should be able to make predictions on the test DataSet[LabeledVector] 
without having to transform it into a DataSet[Vector].
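
A minimal sketch of what the wished-for behavior amounts to (a plain map, not 
the actual PredictOperation mechanism; {{scoreFn}} is a hypothetical stand-in 
for the trained model's scoring function):

{code}
import org.apache.flink.api.scala._
import org.apache.flink.ml.common.LabeledVector
import org.apache.flink.ml.math.Vector

// Feed the features of each LabeledVector to the model and keep the true
// label next to the prediction, ready for evaluation.
def predictOnLabeled(test: DataSet[LabeledVector],
                     scoreFn: Vector => Double): DataSet[(Double, Double)] =
  test.map(lv => (lv.label, scoreFn(lv.vector))) // (true label, prediction)
{code}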



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (FLINK-2073) Add contribution guide for FlinkML

2015-05-21 Thread Theodore Vasiloudis (JIRA)
Theodore Vasiloudis created FLINK-2073:
--

 Summary: Add contribution guide for FlinkML
 Key: FLINK-2073
 URL: https://issues.apache.org/jira/browse/FLINK-2073
 Project: Flink
  Issue Type: New Feature
  Components: Documentation, Machine Learning Library
Reporter: Theodore Vasiloudis
 Fix For: 0.9


We need a guide for contributions to FlinkML in order to encourage the 
extension of the library, and provide guidelines for developers.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (FLINK-2072) Add a quickstart guide for FlinkML

2015-05-21 Thread Theodore Vasiloudis (JIRA)
Theodore Vasiloudis created FLINK-2072:
--

 Summary: Add a quickstart guide for FlinkML
 Key: FLINK-2072
 URL: https://issues.apache.org/jira/browse/FLINK-2072
 Project: Flink
  Issue Type: New Feature
  Components: Documentation, Machine Learning Library
Reporter: Theodore Vasiloudis
 Fix For: 0.9


We need a quickstart guide that introduces users to the core concepts of 
FlinkML to get them up and running quickly.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (FLINK-2056) Add guide to create a chainable predictor in docs

2015-05-20 Thread Theodore Vasiloudis (JIRA)
Theodore Vasiloudis created FLINK-2056:
--

 Summary: Add guide to create a chainable predictor in docs
 Key: FLINK-2056
 URL: https://issues.apache.org/jira/browse/FLINK-2056
 Project: Flink
  Issue Type: Sub-task
  Components: Machine Learning Library
Reporter: Theodore Vasiloudis
Assignee: Theodore Vasiloudis
Priority: Minor
 Fix For: 0.9


The upcoming API for pipelines should have good documentation to guide and 
encourage the implementation of more algorithms.

For this task we will create a guide that shows how the pipeline mechanism 
works through Scala implicits, and a full guide to implementing a chainable 
predictor, using Generalized Linear Models as an example.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (FLINK-2047) Rename CoCoA to SVM

2015-05-19 Thread Theodore Vasiloudis (JIRA)
Theodore Vasiloudis created FLINK-2047:
--

 Summary: Rename CoCoA to SVM
 Key: FLINK-2047
 URL: https://issues.apache.org/jira/browse/FLINK-2047
 Project: Flink
  Issue Type: Improvement
  Components: Machine Learning Library
Reporter: Theodore Vasiloudis
Assignee: Theodore Vasiloudis
Priority: Trivial
 Fix For: 0.9


The CoCoA algorithm as implemented functions as an SVM classifier.

As CoCoA mostly concerns the optimization process and not the actual learning 
algorithm, it makes sense to rename the learner to SVM, which users are more 
familiar with.

In the future we would like to use the CoCoA algorithm to solve more large 
scale optimization problems for other learning algorithms.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (FLINK-2035) Update 0.9 roadmap with ML issues

2015-05-18 Thread Theodore Vasiloudis (JIRA)
Theodore Vasiloudis created FLINK-2035:
--

 Summary: Update 0.9 roadmap with ML issues
 Key: FLINK-2035
 URL: https://issues.apache.org/jira/browse/FLINK-2035
 Project: Flink
  Issue Type: Improvement
  Components: Machine Learning Library
Reporter: Theodore Vasiloudis


The [current 
list|https://issues.apache.org/jira/browse/FLINK-2001?jql=project%20%3D%20FLINK%20AND%20fixVersion%20%3D%200.9%20AND%20component%20%3D%20%22Machine%20Learning%20Library%22]
 of issues linked with the 0.9 release is quite limited.

We should go through the current ML issues and assign fix versions, so that we 
have a clear view of what we expect to have in 0.9.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (FLINK-2034) Add vision and roadmap for ML library to docs

2015-05-18 Thread Theodore Vasiloudis (JIRA)
Theodore Vasiloudis created FLINK-2034:
--

 Summary: Add vision and roadmap for ML library to docs
 Key: FLINK-2034
 URL: https://issues.apache.org/jira/browse/FLINK-2034
 Project: Flink
  Issue Type: Improvement
  Components: Machine Learning Library
Reporter: Theodore Vasiloudis
Assignee: Theodore Vasiloudis


We should have a document describing the vision of the Machine Learning library 
in Flink and an up to date roadmap.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (FLINK-2015) Add ridge regression

2015-05-14 Thread Theodore Vasiloudis (JIRA)
Theodore Vasiloudis created FLINK-2015:
--

 Summary: Add ridge regression
 Key: FLINK-2015
 URL: https://issues.apache.org/jira/browse/FLINK-2015
 Project: Flink
  Issue Type: New Feature
  Components: Machine Learning Library
Reporter: Theodore Vasiloudis
Priority: Minor


Ridge regression is a linear regression model that imposes an L2 penalty on 
the size of the coefficients.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (FLINK-2014) Add LASSO regression

2015-05-14 Thread Theodore Vasiloudis (JIRA)
Theodore Vasiloudis created FLINK-2014:
--

 Summary: Add LASSO regression
 Key: FLINK-2014
 URL: https://issues.apache.org/jira/browse/FLINK-2014
 Project: Flink
  Issue Type: New Feature
  Components: Machine Learning Library
Reporter: Theodore Vasiloudis


LASSO is a linear model that uses L1 regularization.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (FLINK-2016) Add elastic net regression

2015-05-14 Thread Theodore Vasiloudis (JIRA)
Theodore Vasiloudis created FLINK-2016:
--

 Summary: Add elastic net regression
 Key: FLINK-2016
 URL: https://issues.apache.org/jira/browse/FLINK-2016
 Project: Flink
  Issue Type: New Feature
  Components: Machine Learning Library
Reporter: Theodore Vasiloudis
Priority: Minor


[Elastic net|http://en.wikipedia.org/wiki/Elastic_net_regularization] is a 
linear regression method that combines L2 and L1 regularization.
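
For reference, one common textbook parameterization of the elastic net objective 
(not tied to any planned Flink API):

{code}
\min_{w} \; \frac{1}{2n} \sum_{i=1}^{n} \left( y_i - w^{\top} x_i \right)^2
  \;+\; \lambda \left( \alpha \, \lVert w \rVert_1
  \;+\; \frac{1 - \alpha}{2} \, \lVert w \rVert_2^2 \right)
{code}

Setting alpha = 1 recovers LASSO (FLINK-2014) and alpha = 0 recovers ridge 
regression (FLINK-2015).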



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (FLINK-2003) Building on some encrypted filesystems leads to File name too long error

2015-05-12 Thread Theodore Vasiloudis (JIRA)
Theodore Vasiloudis created FLINK-2003:
--

 Summary: Building on some encrypted filesystems leads to File 
name too long error
 Key: FLINK-2003
 URL: https://issues.apache.org/jira/browse/FLINK-2003
 Project: Flink
  Issue Type: Bug
  Components: Build System
Reporter: Theodore Vasiloudis
Priority: Minor


The class names generated by the build system can be too long. Creating such 
long filenames is not possible on some encrypted filesystems, including encfs, 
which is what Ubuntu uses.

This is the same as this [Spark 
issue|https://issues.apache.org/jira/browse/SPARK-4820]

The workaround (taken from the linked issue) is to add in Maven under the 
compile options: 

{code}
  <arg>-Xmax-classfile-name</arg>
  <arg>128</arg>
{code}

And in SBT add:

{code}
scalacOptions in Compile ++= Seq("-Xmax-classfile-name", "128"),
{code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (FLINK-1995) The Flink project is categorized under Incubator in the Apache JIRA tracker

2015-05-08 Thread Theodore Vasiloudis (JIRA)
Theodore Vasiloudis created FLINK-1995:
--

 Summary: The Flink project is categorized under Incubator in the 
Apache JIRA tracker
 Key: FLINK-1995
 URL: https://issues.apache.org/jira/browse/FLINK-1995
 Project: Flink
  Issue Type: Bug
Reporter: Theodore Vasiloudis
Priority: Trivial


If you visit https://issues.apache.org/jira/browse/flink you can see that the 
Category for Flink (next to Lead) still says Incubator. 
I guess this should be fixed so that Flink gets its own category like the rest 
of the TLPs.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (FLINK-1965) Implement the Orthant-wise Limited Memory QuasiNewton optimization algorithm, a variant of L-BFGS that handles L1 regularization

2015-04-30 Thread Theodore Vasiloudis (JIRA)
Theodore Vasiloudis created FLINK-1965:
--

 Summary: Implement the Orthant-wise Limited Memory QuasiNewton 
optimization algorithm, a variant of L-BFGS that handles L1 regularization
 Key: FLINK-1965
 URL: https://issues.apache.org/jira/browse/FLINK-1965
 Project: Flink
  Issue Type: Wish
  Components: Machine Learning Library
Reporter: Theodore Vasiloudis
Priority: Minor


The Orthant-wise Limited Memory QuasiNewton (OWL-QN) is a quasi-Newton 
optimization method similar to L-BFGS that can handle L1 regularization. 

Implementing this would allow us to obtain sparse solutions while at the same 
time having the convergence benefits of a quasi-Newton method, when compared to 
stochastic gradient descent.

[Link to 
paper|http://research.microsoft.com/en-us/downloads/b1eb1016-1738-4bd5-83a9-370c9d498a03/]




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (FLINK-1960) Add comments and docs for withForwardedFields and related operators

2015-04-29 Thread Theodore Vasiloudis (JIRA)
Theodore Vasiloudis created FLINK-1960:
--

 Summary: Add comments and docs for withForwardedFields and related 
operators
 Key: FLINK-1960
 URL: https://issues.apache.org/jira/browse/FLINK-1960
 Project: Flink
  Issue Type: Improvement
  Components: docs, Documentation
Reporter: Theodore Vasiloudis
Priority: Minor


The withForwardedFields and related operators have no docs for the Scala API. 
It would be useful to have code comments and example usage in the docs.
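
A small illustration (not taken from the docs) of the kind of example that would 
help; it declares that a map preserves the first tuple field unchanged, so the 
optimizer can avoid unnecessary re-partitioning or re-sorting:

{code}
import org.apache.flink.api.scala._

val env = ExecutionEnvironment.getExecutionEnvironment
val pairs: DataSet[(Int, Double)] = env.fromElements((1, 2.0), (3, 4.0))

val scaled = pairs
  .map { case (key, value) => (key, value * 2) } // the key ("_1") passes through untouched
  .withForwardedFields("_1")
{code}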



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (FLINK-1901) Create sample operator for Dataset

2015-04-16 Thread Theodore Vasiloudis (JIRA)
Theodore Vasiloudis created FLINK-1901:
--

 Summary: Create sample operator for Dataset
 Key: FLINK-1901
 URL: https://issues.apache.org/jira/browse/FLINK-1901
 Project: Flink
  Issue Type: Improvement
  Components: Core
Reporter: Theodore Vasiloudis


In order to be able to implement Stochastic Gradient Descent and a number of 
other machine learning algorithms we need to have a way to take a random sample 
from a Dataset.

We need to be able to sample with or without replacement from the Dataset, 
choose the relative size of the sample, and set a seed for reproducibility.
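
A minimal sketch of Bernoulli (without-replacement) sampling with a fixed 
fraction and a seed, not the proposed sample operator itself:

{code}
import org.apache.flink.api.scala._
import org.apache.flink.api.common.functions.RichFilterFunction
import org.apache.flink.configuration.Configuration
import scala.util.Random

// Keep each element independently with probability `fraction`.
def sample[T](data: DataSet[T], fraction: Double, seed: Long): DataSet[T] =
  data.filter(new RichFilterFunction[T] {
    @transient private var rng: Random = _

    override def open(parameters: Configuration): Unit = {
      // Mix the subtask index into the seed so parallel instances draw
      // different, but reproducible, random streams.
      rng = new Random(seed + getRuntimeContext.getIndexOfThisSubtask)
    }

    override def filter(value: T): Boolean = rng.nextDouble() < fraction
  })
{code}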



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (FLINK-1889) Create optimization framework

2015-04-14 Thread Theodore Vasiloudis (JIRA)
Theodore Vasiloudis created FLINK-1889:
--

 Summary: Create optimization framework
 Key: FLINK-1889
 URL: https://issues.apache.org/jira/browse/FLINK-1889
 Project: Flink
  Issue Type: New Feature
  Components: Machine Learning Library
Reporter: Theodore Vasiloudis
Assignee: Theodore Vasiloudis


In order to implement Stochastic Gradient Descent and other optimization 
algorithms, we need an interface structure that the algorithms should comply to.

We can then use that structure to implement the various algorithms.

The purpose of this issue is to act as a root for the specific implementation 
of the optimization algorithms, and to discuss the design of the optimization 
package.
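
A rough sketch of the kind of interface structure meant here, using hypothetical 
names rather than the API that would eventually be merged: every solver exposes 
the same optimize signature, so learners such as multiple linear regression can 
be parameterized with any of them.

{code}
import org.apache.flink.api.scala._
import org.apache.flink.ml.common.LabeledVector
import org.apache.flink.ml.math.DenseVector

trait Solver {
  /** Returns the optimized weights for the given training data. */
  def optimize(data: DataSet[LabeledVector],
               initialWeights: DenseVector): DataSet[DenseVector]
}
{code}

Concrete implementations (e.g. stochastic gradient descent, L-BFGS) would then 
only differ in how they compute and apply updates inside optimize.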



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)