On 2 August 2012 07:02, Gael Varoquaux <[email protected]> wrote:
> Hi list,
>
> **Warning** This is a _long_ mail. Probably way too long, as people's
> attention is going to drop. The fact that it's that long probably
> expresses how confused I am.
> Core developers, please do read it: it's about a PR on which someone has
> been putting many, many, hours.
>
> This long running pull request on which the author has been putting a lot
> of
> effort is the Kalman filter pull request:
> https://github.com/scikit-learn/scikit-learn/pull/862
>
> I have been spending quite a while looking at this code and trying to
> come up with a fair review and guidelines on how to integrate it in the
> scikit. While several of us have given low-level feedback on things like
> coding style, I must confess that I am not completely happy with the big
> picture. I'd like to have a high level discussion on the mailing list on
> how such a codebase can be well integrated in the scikit, if it can.
>
> As it currently stands, the code does not feel very usable in the
> setting of scikit-learn, which is built on simple APIs and mostly on
> prediction or transformation.
>
> Antipatterns that I see
> ========================
>
> a. The Kalman filter parameters pretty much have to be specified. Learning
> from data is theoretically possible but:
>
> 1. It takes a lot of time (probably fixable using spectral algorithms
> http://www.cs.cmu.edu/~ggordon/spectral-learning/boots-slides.pdf )
>
> 2. The current parametrisation is not natural for this purpose
> (fixable, but requires more understanding than I currently have)
>
> 3. In my experience, the current implementation fails to learn
> reasonable parameters on what seem like simple problems.
>
> Specifying the parameters seems to me very problematic. Indeed, as can
> be seen from the example, the parameters to specify can hardly be
> guessed from vague prior knowledge and probably require some
> understanding of the theory behind the Kalman filter and some
> pen and paper work. This is fairly "unscikity".
>
> b. The current object can hardly work outside of the training samples.
> While the contributed implementation caters for missing data, which is
> a precious feature in itself, it can only compute out-of-sample
> predictions for very few points (i.e. points neighboring the given
> data points). This is a property of the Kalman filter, I believe. It
> is not a show-stopper in itself, but it limits the usefulness of the
> code in the scikit.
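To make antipattern (a) concrete, here is a minimal scalar Kalman filter in plain Python. This is my own illustrative sketch, not the code from the PR: even in one dimension the user must supply a transition model `A`, an observation model `H`, and two noise variances `Q` and `R`, none of which can be guessed from vague prior knowledge.

```python
def kalman_1d(observations, A=1.0, H=1.0, Q=1e-5, R=0.1,
              x0=0.0, P0=1.0):
    """Scalar Kalman filter (illustrative sketch only).

    A: state transition, H: observation model,
    Q: process noise variance, R: observation noise variance,
    x0, P0: initial state estimate and its variance.
    """
    x, P = x0, P0
    estimates = []
    for z in observations:
        # Predict step: propagate state and uncertainty.
        x = A * x
        P = A * P * A + Q
        # Update step: blend prediction with the new observation.
        K = P * H / (H * P * H + R)  # Kalman gain
        x = x + K * (z - H * x)
        P = (1 - K * H) * P
        estimates.append(x)
    return estimates
```

Note how the behavior hinges entirely on Q and R: a large observation noise R makes the filter distrust the data and converge slowly, which is exactly the kind of tuning that requires pen-and-paper understanding of the model.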
>
> Designing an 'Estimator' API for Kalman filters
> ===============================================
>
> To merge the code into scikit-learn, it has to implement an estimator
> interface that enables non-experts to use it as much as possible to
> solve the typical problems that scikit-learn tackles. I am struggling
> with how to do this best, as I don't know Kalman filters very well and
> have never used them on real problems.
>
> It seems to me that the question becomes: do we do 'transform' or
> 'predict'? Given an object that does some form of data processing, this
> is immediately what I may want to do.
>
> Kalman filters can do prediction, in the sense of extrapolation, but
> that will work only for a small number of time points, so I think we
> can set that aside for a while.
>
> In my eyes, Kalman filters can do data transforms, in two ways. First,
> they can do filtering, and that's probably the most natural and obvious.
> Second, they can output the state space, thus increasing the
> dimensionality of the feature space.
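If we go the 'transform' route, the API could look something like the following sketch. This is purely hypothetical: the class name, the `gain` parameter, and the trivial recursive smoother standing in for the real state-space math are all my invention. It only shows the fit/transform shape such an estimator would have to fill.

```python
class KalmanLikeTransformer:
    """Hypothetical scikit-learn-style transformer (illustration only).

    transform() maps each observation to a filtered estimate; the real
    state-space machinery is replaced by a trivial recursive smoother
    so that the API shape, not the filter, is the focus.
    """

    def __init__(self, gain=0.5):
        self.gain = gain

    def fit(self, X, y=None):
        # A real implementation would learn (or at least validate)
        # the state-space parameters here.
        self.x0_ = X[0]
        return self

    def transform(self, X):
        # Emit one filtered value per sample, like a 'filter' transform.
        x, out = self.x0_, []
        for z in X:
            x = x + self.gain * (z - x)
            out.append(x)
        return out
```

The second variant Gael mentions, outputting the state space, would simply make transform() return per-sample state vectors instead of filtered observations.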
>
> The real challenge is that Kalman filters have a notion of dependence
> across the samples. As long as we are dealing with one continuous
> measurement vector, things are simple, but as soon as we start giving
> new observations, we may want to relate them to those previously seen.
> We will most probably break cross-validation. This is a general problem
> that we will have with all models having some notion of time series.
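The cross-validation problem can at least be contained by splitting along time rather than at random. Here is a hypothetical forward-chaining splitter, a sketch of the idea rather than an existing scikit-learn utility:

```python
def forward_chaining_splits(n_samples, n_folds=3):
    """Yield (train_indices, test_indices) pairs where every training
    index precedes every test index, so no 'future' information leaks
    across the split (hypothetical sketch, not an existing utility).
    """
    fold = n_samples // (n_folds + 1)
    for k in range(1, n_folds + 1):
        train = list(range(0, k * fold))
        test = list(range(k * fold, (k + 1) * fold))
        yield train, test
```

Unlike shuffled K-fold, each fold here respects sample order, which is the minimum a time-series-aware scikit-learn would need.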
>
> To merge or not to merge?
> ===========================
>
> Reasons to merge
> -----------------
>
> 1. Kalman filters might come in handy to do feature transforms when
> working with time-series data. However, I don't do this myself, so I
> don't know how to design an API to make that possible.
>
> 2. The contributor is a fighter.
>
> 3. We already have HMMs, and Kalman filters directly relate to HMMs (it
> seems to me that HMMs currently have problem b but not problem a).
>
> Reasons not to merge
> ---------------------
>
> 1. "Antipattern" a (necessity to specificy complex model parameters) is
> really a killer for me. In the current situation, I find that it
> limits the usefulness of the code. That said, to go beyond the
> problems probably requires i) a reparametrisation of the problem, to be
> able to specify things like dimensionality of the state space ii)
> applying regularization, which might be beyond the contributor's initial
> goals.
>
> 2. As a community, we do not really have the knowledge to maintain this
> codebase if the contributor goes MIA (missing in action). I'd like to
> be somewhat convinced that the guy is going to use it in something
> close to 'production' settings.
>
> 3. scikits.statsmodels already has other algorithms for Kalman filters.
> Maybe the problem would fit better in the corresponding API and
> use cases.
>
> 4. If we go down that path, we must start thinking about how time
> series should be supported in scikit-learn. I am not at all opposed to
> that; actually, I'm quite thrilled. However, we need to keep in mind
> that it will make many things more complicated. For this to be an
> option, I think we need active developers who have these use cases and
> are ready to invest significant effort in this direction. If not, I am
> afraid that it will remain wishful thinking.
>
> I'd like to stress that I am not a believer in the strategy of merging
> as many features as possible without worrying about how they fit
> together. The bigger the scikit becomes, the harder it becomes to
> maintain and to give a clear picture to our users. In my experience, a
> project should be driven by a 'vision' that is simple to explain to
> potential users and that guides technical and API choices [*].
>
>
> So, should we merge or not? In a better world, I would probably say that
> such a code should be in a 'scikit-signal', and not 'scikit-learn'.
> However, there is no scikit-signal.
>
> I'd like a discussion, so that we can give the guy clear feedback.
>
> Thanks for your input!
>
> Gael
>
> [*] http://jamesshore.com/Agile-Book/vision.html
>
> http://www.rastinmehr.com/2009/09/14/does-your-software-project-have-a-vision-and-design-philosophy/
>
>
> ------------------------------------------------------------------------------
> Live Security Virtual Conference
> Exclusive live event will cover all the ways today's security and
> threat landscape has changed and how IT managers can respond. Discussions
> will include endpoint security, mobile security and the latest in malware
> threats. http://www.accelacomm.com/jaw/sfrnl04242012/114/50122263/
> _______________________________________________
> Scikit-learn-general mailing list
> [email protected]
> https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
>
Precursor: I say this as someone who knows nothing about Kalman filters.
I don't think the parameterisation issue is too big a deal. The
algorithm works well enough, in some cases at least, to be a
'well-known' algorithm, and I therefore think it should be included.
Many of our algorithms don't work well with default parameters on
arbitrary datasets (think of DBSCAN, which requires its eps parameter to
be carefully set, lest everything end up in one cluster, or in n_samples
clusters).
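To illustrate that eps sensitivity with a toy stand-in (this is not sklearn's DBSCAN, just a one-dimensional chaining sketch with noise and min_samples handling omitted):

```python
def toy_density_clusters(points, eps):
    # Toy 1-D stand-in for DBSCAN's density chaining: consecutive
    # points closer than eps fall into the same cluster. Noise and
    # min_samples handling are deliberately omitted.
    points = sorted(points)
    clusters = [[points[0]]]
    for p in points[1:]:
        if p - clusters[-1][-1] <= eps:
            clusters[-1].append(p)
        else:
            clusters.append([p])
    return clusters

data = [0.0, 0.1, 0.2, 5.0, 5.1, 5.2]
# With a well-chosen eps the two groups separate; with eps too
# large, everything collapses into a single cluster.
```

No default eps could serve both regimes; the user has to know something about the data's scale, just as a Kalman user has to know something about the model.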
I would say that, rather than trying to make everything "beginner
friendly", we should denote *some* algorithms as "beginner friendly",
and everything not so marked is assumed to require some knowledge to
run.
Those are my thoughts; hope it helps.
- Robert
--
Public key at: http://pgp.mit.edu/ Search for this email address and select
the key from "2011-08-19" (key id: 54BA8735)