That's something I've wanted to discuss for a while and I'd better do it
before the new framework is set in stone.

I could make this really long so I'll try to keep it as concise as I can ;)

One main motivation for the ARL framework was the idea of working with a
"linear" data flow, compared to encapsulating objects, e.g.
    Standardize -> PCA -> NNet
instead of
    ChainedLearner(ChainedLearner(Standardize, PCA), NNet)
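To make the contrast concrete, here is a minimal sketch (the class names are stand-ins, not actual framework code, and `ChainedLearner` here is just a toy wrapper):

```python
# Hypothetical building blocks -- stand-ins for the real framework classes.
class Standardize: pass
class PCA: pass
class NNet: pass

class ChainedLearner:
    """Toy encapsulating wrapper: applies `first`, then feeds its output to `second`."""
    def __init__(self, first, second):
        self.first, self.second = first, second

# Linear style: the whole processing chain is one flat, easy-to-edit Python list.
linear_chain = [Standardize(), PCA(), NNet()]

# Encapsulated style: the same chain expressed as nested wrapper objects.
encapsulated = ChainedLearner(ChainedLearner(Standardize(), PCA()), NNet())
```

Inserting or removing a step in the linear form is a one-line list edit, whereas the nested form requires rebuilding the wrappers.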

I don't recall exactly the original motivations. What I've personally
found nice about such a linear approach is:
    - it's easier to read, because you have all (or at least big chunks)
      of your processing chain in a single Python list
    - it's easier to add / remove components from such a list

What I didn't like in the first version of this framework is mainly that
the linear approach is not consistently used, mostly when it comes to
splitting the data and doing a typical train/evaluation setup on these
splits. I've also had to use ugly hacks in situations where one element
in the chain had to access something computed by an element other than
the one just before it, which is where linear structure gets ugly.

So, I thought it would be nice if the new framework would let us do
things really smoothly in a linear way, typically in the form:
    Dataset -> Preproc -> Split -> Learner -> Evaluator
but of course, without enforcing such a rigid structure.

And I thought I had come up with a cool solution: use a stack, so that
an object along the line can consume more than just the single object
before it. For instance, you could write:
    Dataset1 -> Dataset2 -> ConcatDataset
and ConcatDataset would know it should pop the last two things in the
chain (note how I'm also suggesting that datasets could be used directly
as primitive building blocks, but that's another topic). But thinking
about it more, I decided it was a really bad idea, and a single word
should be enough to convince you: VMatLanguage.
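For concreteness, the stack idea I'm rejecting could be interpreted roughly like this: each element pushes its result, and an element such as ConcatDataset declares how many previous results it consumes (the names and the `arity` attribute are my invention, purely for illustration):

```python
class Dataset:
    """Leaf element: pushes its rows, consumes nothing from the stack."""
    arity = 0
    def __init__(self, rows):
        self.rows = rows
    def build(self, args):
        return self.rows

class ConcatDataset:
    """Pops the last two results off the stack and concatenates them."""
    arity = 2
    def build(self, args):
        combined = []
        for rows in args:
            combined.extend(rows)
        return combined

def run_chain(chain):
    """Evaluate a linear chain left to right, keeping results on a stack."""
    stack = []
    for element in chain:
        # Pop as many previous results as this element declares it needs,
        # restoring their original left-to-right order.
        args = [stack.pop() for _ in range(element.arity)][::-1]
        stack.append(element.build(args))
    return stack[-1]

# Dataset1 -> Dataset2 -> ConcatDataset
result = run_chain([Dataset([1, 2]), Dataset([3, 4]), ConcatDataset()])
```

This works mechanically, but as soon as elements reach arbitrarily far back into the stack, the "linear" chain stops being readable, which is the VMatLanguage lesson.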

So, my current opinion is that there is no clear winner between
linearization vs. encapsulation, and the best solution may be to use
both: linearization for "Markovian" chains (i.e. one component depends
only on what's been computed by the previous one), and encapsulation for
stuff that is more complex. E.g.:
    ConcatSet(Set1, Set2) -> PCA -> NNet
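In Python, the mixed style might look like this (again with hypothetical stand-in classes, just to show the shape):

```python
# Minimal stubs for the hypothetical building blocks.
class Set1: pass
class Set2: pass
class PCA: pass
class NNet: pass

class ConcatSet:
    """Encapsulates the non-Markovian part: merging several datasets."""
    def __init__(self, *sets):
        self.sets = sets

# Mixed style: encapsulation where the flow branches,
# a flat "Markovian" list everywhere else.
chain = [ConcatSet(Set1(), Set2()), PCA(), NNet()]
```

Each list entry depends only on the output of the previous one, so the list stays readable, while the branching lives inside `ConcatSet`.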

As far as splitting / testing is concerned, I still think it'd be nice
to have it linearized, because it looks better, and I believe
encapsulation may not be required. But I don't know what Xavier's plan
for this is, so I'll leave it for future discussion.

So, the feedback I'm looking for about this email is whether you guys
think we should try to linearize the data flow as much as possible (and
then find ways to keep things consistent and clear when the flow is not
naturally linear), or mix in some encapsulation (and then suggest how,
especially if you don't agree with me ;))

