This is a good point and it helps to further clarify the traps involved. Of course both your training set and your test data should be representative of THE market. How do you know that? You don't. You can identify this for English vs. Japanese easily, but with the market you never know. There are some approaches, however, to minimize the risk (note, however, that I say: minimize the risk, not ensure or such nonsense...).

In machine learning there is an approach called n-fold cross-validation (can easily be looked up). The idea is: you take all your data and partition it in n different ways into a training set and a test set. (E.g., n=10: you first partition the data into 10 equal-sized segments, then you use segments 1..9 for training and 10 for testing, then 1..8 plus 10 for training and 9 for testing, and so on, until segment 1 is the test set and 2..10 the training set.) A strategy is considered good if for all combinations it yields good results on the test set (not the training set).

Of course it will learn different parameter values in each case, but that is the point: we need to differentiate between the parameter settings and the strategy itself. Even for the worst strategy you may be able to find successful parameter settings so that it works on one specific data set (the one it was trained on). (You can often see this in the heat map: there is a small hot area and a vast area where the results are bad.) A good strategy, however, should be less dependent on the parameter settings - or, to put it differently, in the cross-validation setting: the parameters should not jump around too much among the different combinations.
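To make the rotation scheme concrete, here is a minimal sketch of the n-fold procedure in Python. The `fit` and `score` functions are hypothetical stand-ins (not part of any actual trading framework): `fit` learns parameters from the training segments, `score` evaluates them on the held-out segment.

```python
def n_fold_cross_validation(data, n, fit, score):
    """Rotate through n partitions: each segment serves exactly once as the test set."""
    size = len(data) // n
    folds = [data[i * size:(i + 1) * size] for i in range(n)]
    results = []
    for i in range(n):
        test = folds[i]
        # Training set = all segments except the i-th one.
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        params = fit(train)
        results.append((params, score(params, test)))
    return results

# Toy stand-ins: "fit" learns the mean of the training data,
# "score" measures the absolute error against the test-set mean.
data = list(range(20))
fit = lambda train: sum(train) / len(train)
score = lambda params, test: abs(params - sum(test) / len(test))
results = n_fold_cross_validation(data, 10, fit, score)
```

The point of looking at `results` as a whole, rather than at any single fold, is exactly the stability argument above: if the learned parameters or the test scores jump around wildly across the ten combinations, the "strategy" is likely just memorizing its training segment.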
Going back to the analogy, what this means is: if what I have to learn (because it is my data) is both English and Japanese, then for most combinations I will have both English and Japanese bits in the training set, and thus a good strategy will be successful on the test data. In a few unlucky combinations I will see only one language in training and only the other in the test (with a training-to-test ratio of 9:1 this can happen for at most one combination).

Unfortunately cross-validation is cumbersome to do manually, especially as data is added each week. Thus, I use a somewhat different approach: I have some data sets classified as representing a flat market, an up-market, a down-market, and so on. This is one of the reasons why I did the batch processor: once I have a strategy, I test it on such a batch of market situations to see whether the strategy (along with its parameter settings) does well in these market conditions, and the extension allows me to test it in one go instead of presenting each data set manually (of course I want the results per market condition). Of course my way of classifying market conditions may focus on the wrong criteria, and thus it may not give me the kind of assurance I would expect. This is a risk, and there n-fold cross-validation is certainly better.

Finally, it needs to be considered (along the lines of nonlinear's argument): what happens if all the data I have now is English data, so I have nothing else for my cross-validation, and then next I get a Japanese text? It's clear: unless I have a crystal ball to predict the future, there is nothing I can ever do about it. If I have never seen anything like it, I have no chance to learn about it. The basic idea of all learning is: I try to make the best use of the hidden information in the data I have seen so far, to generalize it and make predictions for the future. But if something is not there, I cannot learn it and generalize from it. So: no chance.
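The batch idea above can be sketched in a few lines. This is only an illustration of the workflow, not JBookTrader's actual API: the condition labels, the `backtest` function, and the toy "strategy" are all hypothetical stand-ins.

```python
def batch_test(strategy, datasets, backtest):
    """Run one strategy over several labelled market conditions; report results per label."""
    return {label: backtest(strategy, data) for label, data in datasets.items()}

# Toy price series, one per hand-classified market condition.
datasets = {
    "flat": [100, 101, 100, 99, 100],
    "up":   [100, 102, 104, 106, 108],
    "down": [100, 98, 96, 94, 92],
}

# Stand-in "backtest": just apply the strategy to the price series.
backtest = lambda strategy, prices: strategy(prices)

# Stand-in "strategy": net price change over the series.
trend = lambda prices: prices[-1] - prices[0]

report = batch_test(trend, datasets, backtest)
```

The per-condition breakdown is the whole point: a strategy that only looks good under one label has probably memorized that market regime.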
What can be said, however, is: training vs. test partitioning is way better than using all the data for training, due to the memorization problem, and of course cross-validation is way better than a single, fixed partitioning. (And as I have been a bit out of touch with the ML community for years, there might be even better ideas by now.) Please note, this holds only for the set of data I have available at a specific point in time. (Sorry for the loooong post.. :)

Klaus

On Dec 10, 06:52, nonlinear5 <[email protected]> wrote:
> > It is like with people: if you really want someone to understand
> > something, you will teach him (presenting examples is one approach
> > for teaching). But at the end you want to know whether he really
> > understood (i.e., got the principles and is able to use them to
> > solve new problems) or whether he just memorized. The only way to
> > find this out is to show him something he has not seen before...
>
> That's a good analogy, and it does demonstrate the benefits of out-
> of-sample testing. However, it also illuminates the trap. Let's take
> your example and modify it a little. Let's say you are teaching a
> child to read in English. She has made good progress thus far, so you
> decide to test whether she *really* learned how to read. You present
> a test, which happens to be a piece of Japanese poetry. Naturally,
> the child gets a "fail" on that test, and so do you as a teacher.
>
> The same thing may happen with your "in-sample, out-of-sample"
> approach. If your in-sample period is too short, your system may
> learn patterns while the market was "speaking" English, and apply
> those patterns after the market shifted to speaking Japanese.

--
You received this message because you are subscribed to the Google Groups "JBookTrader" group.
To post to this group, send email to [email protected].
To unsubscribe from this group, send email to [email protected].
For more options, visit this group at http://groups.google.com/group/jbooktrader?hl=en.
