On Sun, 2018-08-19 at 17:16 -0400, James Hirschorn wrote:
> I plan to try it out myself, but I wanted to check here if running
> applyStrategy in a loop, while looping over different dates, will
> work? I could not find any examples of this.
>
> There are 2 reasons for wanting to do this: First of all, one could
> have a couple of years of tick data, which is too big to fit in
> memory for each symbol. Of course, I am assuming that the orders
> placed by the strategy are sparse enough so that the order_book
> generated by applyStrategy can still fit in memory.
>
> The second reason is that if this loop could moreover be run in
> parallel, then there could potentially be a 500x speed up for two
> years of data.
James,

The answer is 'it depends'.

There is a parallel version of applyStrategy in the sandbox on GitHub. I haven't touched it in several years, so I wouldn't trust that code; I mention it only as an example of what is theoretically possible. A better example, which is already parallelized and much more heavily used, is apply.paramset().

First, to expand on Ilya's answer, let's talk about what *is* possible.

It is possible to wrap a foreach loop around applyStrategy that would separate symbols onto different workers (though your hypothesized 500x speedup would require *at least* 500 worker nodes, spread out over several physical machines, using something like doRedis, which we have tested up to around 200 workers). This assumes that each symbol is completely independent, and that there is no interaction among the symbols on things like trade sizing, capital, or risk. The simplest way to do this would be to create a separate portfolio per symbol, so that each worker is completely independent. See apply.paramset() (which is also used in walk forward testing) for an example of a different kind of splitting and parallelization.

It is also possible, and we commonly do this, to segment the dates over which you run applyStrategy. As you hypothesized, a simple loop over date ranges, loading different non-conflicting time series, can be used to run each date range in succession. This, as you noted, works well when even 64, 128, or 512GB+ of RAM is not enough for all of your data. We've made a number of changes over the years to make quantstrat more memory efficient, but copies are still made when unavoidable, state is kept between the various nested apply* functions, and RAM use basically grows throughout the run of a strategy evaluation. So segmenting the use of market data by dates can help, though you may need to discard some intermediate results (like portions of the order book) to make everything fit.
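To make the per-symbol variant concrete, here is a minimal, untested sketch of the separate-portfolio pattern using foreach. It assumes a quantstrat strategy object already saved under the name "strat", that currency/instrument metadata and market data for each symbol are set up and visible to the workers, and hypothetical symbol names; real setup (initDate, account objects, data loading) is elided:

```r
## Sketch only: one portfolio per symbol, so each worker is fully
## independent and no state is shared across symbols.
library(quantstrat)
library(foreach)
library(doParallel)

registerDoParallel(cores = 4)   # or a doRedis backend to span machines

symbols <- c("SPY", "QQQ", "IWM")   # hypothetical symbol list

books <- foreach(sym = symbols, .packages = "quantstrat") %dopar% {
  portf <- paste0("port.", sym)       # separate portfolio per symbol
  initPortf(portf, symbols = sym)
  initOrders(portfolio = portf)
  applyStrategy(strategy = "strat", portfolios = portf)
  updatePortf(portf)
  getOrderBook(portf)   # return only the (small) result to the master
}
```

The same body could instead loop sequentially over date ranges, loading one range of market data per iteration and discarding it (rm() plus gc()) before the next, which is the date-segmented approach described above.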
In the first example, parallelizing by symbol, RAM is still your most likely constraint, since even very large machines rarely have more than about 16GB per core/thread.

The date-segmented approach has some wrinkles of its own. Again, you need to assess whether there is any interaction. Transactions cannot be added to a portfolio out of order, as the P&L is (potentially) dependent on prior transactions. So you may again need to create multiple portfolios and stitch the different period P&L together yourself.

So, in the 'don't do that' camp: don't try to apply transactions out of order; the trade blotter won't allow it. In the 'should work' camp are the several variations, described above, of splitting your computational problem so that it is amenable to looping and/or parallelization.

Regards,

Brian

--
Brian G. Peterson
http://braverock.com/brian/
Ph: 773-459-4973
IM: bgpbraverock

_______________________________________________
[email protected] mailing list
https://stat.ethz.ch/mailman/listinfo/r-sig-finance
-- Subscriber-posting only. If you want to post, subscribe first.
-- Also note that this is not the r-help list where general R questions should go.
