Dear R-Devels,

I am currently designing a package intended to simplify the handling of market 
microstructure data (tick data, order data, etc.). As these data are usually 
very large and need to be reordered quite often (e.g. if data for several 
securities are batched together or if only a certain time range should be 
considered), the package needs to handle this. 

Before I start, I would like to mention some facts which made me decide to 
write my own package instead of using e.g. the packages bigmemory, 
highfrequency, zoo or xts: AFAIK bigmemory does not provide a way to handle 
data of different types (timestamps, strings and numerics) and to sort them 
appropriately; for this task databases offer better tools. Package 
highfrequency is designed to work with one specific data structure, whereas 
market microstructure data are much more varied. Packages zoo and xts offer a 
lot of versatility but do not offer the sorting ability needed for such big 
data. 

I would like to get some feedback on my decision and on the short design 
overview that follows.

My design idea is:

1. Base the package on S4 classes, with one class that handles reading data 
from external sources, structuring and reordering. Structuring is done with 
regard to specific data variables, i.e. security ID, company ID, timestamp, 
price, volume (not all have to be provided, but some surely exist in market 
microstructure data). The less important variables are collected in a slot 
@other and are only ordered along with the other variables. Something like 
this:

.mmstruct <- setClass('mmstruct', representation(
                        name      = "character",
                        index     = "array",
                        N         = "integer",
                        K         = "integer",
                        compiD    = "array",
                        secID     = "array",
                        tradetime = "POSIXlt",
                        flag      = "array",
                        price     = "array",
                        vol       = "array",
                        other     = "data.frame"))
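
As a minimal sketch of what "ordered in regard to the other variables" could mean, the toy example below computes one permutation from the key variables and applies it to every column. The data, the column name `venue` and the use of POSIXct (which stores a timestamp as a single number, unlike the list-based POSIXlt) are assumptions made for illustration only.

```r
# Toy tick data; column names follow the slots sketched above.
ticks <- data.frame(
  secID     = c("B", "A", "B", "A"),
  tradetime = as.POSIXct(c("2012-03-09 09:00:02", "2012-03-09 09:00:03",
                           "2012-03-09 09:00:01", "2012-03-09 09:00:01"),
                         tz = "UTC"),
  price     = c(10.2, 20.1, 10.1, 20.0),
  venue     = c("X", "Y", "X", "Y"),   # hypothetical "other" variable
  stringsAsFactors = FALSE
)

# One permutation, computed from the key variables only ...
perm <- order(ticks$secID, ticks$tradetime)

# ... is applied to every column, so the "other" variables follow along.
ticks_sorted <- ticks[perm, ]
rownames(ticks_sorted) <- NULL
```

This works as long as the data fit in memory; the database approach in point 2 is aimed at the case where they do not.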

2. To enable lightweight ordering, the class should basically create an SQLite 
database on construction and delete it when 'rm()' is called. Throughout its 
life an object holds the database path and can execute queries on the database 
tables. This way I can use the table sorting of SQLite (e.g. by constructing an 
index for each important variable). I assume this is faster and more efficient 
than programming something on my own - why reinvent the wheel? For this I would 
use VIRTUAL classes like:

.mmstructBASE <- setClass('mmstructBASE', representation(
                        dbName    = "character",
                        dbTable   = "character"))

.mmstructDB <- setClass('mmstructDB', representation(
                        conn      = "SQLiteConnection"),
                        contains  = c("mmstructBASE"))

.mmstruct <- setClass('mmstruct', representation(
                        name      = "character",
                        index     = "array",
                        N         = "integer",
                        K         = "integer",
                        compiD    = "array",
                        secID     = "array",
                        tradetime = "POSIXlt",
                        price     = "array",
                        vol       = "array",
                        other     = "data.frame"),
                        contains  = c("mmstructDB"))

The slots in the mmstruct class then hold a view (e.g. only the head()) of the 
data or can be used to hold data retrieved from the underlying database. 
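
One point that may matter for the "delete it on rm()" idea: rm() only removes a binding, and R runs cleanup code when an object is garbage collected, via reg.finalizer(), which accepts environments and external pointers rather than plain S4 objects. The following is a rough sketch under that assumption; `cleanup_db` and `new_db_handle` are hypothetical helper names, and the environment they return could be stored in a slot of class "environment".

```r
library(DBI)
library(RSQLite)

# Hypothetical cleanup routine: close the connection, then delete the file.
cleanup_db <- function(e) {
  if (dbIsValid(e$conn)) dbDisconnect(e$conn)
  unlink(e$dbName)
  invisible(NULL)
}

# Hypothetical constructor helper: opens a temporary SQLite file and ties its
# lifetime to an environment, which is what the finalizer is attached to.
new_db_handle <- function() {
  handle <- new.env()
  handle$dbName <- tempfile(fileext = ".sqlite")
  handle$conn   <- dbConnect(SQLite(), handle$dbName)
  reg.finalizer(handle, cleanup_db, onexit = TRUE)
  handle
}

h <- new_db_handle()
dbWriteTable(h$conn, "ticks", data.frame(price = c(10.0, 10.1)))
rm(h)             # removes the binding only; nothing is cleaned up yet
invisible(gc())   # the finalizer runs once the handle is garbage collected
```

Because S4 objects are copied by value while the environment is shared, every copy of an object would point to the same database, and the file would only be removed after the last copy is gone.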

3. The workflow would then be something like:

   a) The user reads in the data from an external source and gets a 
      data.frame from it. 

   b) This data.frame can then be used to construct an mmstruct object by 
      formatting the variables and reading them into the newly created 
      SQLite database. 

   c) Given the data structure in the database, the user can sort the data 
      by secID, timestamp etc. and can use several algorithms for cleaning 
      the data (package-specific, not in the database). 

   d) Example: The user makes a query to get only price from entries with 
      compID = "AA" and tradetime < "2012-03-09", or with tradetime 
      restricted to the first trading day of each month. The result can 
      then be converted, e.g. to a 'ts' object in R, by coercion. 

   e) In addition, the user can perform several estimations of market 
      microstructure models by calling package-specific functions. 
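
Steps a) to d) could look roughly like the sketch below, which uses the DBI and RSQLite packages directly rather than the planned class. The toy data, the table name `ticks`, the index name `idx_comp_time` and the storage of timestamps as ISO 8601 text (so that string comparison matches chronological order) are assumptions made for illustration.

```r
library(DBI)
library(RSQLite)

# a) Toy data standing in for what the user read from an external source.
ticks <- data.frame(
  compID    = c("AA", "BB", "AA", "AA"),
  tradetime = c("2012-03-08 09:00:01", "2012-03-08 09:00:02",
                "2012-03-08 09:00:05", "2012-03-09 09:00:00"),
  price     = c(10.0, 50.0, 10.1, 10.3),
  stringsAsFactors = FALSE
)

# b) Load the data into SQLite (in memory here, a file in the design above).
conn <- dbConnect(SQLite(), ":memory:")
dbWriteTable(conn, "ticks", ticks)

# c) Index the variables used for sorting and filtering.
dbExecute(conn, "CREATE INDEX idx_comp_time ON ticks (compID, tradetime)")

# d) Query the prices of one company before a given date, sorted by time.
res <- dbGetQuery(
  conn,
  "SELECT tradetime, price FROM ticks
    WHERE compID = :comp AND tradetime < :end
    ORDER BY tradetime",
  params = list(comp = "AA", end = "2012-03-09")
)
dbDisconnect(conn)

# Tick data are irregularly spaced, so this 'ts' is indexed by trade number.
price_ts <- ts(res$price)
```

Since 'ts' assumes regularly spaced observations, an irregular time series class may be the better coercion target for raw ticks; 'ts' fits once the data have been aggregated to a regular grid.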


Is there a major flaw in my design, something I haven't considered? I am sure 
there are researchers and developers on this list with much more experience. I 
would like to hear your opinions and ideas. I would learn from them and could 
perhaps arrive at a design which I can then implement for research on such data 
and models. 


Best

Simon



 
______________________________________________
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
