On Oct 22, 2007, at 2:54 PM, Thomas Lumley wrote: > > I am trying to use RSQLite for storing data and I need to create > indexes on > two variables in the table. It appears from searching the web that > the CREATE > INDEX operation in SQLite is relatively slow for large files, and > this has been > my experience as well. > > The two index variables are crossed. One has about 350,000 levels > [yes, it's > genetic association data]. The other will have about 4000 levels > eventually, > but is up to about 100 now. When the data were entered they were > already ordered by this second index variable. > > Creating the index took about an hour on the 100-level, presorted > variable and about 12 hours on the 350,000-level unsorted > variable. I'm looking for advice on how to reduce this. Specifically > 1/ would it be faster if the variable with more levels was the > presorted one? > 2/ would it be faster or slower if the index were created before > adding all the data? > 3/ are there any options that can be set to speed up the indexing? > > The SQLite database will not be the primary archive for the data, > so optimizations that are risky in the case of power loss or > hardware failure are still acceptable. Since Bioconductor seems to > use SQLite a lot I'm hoping there is some simple solution.
I have not used RSQLite, but have some experience doing this thing for big sqlite databases using the command line client. Every database in sqlite has a number of parameters associated with it. You want to make sure that cache_size is at _most_ 2000 (yes, I know this is totally counterintuitive as it tells sqlite to use as little memory as possible). You also tell it to be non-synchronous. In sqlite the commands are sqlite> pragma default_cache_size = 2000 sqlite> pragma_synchronous = off You can test the setting of these parameters by just doing a sqlite> pragma default_cache_size As far as I remember, cache size can only be set when you create the database. I have no idea how RSQlite handles it. When I asked about this problem on the sqlite mailing list, the sqlite-creator said that this was a "locality of reference problem" and that it was being "worked on". And that I could search the archives for more info (which did not help me back then). I don't know whether or not sorting helps. Another thing to do is to check in what amount sqlite sits idle while doing I/O. It is probably impossible to avoid some idleness with such a thing, but it should of course be kept to a minimum. It is true that some of the other databases are probably much faster at creating indices. But in the post-index analysis, sqlite is a really fast database, probably amongst the fastest there is. It does not do a good job a converting your queries into smart queries, but if you are doing something simple, it is blazingly fast with the right user options. Kasper ______________________________________________ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel