On Oct 22, 2007, at 6:07 PM, Kasper Daniel Hansen wrote: > On Oct 22, 2007, at 2:54 PM, Thomas Lumley wrote: > >> >> I am trying to use RSQLite for storing data and I need to create >> indexes on >> two variables in the table. It appears from searching the web that >> the CREATE >> INDEX operation in SQLite is relatively slow for large files, and >> this has been >> my experience as well. >> >> The two index variables are crossed. One has about 350,000 levels >> [yes, it's >> genetic association data]. The other will have about 4000 levels >> eventually, >> but is up to about 100 now. When the data were entered they were >> already ordered by this second index variable. >> >> Creating the index took about an hour on the 100-level, presorted >> variable and about 12 hours on the 350,000-level unsorted >> variable. I'm looking for advice on how to reduce this. Specifically >> 1/ would it be faster if the variable with more levels was the >> presorted one? >> 2/ would it be faster or slower if the index were created before >> adding all the data? >> 3/ are there any options that can be set to speed up the indexing? >> >> The SQLite database will not be the primary archive for the data, >> so optimizations that are risky in the case of power loss or >> hardware failure are still acceptable. Since Bioconductor seems to >> use SQLite a lot I'm hoping there is some simple solution. > > I have not used RSQLite, but have some experience doing this thing > for big sqlite databases using the command line client. Every > database in sqlite has a number of parameters associated with it. You > want to make sure that cache_size is at _most_ 2000 (yes, I know this > is totally counterintuitive as it tells sqlite to use as little > memory as possible). You also tell it to be non-synchronous. In > sqlite the commands are > sqlite> pragma default_cache_size = 2000 > sqlite> pragma_synchronous = off > You can test the setting of these parameters by just doing a > sqlite> pragma default_cache_size > As far as I remember, cache size can only be set when you create the > database. I have no idea how RSQlite handles it. > > When I asked about this problem on the sqlite mailing list, the > sqlite-creator said that this was a "locality of reference problem" > and that it was being "worked on". And that I could search the > archives for more info (which did not help me back then). > > I don't know whether or not sorting helps. > > Another thing to do is to check in what amount sqlite sits idle while > doing I/O. It is probably impossible to avoid some idleness with such > a thing, but it should of course be kept to a minimum. > > It is true that some of the other databases are probably much faster > at creating indices. But in the post-index analysis, sqlite is a > really fast database, probably amongst the fastest there is. It does > not do a good job a converting your queries into smart queries, but > if you are doing something simple, it is blazingly fast with the > right user options.
Let me just emphasize (based on Seth's email) that the default_cache_size settings only is for an indexing command. For actual operations on the database using the index, like a select statement, you would want to increase the cache_size to something bigger. My comments are based on experience with a database with 315 *10^6 rows, and the importance of this is quite dependent on database size. Kasper ______________________________________________ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel