Hello,
I have been attempting to speed up some code by using an sqlite database, but I'm not getting the performance gains I expected.

The use case:

I have text files containing data which may or may not include a header in the first line. Each line (other than the header) is a record, so all lines (when split on the relevant separator) should contain the same number of values. I need to generate new files in a very specific format: space-separated, with the header removed and integer codes substituted for the values in the parent file. For example, if the first column (i.e. [line.strip('\r\n').split()[0] for line in file]) contained 15 distinct strings, I would substitute those values with integers from 0 to 14 in the new file. The new file would contain a non-empty subset of the 'columns' in the original file, and might be conditioned on particular values of other columns.
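
To illustrate the substitution, the mapping for one column amounts to something like the following sketch (the file name and column index are just placeholders, not my actual code, and I've ignored the possible header line):

    # Sketch: assign integer codes to the distinct values of one column,
    # in order of first appearance.
    codes = {}
    with open('parent.txt') as f:            # placeholder file name
        for line in f:
            fields = line.strip('\r\n').split()
            if not fields:                   # skip empty lines
                continue
            value = fields[0]                # first column, as an example
            if value not in codes:
                codes[value] = len(codes)    # 0, 1, 2, ...
    # e.g. 15 distinct strings would end up coded 0 to 14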

My first effort read the parent file and generated a similar file containing integer codes. New files were then generated by iterating over the lines of the integer-code file, splitting them, doing the required selection and conditioning via list comprehensions, joining the resultant lists, and writing the results to a new file. My test file has 67 columns and over a million records, and just creating the file of integers took a few minutes. (I also need to skip empty lines and check for records of incorrect length.)
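
The per-file generation step is roughly this shape (the column indices, record length and condition here are made up for illustration):

    # Sketch: select and condition columns from the integer-coded file.
    wanted = [0, 2, 5]                         # columns for the new file (example)
    with open('coded.txt') as src, open('new.txt', 'w') as dst:
        for line in src:
            fields = line.split()
            if not fields:                     # skip empty lines
                continue
            if len(fields) != 67:              # skip records of the wrong length
                continue
            if fields[3] != '1':               # condition on another column (example)
                continue
            dst.write(' '.join(fields[i] for i in wanted) + '\n')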

I have partially implemented an alternative approach where I write the data to an sqlite database. The idea is that I will add extra columns for the integer codes and insert the integer codes only when they are required for a new file. But I've immediately been hit by the cost of inserting the data into the database: it takes around 80 seconds, compared to the 35 seconds needed just to parse the original file, skip empty lines and check the record lengths. I have tried iterating over the records (lists of strings generated by csv.reader) and inserting each in turn. I have also tried executemany(), passing the csv.reader as the second argument, and I have tried executing "PRAGMA synchronous=OFF". It still takes around 80 seconds.
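
For reference, the executemany() attempt looks more or less like this (the table layout, file names and separator are placeholders; the real table has 67 columns):

    import csv
    import sqlite3

    conn = sqlite3.connect('data.db')                                  # placeholder
    conn.execute("PRAGMA synchronous=OFF")
    conn.execute("CREATE TABLE records (c0 TEXT, c1 TEXT, c2 TEXT)")   # really 67 columns

    with open('parent.txt') as f:
        reader = csv.reader(f, delimiter=' ')          # whatever the real separator is
        sql = "INSERT INTO records VALUES (?, ?, ?)"   # one '?' per column
        conn.executemany(sql, reader)                  # reader yields lists of strings

    conn.commit()
    conn.close()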

I'm a bit rusty with SQL, so I'd appreciate any advice on how to speed this up. I seem to remember (from using MySQL years ago) that there was a way of dumping data from a text file into a table very quickly. If I could do that and run my data integrity checks afterwards, that would be great. (Dumping data efficiently from an sqlite table to a text file would also be handy for generating my new files.) Alternatively, if I could substantially speed up the inserts, that would also work. Any advice appreciated. TIA.

Duncan