Hello,
I have been attempting to speed up some code by using an sqlite database, but I'm not getting the performance gains I expected.

The use case:

I have text files containing data which may or may not include a header in the first line. Each line (other than the header) is a record, so all lines (when split on the relevant separator) should contain the same number of values. I need to generate new files in a very specific format: space-separated, with the header removed and integer codes substituted for the values in the parent file. For example, if the first column (i.e. [line.strip('\r\n').split()[0] for line in file]) contained 15 distinct strings, I would substitute those values with integers from 0 to 14 in the new file. The new file would contain a non-empty subset of the 'columns' in the original file, and might be conditioned on particular values of other columns.
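
To illustrate the substitution, the mapping for one column amounts to something like the following sketch (the file name and column index are just placeholders, not my actual code, and I've ignored the possible header line):

    # Sketch: assign integer codes to the distinct values of one column,
    # in order of first appearance.
    codes = {}
    with open('parent.txt') as f:            # placeholder file name
        for line in f:
            fields = line.strip('\r\n').split()
            if not fields:                   # skip empty lines
                continue
            value = fields[0]                # first column, as an example
            if value not in codes:
                codes[value] = len(codes)    # 0, 1, 2, ...
    # e.g. 15 distinct strings would end up coded 0 to 14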

My first effort read the parent file and generated a similar file containing integer codes. New files were then generated by iterating over the lines of the integer-code file, splitting them, doing the required selection and conditioning via list comprehensions, joining the resultant lists, and writing the results to a new file. My test file has 67 columns and over a million records, and just creating the file of integers took a few minutes. (I also need to skip empty lines and check for records of incorrect length.)
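
The per-file generation step is roughly this shape (the column indices, record length and condition here are made up for illustration):

    # Sketch: select and condition columns from the integer-coded file.
    wanted = [0, 2, 5]                         # columns for the new file (example)
    with open('coded.txt') as src, open('new.txt', 'w') as dst:
        for line in src:
            fields = line.split()
            if not fields:                     # skip empty lines
                continue
            if len(fields) != 67:              # skip records of the wrong length
                continue
            if fields[3] != '1':               # condition on another column (example)
                continue
            dst.write(' '.join(fields[i] for i in wanted) + '\n')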

I have partially implemented an alternative approach where I write the data to an sqlite database. The idea is that I will add extra columns for the integer codes and insert the integer codes only when they are required for a new file. But I've immediately been hit by the cost of inserting the data into the database: it takes around 80 seconds, compared to the 35 seconds needed just to parse the original file, skip empty lines and check the record lengths. I have tried iterating over the records (lists of strings generated by csv.reader) and inserting each in turn. I have also tried executemany(), passing the csv.reader as the second argument, and I have tried executing "PRAGMA synchronous=OFF". It still takes around 80 seconds.
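
For reference, the executemany() attempt looks more or less like this (the table layout, file names and separator are placeholders; the real table has 67 columns):

    import csv
    import sqlite3

    conn = sqlite3.connect('data.db')                                  # placeholder
    conn.execute("PRAGMA synchronous=OFF")
    conn.execute("CREATE TABLE records (c0 TEXT, c1 TEXT, c2 TEXT)")   # really 67 columns

    with open('parent.txt') as f:
        reader = csv.reader(f, delimiter=' ')          # whatever the real separator is
        sql = "INSERT INTO records VALUES (?, ?, ?)"   # one '?' per column
        conn.executemany(sql, reader)                  # reader yields lists of strings

    conn.commit()
    conn.close()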

I'm a bit rusty with SQL, so I'd appreciate any advice on how to speed this up. I seem to remember (from using MySQL years ago) that there was a way of dumping data from a text file into a table very quickly. If I could do that and run my data integrity checks afterwards, that would be great. (Dumping data efficiently from an sqlite table to a text file would also be handy for generating my new files.) Alternatively, if I could substantially speed up the inserts, that would also work. Any advice appreciated. TIA.

Duncan