Re: Bulk inserts
On Wed, Jun 11, 2008 at 7:20 PM, Contact 42 [EMAIL PROTECTED] wrote:
> Jonathan Vanasco wrote:
>> write to a .txt log in some sort of standardized format; when it hits 10k lines, run a batch query
>
> what happens if an insert fails?

It continues with the rest of the batch. It'll tell you at the end how many succeeded and how many failed. You can read the MySQL documentation to figure out how to look up the error messages. My schema is set up so that I don't get failures unless something really awful happens.

-jj

--
I, for one, welcome our new Facebook overlords! http://jjinux.blogspot.com/

You received this message because you are subscribed to the Google Groups "pylons-discuss" group.
To post to this group, send email to pylons-discuss@googlegroups.com
To unsubscribe from this group, send email to [EMAIL PROTECTED]
For more options, visit this group at http://groups.google.com/group/pylons-discuss?hl=en
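[Editor's note: the failure accounting described above arrives as a one-line summary string after LOAD DATA INFILE / mysqlimport runs. A minimal sketch of pulling the counts out of it; the helper name is invented, and the "Records/Deleted/Skipped/Warnings" format is the one mysql prints in its summary line.]

```python
import re

def parse_load_data_info(info):
    # Hypothetical helper: extract the counts from the summary string
    # MySQL prints after a bulk load, e.g.
    # "Records: 10000  Deleted: 0  Skipped: 3  Warnings: 3".
    return dict((key.lower(), int(val))
                for key, val in re.findall(r'(\w+): (\d+)', info))

counts = parse_load_data_info('Records: 10000  Deleted: 0  Skipped: 3  Warnings: 3')
# rows that actually made it in: total records minus the skipped ones
inserted = counts['records'] - counts['skipped']
```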
Re: Bulk inserts
Depending on the locking behaviour of the bulk loader, it can make sense to load the data into simple preprocessing tables that have no validation logic or primary key constraints, and to do the final processing from those tables. This moves the error handling from the bulk loader into procedures inside the database, and sidesteps bulk-loading-specific issues.

We have also used this method to process ~10K-record batch imports from a CSV file uploaded through a web interface. We were using MSSQL, but the theory is the same and it has worked very well for us.
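[Editor's note: a minimal sketch of the staging-table idea above, using sqlite3 purely for illustration. The table names and validation rules are invented: raw rows land in an unconstrained staging table, and only rows that pass validation are moved into the real table.]

```python
import sqlite3

conn = sqlite3.connect(':memory:')
# staging table: no constraints, accepts whatever the bulk loader throws at it
conn.execute('CREATE TABLE staging_users (id TEXT, name TEXT)')
# final table: real types plus a primary key constraint
conn.execute('CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT NOT NULL)')

# bulk-load the raw rows, duplicates and junk included
rows = [('1', 'alice'), ('2', 'bob'), ('2', 'bob again'), ('x', None)]
conn.executemany('INSERT INTO staging_users VALUES (?, ?)', rows)

# "final processing" inside the database: move only rows that validate,
# silently skipping primary-key duplicates
conn.execute("""
    INSERT OR IGNORE INTO users (id, name)
    SELECT CAST(id AS INTEGER), name FROM staging_users
    WHERE id GLOB '[0-9]*' AND name IS NOT NULL
""")
good = conn.execute('SELECT COUNT(*) FROM users').fetchone()[0]
```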
Re: Bulk inserts
one of the things i HATE about mysql is that in the 5.0 branch, nothing failed. they specifically wanted everything to go through, and valued that behavior over integrity. in terms of what happens when something does fail now? i'm pretty sure the mysql and postgres docs have workarounds/solutions for that. i've never had to code for that, as i've only dealt with data dumps.

i meant to add in my original post: you can use a log rotation system to just change the 'write to' file every set interval, and then work off that. i think there is a python class that handles that automagically.

i didn't think of the threads issues myself. locking would need to be done as mentioned above. two things i'd bring up:

- doing something similar in mod_perl, i've had the logging done in a 'cleanup' handler that runs after the connection closes. this can often free up some blocking issues.
- IIRC, a friend tried something similar where his apps posted the data to a central twisted daemon on his lan. that daemon then handled all of the logging, rotating, and misc management.
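[Editor's note: the python class that handles rotation "automagically" is presumably logging.handlers.RotatingFileHandler (or its timed sibling TimedRotatingFileHandler for the every-set-interval variant). A small sketch with a deliberately tiny maxBytes so the rotation is visible; the file and logger names are invented.]

```python
import logging
import logging.handlers
import os
import tempfile

logdir = tempfile.mkdtemp()
logfile = os.path.join(logdir, 'inserts.log')

# Rotate after ~1 KB so the old chunk can be bulk-loaded while new
# writes go to a fresh file; in production you'd rotate on a time
# interval or a row-count threshold instead.
handler = logging.handlers.RotatingFileHandler(
    logfile, maxBytes=1024, backupCount=5)
handler.setFormatter(logging.Formatter('%(message)s'))

log = logging.getLogger('bulk')
log.setLevel(logging.INFO)
log.propagate = False
log.addHandler(handler)

for i in range(100):
    log.info('row\t%d\tsome-data' % i)

# the rotated-out files (inserts.log.1, ...) are what the batch
# importer would work off
rotated = [f for f in os.listdir(logdir) if f.startswith('inserts.log.')]
```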
Re: Bulk inserts
One other option might be to have each request dump its data to a random file with a .tmp extension in a queue directory. When the process is done, it renames the file (which should be atomic) to a .csv file, and then the controller exits.

You then have a second worker process checking the queue directory for .csv files, which either a) processes them into a larger holding file, which is then bulk imported, or b) sends them straight to the DB using the bulk insert command. You would just have to test to see which method is faster for you. Also, I am pretty sure MySQL's bulk import option allows you to specify more than one file to be imported, which might be a third option. Once the import is complete, the .csv files that have been imported get deleted.

Since you have one worker process that isn't multi-threaded, a lot of the angst about file integrity should be alleviated.

On Jun 11, 9:15 pm, Pete Wright [EMAIL PROTECTED] wrote:
> That's an awesome tip - thanks so much, guys, for pointing me in a far less painful direction. My only concern now is speed in working with the file and minimizing wait states. For example, I presumably don't want anything writing to the text file while I'm loading it into the database, and similarly I don't want to take time on every single request to check the size of the text file to see if I should run a LOAD DATA on it.
>
> So, I think what I'm going to end up doing here is converting the time into seconds since midnight and then mod 60 it to come up with a file name suffix. I'll also have another process running separately, processing and deleting unnecessary files. Do you guys see any problems with that method?
>
> On Jun 11, 6:58 pm, Jonathan Vanasco [EMAIL PROTECTED] wrote:
>> write to a .txt log in some sort of standardized format; when it hits 10k lines, run a batch query
>>
>> as mentioned above, mysql has its own insert format; postgres does too. both insert formats are FAR faster than going through individual inserts.
>> when i still used mysql about 4 years ago, i remember benching an insert of geographic data. it was something like 6 hours via inserts, 4 hours via grouped inserts, and 25 seconds using the native import format.
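[Editor's note: the queue-directory scheme above can be sketched like this. The directory layout and helper names are invented; os.rename is atomic on POSIX when the source and destination are on the same filesystem, which is what makes a half-written file invisible to the worker.]

```python
import csv
import glob
import os
import tempfile

queue_dir = tempfile.mkdtemp()

def dump_request_rows(rows):
    # Called per request: write rows to a unique .tmp file, then
    # atomically rename it to .csv so the worker never sees a partial file.
    fd, tmp_path = tempfile.mkstemp(suffix='.tmp', dir=queue_dir)
    with os.fdopen(fd, 'w') as f:
        csv.writer(f, lineterminator='\n').writerows(rows)
    os.rename(tmp_path, tmp_path[:-4] + '.csv')

def worker_sweep():
    # The single worker process: collect the finished .csv files for a
    # bulk import, then delete them from the queue.
    collected = []
    for path in sorted(glob.glob(os.path.join(queue_dir, '*.csv'))):
        with open(path) as f:
            collected.extend(list(csv.reader(f)))
        os.remove(path)  # imported, so drop it from the queue
    return collected

dump_request_rows([['1', 'alice'], ['2', 'bob']])
dump_request_rows([['3', 'carol']])
rows = worker_sweep()
```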
Re: Bulk inserts
Beware of multiple threads and/or processes writing to the file at the same time. It may make sense to use a mutex.

I agree with the earlier recommendations to use MySQL's LOAD DATA INFILE. It really is a lot faster, and you can completely remove Python from the picture when importing the file. Here's an example:

    mysqlimport \
        --user=user \
        --password=password \
        --columns=id,name \
        --compress \
        --fields-optionally-enclosed-by='' \
        --fields-terminated-by='\t' \
        --lines-terminated-by='\n' \
        --local \
        --lock-tables \
        --replace \
        --verbose \
        mydatabase mylocalfile.tsv

To write that tsv file, I used the csv module with:

    DEFAULT_KARGS = dict(dialect='excel-tab', lineterminator='\n')

    def create_default_reader(iterable):
        """Return a csv.reader with our default options."""
        return csv.reader(iterable, **DEFAULT_KARGS)

    def create_default_writer(iterable):
        """Return a csv.writer with our default options."""
        return csv.writer(iterable, **DEFAULT_KARGS)

There's one more thing to keep in mind: if you do a LOAD DATA INFILE from Python, make sure you pass local_infile=True to MySQLdb's connect function.

Happy Hacking!

-jj

On Wed, Jun 11, 2008 at 6:15 PM, Pete Wright [EMAIL PROTECTED] wrote:
> That's an awesome tip - thanks so much, guys, for pointing me in a far less painful direction. My only concern now is speed in working with the file and minimizing wait states. For example, I presumably don't want anything writing to the text file while I'm loading it into the database, and similarly I don't want to take time on every single request to check the size of the text file to see if I should run a LOAD DATA on it.
>
> So, I think what I'm going to end up doing here is converting the time into seconds since midnight and then mod 60 it to come up with a file name suffix. I'll also have another process running separately, processing and deleting unnecessary files. Do you guys see any problems with that method?
> On Jun 11, 6:58 pm, Jonathan Vanasco [EMAIL PROTECTED] wrote:
>> write to a .txt log in some sort of standardized format; when it hits 10k lines, run a batch query
>>
>> as mentioned above, mysql has its own insert format; postgres does too. both insert formats are FAR faster than going through individual inserts.
>>
>> when i still used mysql about 4 years ago, i remember benching an insert of geographic data. it was something like 6 hours via inserts, 4 hours via grouped inserts, and 25 seconds using the native import format.

--
I, for one, welcome our new Facebook overlords! http://jjinux.blogspot.com/
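[Editor's note: for the mutex warning at the top of the message above, a sketch for the threads-in-one-process case. The file path and helper name are invented; for separate *processes* a threading.Lock does nothing, and you'd need an OS-level file lock such as fcntl.flock instead.]

```python
import os
import tempfile
import threading

_log_lock = threading.Lock()

def append_row(path, line):
    # Serialize appends so rows from different threads can't interleave
    # mid-line and corrupt the shared batch file.
    with _log_lock:
        with open(path, 'a') as f:
            f.write(line + '\n')

path = os.path.join(tempfile.mkdtemp(), 'batch.tsv')
threads = [threading.Thread(target=append_row, args=(path, 'row\t%d' % i))
           for i in range(50)]
for t in threads:
    t.start()
for t in threads:
    t.join()

with open(path) as f:
    lines = f.read().splitlines()
```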
Re: Bulk inserts
If you use my method above, I would do something like this: rename the file to .err and send an email to an admin with as much detail as possible, while possibly saving said error information in a separate file in case the email doesn't come through.

The main concern here would be whether or not the records already inserted can be wrapped in a transaction and rolled back on error. I am guessing no, which means you will need to manually figure out what the problem with the insert was, fix it, remove all the records that were already inserted, and then rename the .err file to .csv for the worker thread to pick it up again.

On Jun 11, 10:20 pm, Contact 42 [EMAIL PROTECTED] wrote:
> Jonathan Vanasco wrote:
>> write to a .txt log in some sort of standardized format; when it hits 10k lines, run a batch query
>
> what happens if an insert fails?
>
> hd
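[Editor's note: a sketch of the .err quarantine step described above. The file names are invented, and the admin email via smtplib is left out since it depends on local mail setup; only the rename-plus-sidecar part is shown.]

```python
import os
import tempfile

def quarantine_batch(csv_path, error_message):
    # Rename .csv -> .err so the worker stops picking the file up, and
    # stash the error details next to it in case the admin email is lost.
    err_path = csv_path[:-4] + '.err'
    os.rename(csv_path, err_path)
    with open(err_path + '.txt', 'w') as f:
        f.write(error_message)
    return err_path

workdir = tempfile.mkdtemp()
batch = os.path.join(workdir, 'batch-0001.csv')
with open(batch, 'w') as f:
    f.write('1\talice\n')

err = quarantine_batch(batch, 'duplicate key on row 7')
# to retry after fixing the data: os.rename(err, err[:-4] + '.csv')
```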