Re: Bulk inserts

2008-06-12 Thread Shannon -jj Behrens

On Wed, Jun 11, 2008 at 7:20 PM, Contact 42 [EMAIL PROTECTED] wrote:
 Jonathan Vanasco wrote:
  write to a .txt log in some sort of standardized format; when it hits
  10k lines, run a batch query
 what happens if an insert fails?

It continues with the rest of the batch.  It'll tell you at the end
how many rows succeeded and how many failed.  You can check the MySQL
documentation for how to look up the error messages.  My schema is set
up so that I don't get failures unless something really awful happens.
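
For what it's worth, here's a rough sketch of how you might inspect
what happened after the load from Python.  It assumes you already have
an open MySQLdb cursor that just ran the LOAD DATA statement:

def report_load_result(cursor):
    """Print MySQL's per-statement feedback for the last LOAD DATA.

    Assumes `cursor` is an open MySQLdb cursor that just executed the
    load.
    """
    # rowcount is how many rows the statement affected; SHOW WARNINGS
    # lists the per-row problems (truncated values, skipped duplicates,
    # etc.).
    print 'rows affected:', cursor.rowcount
    cursor.execute("SHOW WARNINGS")
    for level, code, message in cursor.fetchall():
        print '%s %s: %s' % (level, code, message)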

-jj

-- 
I, for one, welcome our new Facebook overlords!
http://jjinux.blogspot.com/




Re: Bulk inserts

2008-06-12 Thread rcs_comp

 Depending on the locking behaviour of the bulk loader, it could
 make sense to load the data into simple preprocessing tables
 which don't have any validation logic or primary key constraints.

 From these tables the final processing is then done.  This would
 move the error handling from the bulk loader into procedures
 inside the database, and solve the bulk-loading-specific issues.

We have also used this method to process ~10K-record batch imports
from a CSV file uploaded through a web interface.  We were using
MSSQL, but the theory is the same, and it has worked very well
for us.
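
For illustration, here is a rough sketch of that staging-table flow
against MySQL from Python; the table names (staging_events, events),
the columns, the file path, and the validation condition are all
made-up placeholders:

import MySQLdb

# Connection details are placeholders; local_infile is needed for
# LOAD DATA LOCAL INFILE.
conn = MySQLdb.connect(host='localhost', user='user', passwd='password',
                       db='mydatabase', local_infile=True)
cur = conn.cursor()

# 1. Bulk-load the raw rows into a staging table with no constraints.
cur.execute("""
    LOAD DATA LOCAL INFILE '/tmp/batch.csv'
    INTO TABLE staging_events
    FIELDS TERMINATED BY ','
    LINES TERMINATED BY '\\n'
    (id, name)
""")

# 2. Move only rows that pass validation into the real table; failures
#    are now ordinary SQL errors you can handle inside the database.
cur.execute("""
    INSERT INTO events (id, name)
    SELECT id, name FROM staging_events
    WHERE name IS NOT NULL AND name <> ''
""")
conn.commit()

# 3. Clear the staging table for the next batch.
cur.execute("TRUNCATE TABLE staging_events")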





Re: Bulk inserts

2008-06-12 Thread Jonathan Vanasco

one of the things i HATE about mysql is that in the 5.0 branch, nothing
failed.  they specifically wanted everything to go through, and valued
that behavior over integrity.

as for what happens when something does fail now: i'm pretty sure the
mysql and postgres docs have workarounds/solutions for that.  i've
never had to code for it, as i've only dealt with data dumps.

i meant to add in my original post: you can use a log rotation system
to change the 'write to' file every set interval, and then work off
that.  i think there is a python class that handles that automagically.
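
the stdlib class i'm thinking of is probably
logging.handlers.TimedRotatingFileHandler.  a rough sketch (the file
name and the 10-minute interval are just placeholder choices):

import logging
import logging.handlers

# rotate the batch file every 10 minutes; the importer works off the
# rotated copies while requests keep appending to the current file.
handler = logging.handlers.TimedRotatingFileHandler(
    'pending_inserts.log', when='M', interval=10, backupCount=12)
handler.setFormatter(logging.Formatter('%(message)s'))

batch_log = logging.getLogger('bulk_inserts')
batch_log.setLevel(logging.INFO)
batch_log.addHandler(handler)

# each request appends one tab-separated line
batch_log.info('\t'.join(['42', 'some name']))

note that logging handlers take a lock around each write, so this also
covers the in-process threading concern (though not multiple processes).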

i didn't think of the threading issues myself.  locking would need to
be done, as mentioned above.

two things i'd bring up:
  - doing something similar in mod_perl, i've had the logging done in
a 'cleanup' handler that runs after the connection closes.  this can
often free up some blocking issues.
  - IIRC, a friend tried something similar where his apps posted the
data to a central twisted daemon on his lan.  that daemon then handled
all of the logging, rotating, and misc management.



Re: Bulk inserts

2008-06-11 Thread rcs_comp

One other option might be to have each request dump its data to a
random file with a .tmp extension in a queue directory.  When the
request is done writing, it renames the file (which should be atomic)
to a .csv file.  Then the controller exits.

You then have a second worker process checking the queue directory
for .csv files and either a) processing them into a larger holding
file, which would then be bulk imported, or b) sending them straight
to the DB using the bulk insert command.  You would just have to test
which method is faster for you.  Also, I am pretty sure MySQL's bulk
import option allows you to specify more than one file to be imported,
which might be a third option.  Once the import is complete, the .csv
files that have been imported get deleted.  Since there is only one
worker process and it isn't multi-threaded, a lot of the angst about
file integrity should be alleviated.
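
A rough sketch of the request-side dump-and-rename step, assuming a
./queue directory and tab-separated rows (all names here are made up):

import csv
import os
import tempfile

QUEUE_DIR = 'queue'  # directory the worker process watches

def dump_rows(rows):
    # Write to a random .tmp file first; the worker only looks at
    # *.csv, so the rename below is the "commit" step.
    fd, tmp_path = tempfile.mkstemp(suffix='.tmp', dir=QUEUE_DIR)
    f = os.fdopen(fd, 'wb')
    try:
        writer = csv.writer(f, dialect='excel-tab', lineterminator='\n')
        writer.writerows(rows)
    finally:
        f.close()
    # os.rename is atomic as long as source and target are on the same
    # filesystem, so the worker never sees a half-written file.
    os.rename(tmp_path, tmp_path[:-4] + '.csv')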

On Jun 11, 9:15 pm, Pete Wright [EMAIL PROTECTED] wrote:
 That's an awesome tip - thanks so much guys for pointing me in a far
 less painful direction. My only concern now is speed in working with
 the file and minimizing wait states. For example, I presumably don't
 want anything writing to the text file while I'm loading it into the
 database, and similarly I don't want to take time on every single
 request to check the size of the text file to see if I should run a
 LOAD DATA on it.

 So, I think what I'm going to end up doing here is converting time
 into seconds since midnight and then mod 60 it to come up with a file
 name suffix. I'll also have another process running separately
 processing and deleting unnecessary files. Do you guys see any problems
 with that method?

 On Jun 11, 6:58 pm, Jonathan Vanasco [EMAIL PROTECTED] wrote:

  write to a .txt log in some sort of standardized format; when it hits
  10k lines, run a batch query

  as mentioned above, mysql has its own insert format; postgres does
  too.  both insert formats are FAR faster than going through individual
  inserts.

  when i still used mysql about 4 years ago, i remember benching an
  insert of geographic data.  it was something like 6 hours via inserts,
  4 hours via grouped inserts, and 25 seconds using the native import
  format.



Re: Bulk inserts

2008-06-11 Thread Shannon -jj Behrens

Beware of multiple threads and/or processes writing to the file at the
same time.  It may make sense to use a mutex.
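
If all the writers are threads in a single process, a minimal sketch of
that mutex looks something like this (a multi-process setup would need
real file locking instead, e.g. fcntl.flock):

import threading

_log_lock = threading.Lock()

def append_line(path, line):
    """Append one line to the shared batch file, one writer at a time."""
    _log_lock.acquire()
    try:
        f = open(path, 'ab')
        try:
            f.write(line + '\n')
        finally:
            f.close()
    finally:
        _log_lock.release()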

I agree with the earlier recommendations to use MySQL's LOAD DATA
INFILE.  It really is a lot faster.

You can completely remove Python from the picture when importing the
file.  Here's an example:

mysqlimport \
--user=user \
--password=password \
--columns=id,name \
--compress \
--fields-optionally-enclosed-by='' \
--fields-terminated-by='\t' \
--lines-terminated-by='\n' \
--local \
--lock-tables \
--replace \
--verbose \
mydatabase mylocalfile.tsv

To write that tsv file, I used the csv module with:

import csv

DEFAULT_KARGS = dict(dialect='excel-tab', lineterminator='\n')


def create_default_reader(iterable):
    """Return a csv.reader with our default options."""
    return csv.reader(iterable, **DEFAULT_KARGS)


def create_default_writer(iterable):
    """Return a csv.writer with our default options."""
    return csv.writer(iterable, **DEFAULT_KARGS)

There's one more thing to keep in mind.  If you run the LOAD DATA
statement from Python, make sure you pass local_infile=True to
MySQLdb's connect function.
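
Roughly like this; the connection details, table name, and file name
are placeholders:

import MySQLdb

# local_infile=True is what allows LOAD DATA LOCAL INFILE from this
# client connection.
conn = MySQLdb.connect(host='localhost', user='user', passwd='password',
                       db='mydatabase', local_infile=True)
cur = conn.cursor()
cur.execute("""
    LOAD DATA LOCAL INFILE 'mylocalfile.tsv'
    INTO TABLE mytable
    FIELDS TERMINATED BY '\\t'
    LINES TERMINATED BY '\\n'
""")
conn.commit()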

Happy Hacking!
-jj

On Wed, Jun 11, 2008 at 6:15 PM, Pete Wright [EMAIL PROTECTED] wrote:

 That's an awesome tip - thanks so much guys for pointing me in a far
 less painful direction. My only concern now is speed in working with
 the file and minimizing wait states. For example, I presumably don't
 want anything writing to the text file while I'm loading it into the
 database, and similarly I don't want to take time on every single
 request to check the size of the text file to see if I should run a
 LOAD DATA on it.

 So, I think what I'm going to end up doing here is converting time
 into seconds since midnight and then mod 60 it to come up with a file
 name suffix. I'll also have another process running separately
 processing and deleting unnecessary files. Do you guys see any problems
 with that method?


 On Jun 11, 6:58 pm, Jonathan Vanasco [EMAIL PROTECTED] wrote:
 write to a .txt log in some sort of standardized format; when it hits
 10k lines, run a batch query

 as mentioned above, mysql has its own insert format; postgres does
 too.  both insert formats are FAR faster than going through individual
 inserts.

 when i still used mysql about 4 years ago, i remember benching an
 insert of geographic data.  it was something like 6 hours via inserts,
 4 hours via grouped inserts, and 25 seconds using the native import
 format.
 




-- 
I, for one, welcome our new Facebook overlords!
http://jjinux.blogspot.com/




Re: Bulk inserts

2008-06-11 Thread rcs_comp

If you use my method above, I would do something like this: rename the
file to .err and send an email to an admin with as much detail as
possible, while also saving the error information to a separate file in
case the email doesn't come through.  The main concern here is whether
the records already inserted can be wrapped in a transaction and rolled
back on error.  I am guessing not.  That means you would need to
manually figure out what the problem with the insert was, fix it,
remove the records that were already inserted, and then rename the
.err file back to .csv so the worker thread picks it up again.
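
A rough sketch of that failure path; the admin address, SMTP host, and
paths are placeholder assumptions:

import os
import smtplib
from email.mime.text import MIMEText

ADMIN = 'admin@example.com'   # assumed admin address
SMTP_HOST = 'localhost'       # assumed local mail relay

def handle_failed_batch(csv_path, error_text):
    """Park a failed batch as .err and mail the details to an admin."""
    err_path = os.path.splitext(csv_path)[0] + '.err'
    os.rename(csv_path, err_path)

    # Keep a copy of the error on disk in case the email never arrives.
    f = open(err_path + '.log', 'w')
    try:
        f.write(error_text)
    finally:
        f.close()

    msg = MIMEText('Batch import failed for %s:\n\n%s'
                   % (err_path, error_text))
    msg['Subject'] = 'Bulk import failure'
    msg['From'] = ADMIN
    msg['To'] = ADMIN
    server = smtplib.SMTP(SMTP_HOST)
    try:
        server.sendmail(ADMIN, [ADMIN], msg.as_string())
    finally:
        server.quit()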

On Jun 11, 10:20 pm, Contact 42 [EMAIL PROTECTED] wrote:
 Jonathan Vanasco wrote:
  write to a .txt log in some sort of standardized format; when it hits
  10k lines, run a batch query

 what happens if an insert fails?

 hd