Re: Parsing large CSVs and inserting in MSSQL
On Tuesday 26 May 2009, Jason Durham wrote: I have been tasked with providing a mechanism to upload/parse/perform a couple of checks/save CSV data. The current test file is 28,000 records This comes up now and again on the list, so I bloged my response to keep it to hand : http://www.rachaelandtom.info/content/bulk-loading-data-databases -- Helping to synergistically lead exceptional slick value-added interfaces as part of the IT team of the year, '09 and '08 Tom Chiverton Developer Tel: +44 0161 618 5032 Fax: +44 0161 618 5099 tom.chiver...@halliwells.com 3 Hardman Square, Manchester, M3 3EB www.Halliwells.com This email is sent for and on behalf of Halliwells LLP. Halliwells LLP is a limited liability partnership registered in England and Wales under registered number OC307980 whose registered office address is at Halliwells LLP, 3 Hardman Square, Spinningfields, Manchester, M3 3EB. A list of members is available for inspection at the registered office together with a list of those non members who are referred to as partners. We use the word partner to refer to a member of the LLP, or an employee or consultant with equivalent standing and qualifications. Regulated by the Solicitors Regulation Authority. CONFIDENTIALITY This email is intended only for the use of the addressee named above and may be confidential or legally privileged. If you are not the addressee you must not read it and must not use any information contained in nor copy it nor inform any person other than Halliwells LLP or the addressee of its existence or contents. If you have received this email in error please delete it and notify Halliwells LLP IT Department on 0870 365 2500. For more information about Halliwells LLP visit www.Halliwells.com. ~| Want to reach the ColdFusion community with something they want? Let them know on the House of Fusion mailing lists Archive: http://www.houseoffusion.com/groups/cf-talk/message.cfm/messageid:322820 Subscription: http://www.houseoffusion.com/groups/cf-talk/subscribe.cfm Unsubscribe: http://www.houseoffusion.com/cf_lists/unsubscribe.cfm?user=11502.10531.4
Re: Parsing large CSVs and inserting in MSSQL
I have found that using the Java Line Reader works great. It doesn't load a text file into memory but will read it line by line which is perfect for a CSV. There is a CFC out there called filereader.cfc that should get you started. J.J. On Tue, May 26, 2009 at 10:52 AM, Jason Durham jdur...@cti-stl.com wrote: I have been tasked with providing a mechanism to upload/parse/perform a couple of checks/save CSV data. The current test file is 28,000 records (~16 columns). I expect this file to grow in size each month, to an indeterminate number (hundreds of thousands). It seems that I'll need to set a limit on the number of rows being processed at each time. The two obvious options (in my eyes)... #1 break the CSV up into several smaller CSV files and write them back to the file system before processing each individually (all checks performed in memory) #2 set a large timeout value and attempt to load all of the records from the CSV into a temp table, retrieving back smaller, more manageable record sets Currently, I'm doing #2 but not limiting the number of records (28,000). The data is getting into the temp table, being retrieved, but the final insert is causing ColdFusion to hang. The final insert consists of one cfquery with an internal loop over the temp table query object. The final insert is within a single transaction so I can rollback if it fails. This process will only take place on my development machine, connecting to the live database. 2.6 Core 2 Duo, 4GB, Vista 32 with a 100MB connection to production server running MS SQL 2005 Express. Perhaps I need to take a step back and consider alternatives. Btw, the checks I'm performing are simply that 2 of the columns do not contain a unique value (FKs). I'm currently doing this with QoQs but I could probably push this off with a fairly complicated query to MS SQL. What do the CF wizards suggest? TIA ~| Want to reach the ColdFusion community with something they want? Let them know on the House of Fusion mailing lists Archive: http://www.houseoffusion.com/groups/cf-talk/message.cfm/messageid:322778 Subscription: http://www.houseoffusion.com/groups/cf-talk/subscribe.cfm Unsubscribe: http://www.houseoffusion.com/cf_lists/unsubscribe.cfm?user=89.70.4
Re: Parsing large CSVs and inserting in MSSQL
Check out: http://msdn.microsoft.com/en-us/library/ms188365.aspx For information about preparing data for bulk import, such as the requirements for importing data from a CSV data file, see... On Tue, May 26, 2009 at 11:52 AM, Jason Durham jdur...@cti-stl.com wrote: I have been tasked with providing a mechanism to upload/parse/perform a couple of checks/save CSV data. The current test file is 28,000 records (~16 columns). I expect this file to grow in size each month, to an indeterminate number (hundreds of thousands). It seems that I'll need to set a limit on the number of rows being processed at each time. The two obvious options (in my eyes)... #1 break the CSV up into several smaller CSV files and write them back to the file system before processing each individually (all checks performed in memory) #2 set a large timeout value and attempt to load all of the records from the CSV into a temp table, retrieving back smaller, more manageable record sets Currently, I'm doing #2 but not limiting the number of records (28,000). The data is getting into the temp table, being retrieved, but the final insert is causing ColdFusion to hang. The final insert consists of one cfquery with an internal loop over the temp table query object. The final insert is within a single transaction so I can rollback if it fails. This process will only take place on my development machine, connecting to the live database. 2.6 Core 2 Duo, 4GB, Vista 32 with a 100MB connection to production server running MS SQL 2005 Express. Perhaps I need to take a step back and consider alternatives. Btw, the checks I'm performing are simply that 2 of the columns do not contain a unique value (FKs). I'm currently doing this with QoQs but I could probably push this off with a fairly complicated query to MS SQL. What do the CF wizards suggest? TIA ~| Want to reach the ColdFusion community with something they want? Let them know on the House of Fusion mailing lists Archive: http://www.houseoffusion.com/groups/cf-talk/message.cfm/messageid:322779 Subscription: http://www.houseoffusion.com/groups/cf-talk/subscribe.cfm Unsubscribe: http://www.houseoffusion.com/cf_lists/unsubscribe.cfm?user=11502.10531.4
RE: Parsing large CSVs and inserting in MSSQL
Jason, This is not what you want to hear but My advice would be don't use CF to do this. DTS (or SSIS) is quite capable of doing this in about 1/30 the time that CF is able to do this. One of our DTS packages consumes a file of over 300k rows in less than 15 seconds every few minutes. You can't get that with CF no matter how hard you try :) But if you insist on doing it with CF, don't use CFFILE. Instead read the lines in one at a time using a line reader technique - that will allow you to handle one line at a time instead of reding the entire file into memory. -Mark Mark A. Kruger, CFG, MCSE (402) 408-3733 ext 105 www.cfwebtools.com www.coldfusionmuse.com www.necfug.com -Original Message- From: Jason Durham [mailto:jdur...@cti-stl.com] Sent: Tuesday, May 26, 2009 10:53 AM To: cf-talk Subject: Parsing large CSVs and inserting in MSSQL I have been tasked with providing a mechanism to upload/parse/perform a couple of checks/save CSV data. The current test file is 28,000 records (~16 columns). I expect this file to grow in size each month, to an indeterminate number (hundreds of thousands). It seems that I'll need to set a limit on the number of rows being processed at each time. The two obvious options (in my eyes)... #1 break the CSV up into several smaller CSV files and write them back to the file system before processing each individually (all checks performed in memory) #2 set a large timeout value and attempt to load all of the records from the CSV into a temp table, retrieving back smaller, more manageable record sets Currently, I'm doing #2 but not limiting the number of records (28,000). The data is getting into the temp table, being retrieved, but the final insert is causing ColdFusion to hang. The final insert consists of one cfquery with an internal loop over the temp table query object. The final insert is within a single transaction so I can rollback if it fails. This process will only take place on my development machine, connecting to the live database. 2.6 Core 2 Duo, 4GB, Vista 32 with a 100MB connection to production server running MS SQL 2005 Express. Perhaps I need to take a step back and consider alternatives. Btw, the checks I'm performing are simply that 2 of the columns do not contain a unique value (FKs). I'm currently doing this with QoQs but I could probably push this off with a fairly complicated query to MS SQL. What do the CF wizards suggest? TIA ~| Want to reach the ColdFusion community with something they want? Let them know on the House of Fusion mailing lists Archive: http://www.houseoffusion.com/groups/cf-talk/message.cfm/messageid:322780 Subscription: http://www.houseoffusion.com/groups/cf-talk/subscribe.cfm Unsubscribe: http://www.houseoffusion.com/cf_lists/unsubscribe.cfm?user=11502.10531.4
Re: Parsing large CSVs and inserting in MSSQL
Jason Durham wrote: What do the CF wizards suggest? Push this work into the database using stored procedures and bulk loaders. Use ColdFusion to get the file, move it somewhere the database bulk loader can get to it, fire off the stored procedure, and notify the user of any results or problems. Databases are optimized for handling large sets of data. ColdFusion is optimized to present that data over the intranet to clients. ~| Want to reach the ColdFusion community with something they want? Let them know on the House of Fusion mailing lists Archive: http://www.houseoffusion.com/groups/cf-talk/message.cfm/messageid:322781 Subscription: http://www.houseoffusion.com/groups/cf-talk/subscribe.cfm Unsubscribe: http://www.houseoffusion.com/cf_lists/unsubscribe.cfm?user=11502.10531.4
Re: Parsing large CSVs and inserting in MSSQL
Mark Kruger wrote: Jason, But if you insist on doing it with CF, don't use CFFILE. Instead read the lines in one at a time using a line reader technique - that will allow you to handle one line at a time instead of reding the entire file into memory. I also recommended pushing the heavy lifting into the database. But I did want to point out that in ColdFusion 8, you can now use the cfloop file= tag to get a CFML line reader, so one does not need to go to the Java for this anymore. ~| Want to reach the ColdFusion community with something they want? Let them know on the House of Fusion mailing lists Archive: http://www.houseoffusion.com/groups/cf-talk/message.cfm/messageid:322782 Subscription: http://www.houseoffusion.com/groups/cf-talk/subscribe.cfm Unsubscribe: http://www.houseoffusion.com/cf_lists/unsubscribe.cfm?user=89.70.4
Re: Parsing large CSVs and inserting in MSSQL
My advice would be don't use CF to do this. DTS (or SSIS) is quite capable of doing this in about 1/30 the time that CF is able to do this. One of our DTS packages consumes a file of over 300k rows in less than 15 seconds every few minutes. You can't get that with CF no matter how hard you try :) +1 Using dts/ssis, or even your basic bulk insert query, would certainly be faster than looping via cf. ~| Want to reach the ColdFusion community with something they want? Let them know on the House of Fusion mailing lists Archive: http://www.houseoffusion.com/groups/cf-talk/message.cfm/messageid:322785 Subscription: http://www.houseoffusion.com/groups/cf-talk/subscribe.cfm Unsubscribe: http://www.houseoffusion.com/cf_lists/unsubscribe.cfm?user=89.70.4
RE: Parsing large CSVs and inserting in MSSQL
Yes... I was going to mention that but I didn't want to muddy the waters :) Mark A. Kruger, CFG, MCSE (402) 408-3733 ext 105 www.cfwebtools.com www.coldfusionmuse.com www.necfug.com -Original Message- From: Ian Skinner [mailto:h...@ilsweb.com] Sent: Tuesday, May 26, 2009 11:10 AM To: cf-talk Subject: Re: Parsing large CSVs and inserting in MSSQL Mark Kruger wrote: Jason, But if you insist on doing it with CF, don't use CFFILE. Instead read the lines in one at a time using a line reader technique - that will allow you to handle one line at a time instead of reding the entire file into memory. I also recommended pushing the heavy lifting into the database. But I did want to point out that in ColdFusion 8, you can now use the cfloop file= tag to get a CFML line reader, so one does not need to go to the Java for this anymore. ~| Want to reach the ColdFusion community with something they want? Let them know on the House of Fusion mailing lists Archive: http://www.houseoffusion.com/groups/cf-talk/message.cfm/messageid:322786 Subscription: http://www.houseoffusion.com/groups/cf-talk/subscribe.cfm Unsubscribe: http://www.houseoffusion.com/cf_lists/unsubscribe.cfm?user=89.70.4
Re: Parsing large CSVs and inserting in MSSQL
+2 for DTS/SSIS Case in point. Back in the CF 5 (or 6) days I tried to import a 20+ meg MLS xml file into SQL server using CF (Yeah I know... I had no idea at the time). The first run just about killed my dev server. After 30-60 seconds the task manager showed that the machine was using 750 megs ram (It had 512 installed). After several tries with PHP and (I think) VB script I ended up writing a little DTS using ActiveX and it went from 10 min to a blip on the task manager. It took like 5 seconds. That left a very lasting impression on me. G! On Tue, May 26, 2009 at 12:17 PM, C S cfsearch...@yahoo.com wrote: My advice would be don't use CF to do this. DTS (or SSIS) is quite capable of doing this in about 1/30 the time that CF is able to do this. One of our DTS packages consumes a file of over 300k rows in less than 15 seconds every few minutes. You can't get that with CF no matter how hard you try :) +1 Using dts/ssis, or even your basic bulk insert query, would certainly be faster than looping via cf. ~| Want to reach the ColdFusion community with something they want? Let them know on the House of Fusion mailing lists Archive: http://www.houseoffusion.com/groups/cf-talk/message.cfm/messageid:322791 Subscription: http://www.houseoffusion.com/groups/cf-talk/subscribe.cfm Unsubscribe: http://www.houseoffusion.com/cf_lists/unsubscribe.cfm?user=11502.10531.4
RE: Parsing large CSVs and inserting in MSSQL
Thanks to everybody who has replied. It appears I need to do some homework on MS SQL. :) -Original Message- From: Mark Kruger [mailto:mkru...@cfwebtools.com] Sent: Tuesday, May 26, 2009 11:03 AM To: cf-talk Subject: RE: Parsing large CSVs and inserting in MSSQL Jason, This is not what you want to hear but My advice would be don't use CF to do this. DTS (or SSIS) is quite capable of doing this in about 1/30 the time that CF is able to do this. One of our DTS packages consumes a file of over 300k rows in less than 15 seconds every few minutes. You can't get that with CF no matter how hard you try :) But if you insist on doing it with CF, don't use CFFILE. Instead read the lines in one at a time using a line reader technique - that will allow you to handle one line at a time instead of reding the entire file into memory. -Mark Mark A. Kruger, CFG, MCSE (402) 408-3733 ext 105 www.cfwebtools.com www.coldfusionmuse.com www.necfug.com -Original Message- From: Jason Durham [mailto:jdur...@cti-stl.com] Sent: Tuesday, May 26, 2009 10:53 AM To: cf-talk Subject: Parsing large CSVs and inserting in MSSQL I have been tasked with providing a mechanism to upload/parse/perform a couple of checks/save CSV data. The current test file is 28,000 records (~16 columns). I expect this file to grow in size each month, to an indeterminate number (hundreds of thousands). It seems that I'll need to set a limit on the number of rows being processed at each time. The two obvious options (in my eyes)... #1 break the CSV up into several smaller CSV files and write them back to the file system before processing each individually (all checks performed in memory) #2 set a large timeout value and attempt to load all of the records from the CSV into a temp table, retrieving back smaller, more manageable record sets Currently, I'm doing #2 but not limiting the number of records (28,000). The data is getting into the temp table, being retrieved, but the final insert is causing ColdFusion to hang. The final insert consists of one cfquery with an internal loop over the temp table query object. The final insert is within a single transaction so I can rollback if it fails. This process will only take place on my development machine, connecting to the live database. 2.6 Core 2 Duo, 4GB, Vista 32 with a 100MB connection to production server running MS SQL 2005 Express. Perhaps I need to take a step back and consider alternatives. Btw, the checks I'm performing are simply that 2 of the columns do not contain a unique value (FKs). I'm currently doing this with QoQs but I could probably push this off with a fairly complicated query to MS SQL. What do the CF wizards suggest? TIA ~| Want to reach the ColdFusion community with something they want? Let them know on the House of Fusion mailing lists Archive: http://www.houseoffusion.com/groups/cf-talk/message.cfm/messageid:322792 Subscription: http://www.houseoffusion.com/groups/cf-talk/subscribe.cfm Unsubscribe: http://www.houseoffusion.com/cf_lists/unsubscribe.cfm?user=89.70.4
RE: Parsing large CSVs and inserting in MSSQL
Jason, Here's a big uh oh for you If you are using MSSQL express you don't have the SSIS services that come with it. DTS may be possible through installing the active X controls on the same server - but you may be out of luck if you want to do SSIS. -Mark Mark A. Kruger, CFG, MCSE (402) 408-3733 ext 105 www.cfwebtools.com www.coldfusionmuse.com www.necfug.com -Original Message- From: Jason Durham [mailto:jdur...@cti-stl.com] Sent: Tuesday, May 26, 2009 1:27 PM To: cf-talk Subject: RE: Parsing large CSVs and inserting in MSSQL Thanks to everybody who has replied. It appears I need to do some homework on MS SQL. :) -Original Message- From: Mark Kruger [mailto:mkru...@cfwebtools.com] Sent: Tuesday, May 26, 2009 11:03 AM To: cf-talk Subject: RE: Parsing large CSVs and inserting in MSSQL Jason, This is not what you want to hear but My advice would be don't use CF to do this. DTS (or SSIS) is quite capable of doing this in about 1/30 the time that CF is able to do this. One of our DTS packages consumes a file of over 300k rows in less than 15 seconds every few minutes. You can't get that with CF no matter how hard you try :) But if you insist on doing it with CF, don't use CFFILE. Instead read the lines in one at a time using a line reader technique - that will allow you to handle one line at a time instead of reding the entire file into memory. -Mark Mark A. Kruger, CFG, MCSE (402) 408-3733 ext 105 www.cfwebtools.com www.coldfusionmuse.com www.necfug.com -Original Message- From: Jason Durham [mailto:jdur...@cti-stl.com] Sent: Tuesday, May 26, 2009 10:53 AM To: cf-talk Subject: Parsing large CSVs and inserting in MSSQL I have been tasked with providing a mechanism to upload/parse/perform a couple of checks/save CSV data. The current test file is 28,000 records (~16 columns). I expect this file to grow in size each month, to an indeterminate number (hundreds of thousands). It seems that I'll need to set a limit on the number of rows being processed at each time. The two obvious options (in my eyes)... #1 break the CSV up into several smaller CSV files and write them back to the file system before processing each individually (all checks performed in memory) #2 set a large timeout value and attempt to load all of the records from the CSV into a temp table, retrieving back smaller, more manageable record sets Currently, I'm doing #2 but not limiting the number of records (28,000). The data is getting into the temp table, being retrieved, but the final insert is causing ColdFusion to hang. The final insert consists of one cfquery with an internal loop over the temp table query object. The final insert is within a single transaction so I can rollback if it fails. This process will only take place on my development machine, connecting to the live database. 2.6 Core 2 Duo, 4GB, Vista 32 with a 100MB connection to production server running MS SQL 2005 Express. Perhaps I need to take a step back and consider alternatives. Btw, the checks I'm performing are simply that 2 of the columns do not contain a unique value (FKs). I'm currently doing this with QoQs but I could probably push this off with a fairly complicated query to MS SQL. What do the CF wizards suggest? TIA ~| Want to reach the ColdFusion community with something they want? Let them know on the House of Fusion mailing lists Archive: http://www.houseoffusion.com/groups/cf-talk/message.cfm/messageid:322793 Subscription: http://www.houseoffusion.com/groups/cf-talk/subscribe.cfm Unsubscribe: http://www.houseoffusion.com/cf_lists/unsubscribe.cfm?user=11502.10531.4
RE: Parsing large CSVs and inserting in MSSQL
I have a dormant 'roided out MS SQL 2008 Standard box I can use in lieu. In fact, the plan was to migrate to that after I've built this module. :) Thanks for catching that (and for your help). Jason -Original Message- From: Mark Kruger [mailto:mkru...@cfwebtools.com] Sent: Tuesday, May 26, 2009 1:37 PM To: cf-talk Subject: RE: Parsing large CSVs and inserting in MSSQL Jason, Here's a big uh oh for you If you are using MSSQL express you don't have the SSIS services that come with it. DTS may be possible through installing the active X controls on the same server - but you may be out of luck if you want to do SSIS. -Mark Mark A. Kruger, CFG, MCSE (402) 408-3733 ext 105 www.cfwebtools.com www.coldfusionmuse.com www.necfug.com -Original Message- From: Jason Durham [mailto:jdur...@cti-stl.com] Sent: Tuesday, May 26, 2009 1:27 PM To: cf-talk Subject: RE: Parsing large CSVs and inserting in MSSQL Thanks to everybody who has replied. It appears I need to do some homework on MS SQL. :) -Original Message- From: Mark Kruger [mailto:mkru...@cfwebtools.com] Sent: Tuesday, May 26, 2009 11:03 AM To: cf-talk Subject: RE: Parsing large CSVs and inserting in MSSQL Jason, This is not what you want to hear but My advice would be don't use CF to do this. DTS (or SSIS) is quite capable of doing this in about 1/30 the time that CF is able to do this. One of our DTS packages consumes a file of over 300k rows in less than 15 seconds every few minutes. You can't get that with CF no matter how hard you try :) But if you insist on doing it with CF, don't use CFFILE. Instead read the lines in one at a time using a line reader technique - that will allow you to handle one line at a time instead of reding the entire file into memory. -Mark Mark A. Kruger, CFG, MCSE (402) 408-3733 ext 105 www.cfwebtools.com www.coldfusionmuse.com www.necfug.com -Original Message- From: Jason Durham [mailto:jdur...@cti-stl.com] Sent: Tuesday, May 26, 2009 10:53 AM To: cf-talk Subject: Parsing large CSVs and inserting in MSSQL I have been tasked with providing a mechanism to upload/parse/perform a couple of checks/save CSV data. The current test file is 28,000 records (~16 columns). I expect this file to grow in size each month, to an indeterminate number (hundreds of thousands). It seems that I'll need to set a limit on the number of rows being processed at each time. The two obvious options (in my eyes)... #1 break the CSV up into several smaller CSV files and write them back to the file system before processing each individually (all checks performed in memory) #2 set a large timeout value and attempt to load all of the records from the CSV into a temp table, retrieving back smaller, more manageable record sets Currently, I'm doing #2 but not limiting the number of records (28,000). The data is getting into the temp table, being retrieved, but the final insert is causing ColdFusion to hang. The final insert consists of one cfquery with an internal loop over the temp table query object. The final insert is within a single transaction so I can rollback if it fails. This process will only take place on my development machine, connecting to the live database. 2.6 Core 2 Duo, 4GB, Vista 32 with a 100MB connection to production server running MS SQL 2005 Express. Perhaps I need to take a step back and consider alternatives. Btw, the checks I'm performing are simply that 2 of the columns do not contain a unique value (FKs). I'm currently doing this with QoQs but I could probably push this off with a fairly complicated query to MS SQL. What do the CF wizards suggest? TIA ~| Want to reach the ColdFusion community with something they want? Let them know on the House of Fusion mailing lists Archive: http://www.houseoffusion.com/groups/cf-talk/message.cfm/messageid:322794 Subscription: http://www.houseoffusion.com/groups/cf-talk/subscribe.cfm Unsubscribe: http://www.houseoffusion.com/cf_lists/unsubscribe.cfm?user=11502.10531.4