This is a long one, mostly about some behavior I have found using PostgreSQL
with VFP, and partly hoping someone has come up with a better solution than
the one I came up with off-the-cuff...
For many years I have been pulling raw data off various Unix and
Linux minicomputers, and converting the contents of the text files into data
used to populate/update various VFP tables. I also have several commercial
accounts that need certain data updated daily, which I peel out of
some of my VFP tables, place into a .csv file, and upload to their various
ftp Servers (all AutoMagically, of course).
Over the years every time I get a new commercial client I hear them
grumbling about how long it takes for them to import the .csv files using
the tools that come with MS SQL Server, Oracle, MySQL, etc. Just on a hunch
the first time I heard that grousing, I suggested they write their own code
to handle the import process. The first person had his programmers do that,
and was very pleased with the results. So, each time I got a new client I
would suggest they cut their own code to manage the .csv file import as
opposed to doing a Data File Import with whatever tools came with their SQL
database.
Well, now that I am playing with PostgreSQL a bit more (I had a little
downtime today, and chose to spend it on the PostgreSQL "learning
project"), I have apparently run into much the same issue these folks did.
Every month I pull down the entire FTC Do Not Call list as .csv
files. The FTC provides these files separated by Area Code, and each
Area Code file can contain several million records. I normally process
them for my clients, using only the Area Codes they need, and update their
DoNotCall tables, and any affected Customer.dbf phone number DoNotCall
fields as needed.
When using VFP, the import via APPEND FROM <FileName> CSV was a bit
pokey. After the records were imported (millions of records) I would
replace the cFullPhone field with cAreaCode + cPhoneNum. No big deal, but
it took a while for the 2,000,000 to 5,000,000 records to process. It was
never enough grief to bother timing, but it seems to have taken
a few hours. I would just do other things between file imports, and while
the REPLACE process was running.
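For reference, that old VFP-only pass boiled down to something like this (a
sketch; the path is made up, and I am assuming the .dbf already has the three
fields described below):

  * Bulk-append the raw .csv, then build the combined phone field
  USE donotcall IN 0
  SELECT donotcall
  APPEND FROM C:\DNC\217.csv CSV
  REPLACE ALL cFullPhone WITH cAreaCode + cPhoneNum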
Quite some time ago I was able to import all Area Codes into a .dbf file,
but in the past 3 years or so I ran into the 2Gb barrier. So, I figured
once I got my head around some large-scale database solution I would simply
import all the text files into a SQL database. That process began today. I
set up a connection into my PostgreSQL database Server, and used pgAdmin to
create a donotcall table with three character fields:
cAreaCode c(3),
cPhoneNum c(7), and
cFullPhone c(10),
plus a Primary Key based on cAreaCode + cPhoneNum. From there I
created a new DoNotCall project, and built an updateable remote view against
the PostgreSQL table. Then I began to import the 217 Area Code .csv file via
APPEND FROM <FileName> CSV.
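For anyone curious, the pgAdmin table definition was roughly equivalent to
this (a sketch sent through VFP's SQL pass-through; the "pgDNC" DSN name is
made up, and I am assuming PostgreSQL's char(n) type):

  lnConn = SQLCONNECT("pgDNC")   && hypothetical ODBC DSN to the PG Server
  =SQLEXEC(lnConn, "CREATE TABLE donotcall (" + ;
      "careacode  char(3), "  + ;
      "cphonenum  char(7), "  + ;
      "cfullphone char(10), " + ;
      "PRIMARY KEY (careacode, cphonenum))")
  =SQLDISCONNECT(lnConn)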
Geeeez! I could not believe how daggone slow the import was into the remote
view, and for only 547,000 records! I knew it would be slower than a direct
VFP table hit, but this was awful. I ended up killing the process as it
would have taken at least 2 hours to import all 547,000 records at the rate
it was going, and that was before I had to concatenate the Area Code and Base
Phone # fields to populate the cFullPhone field. I set the cursor buffering
to table-level buffering, and the import process ran far faster, since a
commit was no longer being fired with each newly appended record. But I
would still have had to do a table update at some point, and that final
update looked to be where the greatest time delay would come from.
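For anyone wanting to try it, that buffering change amounts to one call (5 =
optimistic table buffering; the view alias is illustrative):

  SET MULTILOCKS ON              && required before table buffering
  =CURSORSETPROP("Buffering", 5, "v_donotcall")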
So, I did what I had been "blindly" suggesting to my commercial vendors all
this time. I rolled my own. I put together a real simple program that
would open the remote view (empty of records), then open the Source .csv
file, read each line one at a time, break the data values out of the
line, create the cFullPhone value on the fly, then APPEND
BLANK in the remote view, replace the view fields with the 3 data values,
commit via =TABLEUPDATE(0, .T.), and loop until FEOF().
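The heart of it looks roughly like this (a sketch; the path and view alias
are made up, and I am assuming each source line looks like 217,8858655):

  * Line-by-line import into the remote view
  lnHandle = FOPEN("C:\DNC\217.csv")
  USE v_donotcall IN 0           && the empty updateable remote view
  SELECT v_donotcall
  DO WHILE NOT FEOF(lnHandle)
     lcLine  = FGETS(lnHandle)
     lcArea  = GETWORDNUM(lcLine, 1, ",")
     lcPhone = GETWORDNUM(lcLine, 2, ",")
     APPEND BLANK
     REPLACE cAreaCode  WITH lcArea, ;
             cPhoneNum  WITH lcPhone, ;
             cFullPhone WITH lcArea + lcPhone
     =TABLEUPDATE(0, .T.)        && commit this one row
  ENDDO
  =FCLOSE(lnHandle)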
It still took 59 minutes to get the file imported (SET TALK OFF), but I only
had to make a single pass, and now have a good idea of where the performance
bottleneck is [TABLEUPDATE()]. I figure if I put all the Area Code files
into an array, and process each Area Code source file once, it will take
a few days to get all 50 states and US territories done. If I went a step
further and changed the Area Code text files from [217,8858655] to
[2178858655] (strip the commas in the source files), then I could do just a
straight import/commit line by line without replacing 3 field values (no
longer use cAreaCode and cPhoneNum, use only cFullPhone). That could save
me a bit more time yet.
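That comma-stripping pass could be as simple as this (a sketch; paths are
made up, and the whole file is read into memory, which should be fine at
roughly a dozen bytes per record):

  * Turn each [217,8858655] line into [2178858655]
  lcRaw = FILETOSTR("C:\DNC\217.csv")
  =STRTOFILE(STRTRAN(lcRaw, ",", ""), "C:\DNC\217_flat.txt")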
So, how are other folks handling large text/csv file imports into remote
updateable views? I am certainly curious to see if this is a common issue
with this kind of data processing, or if I am the only one running into it.
Now for PC tech specs: I am hosting the PostgreSQL Server on a Dell SC430
Server with 1Gb RAM and a 3.0 GHz P4 Dual Core CPU, and the PostgreSQL
database is on a 10k rpm Ultra-Wide SCSI HDD. The Source Text files are on
my ftp Server, running on a fast IDE HDD that is lightly used. I am
processing the data on an AMD 2.9 GHz CPU with 2Gb RAM using VFP9. I use a
very fast HP 100VG LAN, so data packet collisions are not an issue.
Thanks!
Gil
Gilbert M. Hale
New Freedom Data Resources
Pittsford, NY
585-359-8085 - Office (Rolls To Cellular)
585-202-4341 - Cellular/VoiceMail
[EMAIL PROTECTED]