Michael Gerdau wrote:

_If_ I wanted to use DBD::CSV on large (i.e. 50,000 or more records)
CSV files, what would I have to do?


Is there a built-in limit on the number of records DBD::CSV can handle?

There is no built-in limit. Yours is the first report I've had of the problem. My guess is that in most situations, for files of that size, you'd be better off loading them into SQLite and doing your querying there.
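
As a minimal sketch of that SQLite route (assuming DBD::SQLite and Text::CSV_XS are installed, and using a hypothetical file data.csv with columns id and phrase -- adjust names to your data):

use strict;
use warnings;
use DBI;
use Text::CSV_XS;

# Hypothetical names: load ./data.csv into a SQLite database file data.db
my $csv_file = 'data.csv';

my $dbh = DBI->connect('dbi:SQLite:dbname=data.db', '', '',
                       { RaiseError => 1, AutoCommit => 0 });

$dbh->do('CREATE TABLE IF NOT EXISTS phrases (id INTEGER, phrase TEXT)');

my $csv = Text::CSV_XS->new({ binary => 1 });
open my $fh, '<', $csv_file or die "Can't open $csv_file: $!";

my $ins = $dbh->prepare('INSERT INTO phrases (id, phrase) VALUES (?, ?)');
while (my $row = $csv->getline($fh)) {
    $ins->execute(@$row);
}
close $fh;
$dbh->commit;    # a single transaction keeps the bulk load fast

# From here on, SQLite's own engine handles the queries.
my ($count) = $dbh->selectrow_array('SELECT COUNT(*) FROM phrases');
print "$count rows loaded\n";
$dbh->disconnect;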


If you do want to deal with them in DBD::CSV, the LIMIT clause is not the way to go. LIMIT first accumulates the entire result set in memory and then reduces it to the specified number (LIMIT is logically based on ORDER BY, and you need the entire result set before you can apply ORDER BY). The way to reduce the amount held in memory is to shrink the result set with WHERE clauses. For example, if you have a continuously incrementing id field and a clause WHERE id > $num AND id < $num+5000, your result set would be at most 5,000 records. With "SELECT COUNT(*) FROM x LIMIT 5000", the result set is the entire table, and the LIMIT is applied only after that whole set has been gathered.
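
Here is a minimal sketch of that chunked WHERE approach, assuming a hypothetical file big.csv with columns id,phrase and a continuously incrementing id, sitting in the current directory:

use strict;
use warnings;
use DBI;

# f_dir/f_ext map the table name "big" onto ./big.csv
my $dbh = DBI->connect('dbi:CSV:', undef, undef,
                       { f_dir => '.', f_ext => '.csv/r', RaiseError => 1 });

my $chunk = 5000;
my $start = 0;
while (1) {
    # the whole file is still scanned on each pass, but only the rows
    # matching the WHERE clause are kept in memory
    my $rows = $dbh->selectall_arrayref(
        'SELECT id, phrase FROM big WHERE id > ? AND id <= ?',
        undef, $start, $start + $chunk,
    );
    last unless @$rows;          # past the end of a gap-free id sequence
    print "@$_\n" for @$rows;    # or process the chunk however you like
    $start += $chunk;
}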

For SQL statements not using ORDER BY or aggregate functions like COUNT, you can use DBD::AnyData (which uses the same SQL parsing engine as DBD::CSV) to read lines one at a time without accumulating a result set. If you just replace "<DATA>" in the script below with an open filehandle, the script will go through the file one record at a time, evaluate it against the SQL, and then let you print (or otherwise process) that line. No more than a single line is ever held in memory.

use strict;
use warnings;
use DBI;

my $dbh = DBI->connect('dbi:AnyData:');
my($table,$cols) = ( 'tmp' , 'id,phrase' );
my $sql = "SELECT phrase FROM $table WHERE id > 1";
while (<DATA>) {
    # import this single CSV line as a one-row in-memory table
    $dbh->ad_import( $table, 'CSV', [$_], { col_names => $cols } );
    my @row = $dbh->selectrow_array($sql);
    print "@row\n" if @row;
    # clear the table so only one record is ever held in memory
    $dbh->ad_clear($table);
}
__DATA__
1,foo
2,bar
3,qux
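
To run the same loop against a real file instead of the inline __DATA__ section, just open a filehandle (the file name here is hypothetical):

open my $fh, '<', 'big.csv' or die "Can't open big.csv: $!";
while (<$fh>) {
    $dbh->ad_import( $table, 'CSV', [$_], { col_names => $cols } );
    my @row = $dbh->selectrow_array($sql);
    print "@row\n" if @row;
    $dbh->ad_clear($table);
}
close $fh;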

--
Jeff
