What would be the fastest way to remove lines from a medium-size CSV file?
For example given a CSV:
Bill,0,0,0
Bill,1,1,1
Mary,0,0,0
Todd,0,0,0
etc..
And a list of names to delete:
Mary
Todd
Remove the lines with those names. The CSVs are anywhere from 20,000 to
100,000 lines long. The list of names to remove is typically 1 to 500
names long, and the same list is applied across ~50 different CSV files.
Reading one CSV into memory at a time is OK, but not all CSVs at once (the
real CSVs are larger).
The method I use now is straightforward: read the CSV and the list of names to
delete into arrays; for each CSV line, check whether its first field is in the
deletion list, and write the line out only if it is not.
The existing code in GNU Awk:
c = split(readfile(namefile), a, "\n")    # readfile() is from gawk's readfile extension
d = split(readfile(csvfile), b, "\n")
while (i++ < d) {
    if (b[i] == "") continue
    split(b[i], g, ",")
    mark = j = 0
    while (j++ < c) {
        if (a[j] == "") continue
        regname = "^" regesc2(a[j]) "$"   # regex match first field (regesc2/strip are local helpers)
        if (strip(g[1]) ~ regname) {
            mark = 1                      # in delete list - skip
            break
        }
    }
    if (!mark) {                          # not in delete list - write out line
        print b[i]
    }
}
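For reference, here is a sketch of the same per-line membership check written with awk's associative arrays, so the inner regex loop becomes a single hash lookup. The file names are placeholders, and it assumes first fields match the names exactly (no surrounding whitespace to strip):

```shell
# Build the sample inputs from the question (names are my placeholders).
cat > names.txt <<'EOF'
Mary
Todd
EOF
cat > data.csv <<'EOF'
Bill,0,0,0
Bill,1,1,1
Mary,0,0,0
Todd,0,0,0
EOF

# NR==FNR is true only while reading the first file: load names into a hash.
# For the second file, print a line only if its first field is not in the hash.
awk -F, 'NR==FNR { if ($0 != "") del[$0]; next }
         !($1 in del)' names.txt data.csv
```

This makes each line's check O(1) instead of scanning up to 500 names, at the cost of losing the regex/strip flexibility of the version above.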
Looking for new method ideas before I rewrite in Nim.