What would be the fastest way to remove lines from a medium-size CSV file?
For example given a CSV:
Bill,0,0,0
Bill,1,1,1
Mary,0,0,0
Todd,0,0,0
etc..
And a list of names to delete:
Mary
Todd
Remove the lines with those names. The CSVs are anywhere from 20,000 to
100,000 lines long. The list of names to remove is typically 1 to 500
names long, and the same list is applied across ~50 different CSV files.
Reading one CSV into memory at a time is OK, but not all CSVs at once (the
real CSVs are larger).
The method I use now is straightforward: read the CSV and the list of names to
delete into arrays; for each CSV line, check whether its first field is in the
deletion list, and write the line out only if it is not.
The existing code in GNU Awk:
c = split(readfile(namefile), a, "\n")    # readfile() is from gawk's readfile extension
d = split(readfile(csvfile), b, "\n")
while (i++ < d) {
    if (b[i] == "") continue
    split(b[i], g, ",")
    mark = j = 0
    while (j++ < c) {
        if (a[j] == "") continue
        regname = "^" regesc2(a[j]) "$"   # regex match first field (regesc2/strip are local helpers)
        if (strip(g[1]) ~ regname) {
            mark = 1                      # in delete list - skip
            break
        }
    }
    if (!mark) {                          # not in delete list - write out line
        print b[i]
    }
}
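For reference, here is a sketch of the same per-line membership check written with awk's associative arrays, so the inner regex loop becomes a single hash lookup. The file names are placeholders, and it assumes first fields match the names exactly (no surrounding whitespace to strip):

```shell
# Build the sample inputs from the question (names are my placeholders).
cat > names.txt <<'EOF'
Mary
Todd
EOF
cat > data.csv <<'EOF'
Bill,0,0,0
Bill,1,1,1
Mary,0,0,0
Todd,0,0,0
EOF

# NR==FNR is true only while reading the first file: load names into a hash.
# For the second file, print a line only if its first field is not in the hash.
awk -F, 'NR==FNR { if ($0 != "") del[$0]; next }
         !($1 in del)' names.txt data.csv
```

This makes each line's check O(1) instead of scanning up to 500 names, at the cost of losing the regex/strip flexibility of the version above.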
Looking for new method ideas before I rewrite in Nim.