grep pattern to find special combinations of fields in a large csv file

Lewis Downey Wed, 29 Jul 2020 06:56:28 -0700

Hello!

I do not know where to start to put together a grep pattern that will parse 
a 375,000+ line csv file in a specific way for a personal project. The file 
is publicly available from the NY Times and contains data related to 
tracking Covid-19 
<https://github.com/nytimes/covid-19-data/blob/master/us-counties.csv> 
cases and deaths. Each line consists of six fields:


* a date 
* a county name 
* a state name 
* FIPS number for the county 
* cumulative number of Covid-19 cases reported in that county as of that 
date
* cumulative number of Covid-19 deaths reported in that county as of that 
date

I am trying to extract the date of the first reported Covid-19 death for 
each county in the US. This would be the first time that the last field is 
greater than 0 for unique county+state combinations. About 1/3 of all 
counties still have no reported deaths. Those counties should not be found 
be found by the pattern. In round numbers there are about 3100 counties in 
the report. Roughly 2000 have reported at least one Covid-19 death. I am 
trying to find those specific 2000ish counties and including the date of 
the first reported death.

The pattern needs to find the first instance of unique combinations of 
county & state where the last comma-delimited field is greater than zero. 
The pattern will be used by BBedit in an Applescript.  The Applescript 
execution needs to be as brief as possible so as not to commandeer my 
laptop for a meaningful chunk of time. The script will be run at least once 
each week. By iterating through a list of counties I made Applescript, 
BBedit, & Numbers do this but as-built the execution took too long, is 
prone to running out of memory, and otherwise makes my laptop close to 
unusable while it runs. An efficient grep pattern would be incredibly 
helpful.

Here is a handcrafted sample of the file:
date,county,state,fips,cases,deaths
2020-01-21,Snohomish,Washington,53061,1,0
2020-01-22,Snohomish,Washington,53061,1,0
2020-01-23,Snohomish,Washington,53061,1,0
2020-01-24,Cook,Illinois,17031,1,0
2020-01-24,Snohomish,Washington,53061,1,0
2020-01-25,Orange,California,06059,1,0
2020-01-25,Cook,Illinois,17031,1,0
2020-01-25,Snohomish,Washington,53061,1,0
2020-01-26,Maricopa,Arizona,04013,1,0
2020-01-26,Los Angeles,California,06037,1,0
2020-01-26,Orange,California,06059,1,0
2020-01-26,Cook,Illinois,17031,1,0
2020-03-03,Wake,North Carolina,37183,1,0
2020-03-04,Wake,North Carolina,37183,1,0
2020-03-09,Union,New Jersey,34039,1,0
2020-03-21,Essex,Massachusetts,25009,41,0
2020-03-22,Essex,Massachusetts,25009,60,0
2020-03-23,Essex,Massachusetts,25009,73,1
2020-03-24,Essex,Massachusetts,25009,118,1
2020-03-24,Union,New Jersey,34039,246,2
2020-03-25,Essex,Massachusetts,25009,177,1
2020-04-09,Union,New Jersey,34039,5203,145
2020-04-12,Essex,Massachusetts,25009,3170,102
2020-04-13,Essex,Massachusetts,25009,3413,114
2020-04-15,Wake,North Carolina,37183,510,1
2020-05-06,Union,New Jersey,34039,13604,800
2020-05-07,Union,New Jersey,34039,13781,829
2020-05-08,Union,New Jersey,34039,13917,844
2020-06-30,Wake,North Carolina,37183,5178,47
2020-07-01,Wake,North Carolina,37183,5379,48
2020-07-02,Wake,North Carolina,37183,5590,49

>From that sample the grep pattern needs to find exactly these three lines:
2020-03-23,Essex,Massachusetts,25009,73,1    <- 1st occurrence of Essex, 
Massachusetts w/ last field great than zero 0
2020-03-24,Union,New Jersey,34039,246,2        <- 1st occurrence of Union, 
New Jersey w/ last field great than zero 0
2020-04-15,Wake,North Carolina,37183,510,1    <- 1st occurrence of Wake, 
North Carolina w/ last field great than zero 0

And report them as
2020-03-23,Essex,Massachusetts
2020-03-24,Union,New Jersey
2020-04-15,Wake,North Carolina

There are other counties in the sample but none of the others have a 6th 
comma-delimited field with a value greater than 0. Those counties should 
not be found. 
That said, an alternative report is perfectly workable if it is easier to 
generate (as shown below but without the descriptive comments). It is hard 
for me to see how this alternative report
would be easier to generate, but grep has plenty of other qualities i would 
not imagine.

2020-01-25,Snohomish,Washington,0        <- most recent date reported for 
county where last comma-delimited field = 0
2020-01-26,Cook,Illinois,0                       <- most recent date 
reported for county where last comma-delimited field = 0
2020-01-26,Los Angeles,California,0        <- most recent date reported for 
county where last comma-delimited field = 0
2020-01-26,Maricopa,Arizona,0            <- most recent date reported for 
county where last comma-delimited field = 0
2020-01-26,Orange,California,0            <- most recent date reported for 
county where last comma-delimited field = 0
2020-03-23,Essex,Massachusetts,1       <- number of deaths first reported 
now included in report (almost always 1)
2020-03-24,Union,New Jersey,2         <- number of deaths first reported 
now included in report (almost always 1, 2 shown here to cover the case of 
more than 1)       
2020-04-15,Wake,North Carolina,1       <- number of deaths first reported 
now included in report (almost always 1)

I can build grep patterns for really easy stuff but do not know how to 
approach this. Is it even possible? 

Any help you provide will be appreciated.  Thank you for taking the time to 
think about this question.

Lewis

-- 
This is the BBEdit Talk public discussion group. If you have a feature request 
or need technical support, please email "[email protected]" rather than 
posting here. Follow @bbedit on Twitter: <https://twitter.com/bbedit>
--- 
You received this message because you are subscribed to the Google Groups 
"BBEdit Talk" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to [email protected].
To view this discussion on the web visit 
https://groups.google.com/d/msgid/bbedit/455abe22-f5d7-411c-9615-bdc7968f3256o%40googlegroups.com.

grep pattern to find special combinations of fields in a large csv file

Reply via email to