Hello! I do not know where to start to put together a grep pattern that will parse a 375,000+ line csv file in a specific way for a personal project. The file is publicly available from the NY Times and contains data related to tracking Covid-19 <https://github.com/nytimes/covid-19-data/blob/master/us-counties.csv> cases and deaths. Each line consists of six fields:
* a date * a county name * a state name * FIPS number for the county * cumulative number of Covid-19 cases reported in that county as of that date * cumulative number of Covid-19 deaths reported in that county as of that date I am trying to extract the date of the first reported Covid-19 death for each county in the US. This would be the first time that the last field is greater than 0 for unique county+state combinations. About 1/3 of all counties still have no reported deaths. Those counties should not be found be found by the pattern. In round numbers there are about 3100 counties in the report. Roughly 2000 have reported at least one Covid-19 death. I am trying to find those specific 2000ish counties and including the date of the first reported death. The pattern needs to find the first instance of unique combinations of county & state where the last comma-delimited field is greater than zero. The pattern will be used by BBedit in an Applescript. The Applescript execution needs to be as brief as possible so as not to commandeer my laptop for a meaningful chunk of time. The script will be run at least once each week. By iterating through a list of counties I made Applescript, BBedit, & Numbers do this but as-built the execution took too long, is prone to running out of memory, and otherwise makes my laptop close to unusable while it runs. An efficient grep pattern would be incredibly helpful. Here is a handcrafted sample of the file: date,county,state,fips,cases,deaths 2020-01-21,Snohomish,Washington,53061,1,0 2020-01-22,Snohomish,Washington,53061,1,0 2020-01-23,Snohomish,Washington,53061,1,0 2020-01-24,Cook,Illinois,17031,1,0 2020-01-24,Snohomish,Washington,53061,1,0 2020-01-25,Orange,California,06059,1,0 2020-01-25,Cook,Illinois,17031,1,0 2020-01-25,Snohomish,Washington,53061,1,0 2020-01-26,Maricopa,Arizona,04013,1,0 2020-01-26,Los Angeles,California,06037,1,0 2020-01-26,Orange,California,06059,1,0 2020-01-26,Cook,Illinois,17031,1,0 2020-03-03,Wake,North Carolina,37183,1,0 2020-03-04,Wake,North Carolina,37183,1,0 2020-03-09,Union,New Jersey,34039,1,0 2020-03-21,Essex,Massachusetts,25009,41,0 2020-03-22,Essex,Massachusetts,25009,60,0 2020-03-23,Essex,Massachusetts,25009,73,1 2020-03-24,Essex,Massachusetts,25009,118,1 2020-03-24,Union,New Jersey,34039,246,2 2020-03-25,Essex,Massachusetts,25009,177,1 2020-04-09,Union,New Jersey,34039,5203,145 2020-04-12,Essex,Massachusetts,25009,3170,102 2020-04-13,Essex,Massachusetts,25009,3413,114 2020-04-15,Wake,North Carolina,37183,510,1 2020-05-06,Union,New Jersey,34039,13604,800 2020-05-07,Union,New Jersey,34039,13781,829 2020-05-08,Union,New Jersey,34039,13917,844 2020-06-30,Wake,North Carolina,37183,5178,47 2020-07-01,Wake,North Carolina,37183,5379,48 2020-07-02,Wake,North Carolina,37183,5590,49 >From that sample the grep pattern needs to find exactly these three lines: 2020-03-23,Essex,Massachusetts,25009,73,1 <- 1st occurrence of Essex, Massachusetts w/ last field great than zero 0 2020-03-24,Union,New Jersey,34039,246,2 <- 1st occurrence of Union, New Jersey w/ last field great than zero 0 2020-04-15,Wake,North Carolina,37183,510,1 <- 1st occurrence of Wake, North Carolina w/ last field great than zero 0 And report them as 2020-03-23,Essex,Massachusetts 2020-03-24,Union,New Jersey 2020-04-15,Wake,North Carolina There are other counties in the sample but none of the others have a 6th comma-delimited field with a value greater than 0. Those counties should not be found. That said, an alternative report is perfectly workable if it is easier to generate (as shown below but without the descriptive comments). It is hard for me to see how this alternative report would be easier to generate, but grep has plenty of other qualities i would not imagine. 2020-01-25,Snohomish,Washington,0 <- most recent date reported for county where last comma-delimited field = 0 2020-01-26,Cook,Illinois,0 <- most recent date reported for county where last comma-delimited field = 0 2020-01-26,Los Angeles,California,0 <- most recent date reported for county where last comma-delimited field = 0 2020-01-26,Maricopa,Arizona,0 <- most recent date reported for county where last comma-delimited field = 0 2020-01-26,Orange,California,0 <- most recent date reported for county where last comma-delimited field = 0 2020-03-23,Essex,Massachusetts,1 <- number of deaths first reported now included in report (almost always 1) 2020-03-24,Union,New Jersey,2 <- number of deaths first reported now included in report (almost always 1, 2 shown here to cover the case of more than 1) 2020-04-15,Wake,North Carolina,1 <- number of deaths first reported now included in report (almost always 1) I can build grep patterns for really easy stuff but do not know how to approach this. Is it even possible? Any help you provide will be appreciated. Thank you for taking the time to think about this question. Lewis -- This is the BBEdit Talk public discussion group. If you have a feature request or need technical support, please email "[email protected]" rather than posting here. Follow @bbedit on Twitter: <https://twitter.com/bbedit> --- You received this message because you are subscribed to the Google Groups "BBEdit Talk" group. To unsubscribe from this group and stop receiving emails from it, send an email to [email protected]. To view this discussion on the web visit https://groups.google.com/d/msgid/bbedit/455abe22-f5d7-411c-9615-bdc7968f3256o%40googlegroups.com.
