Re: grep pattern to find special combinations of fields in a large csv file

face Wed, 29 Jul 2020 09:59:34 -0700

These things are so much easier in a database...with a simple MySQL table of


  cv19(recDate, county, state, FIPS, cases, deaths)

bulk load the data and execute the query

'select  recDate, county, state , count(deaths), min(deaths) from cv19 
where  deaths > 0 group by county, state order by recDate'

yields the date of the first non-zero count of deaths, the number of 
records where deaths are non-zero, and the number of deaths on the first 
reported date...

of course this doesn't answer the question of how to do this with grep or 
trivially with PERL.

On Wednesday, July 29, 2020 at 6:56:03 AM UTC-7, Lewis Downey wrote:
>
> Hello!
>
> I do not know where to start to put together a grep pattern that will 
> parse a 375,000+ line csv file in a specific way for a personal project. 
> The file is publicly available from the NY Times and contains data 
> related to tracking Covid-19 
> <https://github.com/nytimes/covid-19-data/blob/master/us-counties.csv> 
> cases and deaths. Each line consists of six fields:
>
> * a date 
> * a county name 
> * a state name 
> * FIPS number for the county 
> * cumulative number of Covid-19 cases reported in that county as of that 
> date
> * cumulative number of Covid-19 deaths reported in that county as of that 
> date
>
> I am trying to extract the date of the first reported Covid-19 death for 
> each county in the US. This would be the first time that the last field is 
> greater than 0 for unique county+state combinations. About 1/3 of all 
> counties still have no reported deaths. Those counties should not be found 
> be found by the pattern. In round numbers there are about 3100 counties in 
> the report. Roughly 2000 have reported at least one Covid-19 death. I am 
> trying to find those specific 2000ish counties and including the date of 
> the first reported death.
>
> The pattern needs to find the first instance of unique combinations of 
> county & state where the last comma-delimited field is greater than zero. 
> The pattern will be used by BBedit in an Applescript.  The Applescript 
> execution needs to be as brief as possible so as not to commandeer my 
> laptop for a meaningful chunk of time. The script will be run at least once 
> each week. By iterating through a list of counties I made Applescript, 
> BBedit, & Numbers do this but as-built the execution took too long, is 
> prone to running out of memory, and otherwise makes my laptop close to 
> unusable while it runs. An efficient grep pattern would be incredibly 
> helpful.
>
> Here is a handcrafted sample of the file:
> date,county,state,fips,cases,deaths
> 2020-01-21,Snohomish,Washington,53061,1,0
> 2020-01-22,Snohomish,Washington,53061,1,0
> 2020-01-23,Snohomish,Washington,53061,1,0
> 2020-01-24,Cook,Illinois,17031,1,0
> 2020-01-24,Snohomish,Washington,53061,1,0
> 2020-01-25,Orange,California,06059,1,0
> 2020-01-25,Cook,Illinois,17031,1,0
> 2020-01-25,Snohomish,Washington,53061,1,0
> 2020-01-26,Maricopa,Arizona,04013,1,0
> 2020-01-26,Los Angeles,California,06037,1,0
> 2020-01-26,Orange,California,06059,1,0
> 2020-01-26,Cook,Illinois,17031,1,0
> 2020-03-03,Wake,North Carolina,37183,1,0
> 2020-03-04,Wake,North Carolina,37183,1,0
> 2020-03-09,Union,New Jersey,34039,1,0
> 2020-03-21,Essex,Massachusetts,25009,41,0
> 2020-03-22,Essex,Massachusetts,25009,60,0
> 2020-03-23,Essex,Massachusetts,25009,73,1
> 2020-03-24,Essex,Massachusetts,25009,118,1
> 2020-03-24,Union,New Jersey,34039,246,2
> 2020-03-25,Essex,Massachusetts,25009,177,1
> 2020-04-09,Union,New Jersey,34039,5203,145
> 2020-04-12,Essex,Massachusetts,25009,3170,102
> 2020-04-13,Essex,Massachusetts,25009,3413,114
> 2020-04-15,Wake,North Carolina,37183,510,1
> 2020-05-06,Union,New Jersey,34039,13604,800
> 2020-05-07,Union,New Jersey,34039,13781,829
> 2020-05-08,Union,New Jersey,34039,13917,844
> 2020-06-30,Wake,North Carolina,37183,5178,47
> 2020-07-01,Wake,North Carolina,37183,5379,48
> 2020-07-02,Wake,North Carolina,37183,5590,49
>
> From that sample the grep pattern needs to find exactly these three lines:
> 2020-03-23,Essex,Massachusetts,25009,73,1    <- 1st occurrence of Essex, 
> Massachusetts w/ last field great than zero 0
> 2020-03-24,Union,New Jersey,34039,246,2        <- 1st occurrence of 
> Union, New Jersey w/ last field great than zero 0
> 2020-04-15,Wake,North Carolina,37183,510,1    <- 1st occurrence of Wake, 
> North Carolina w/ last field great than zero 0
>
> And report them as
> 2020-03-23,Essex,Massachusetts
> 2020-03-24,Union,New Jersey
> 2020-04-15,Wake,North Carolina
>
> There are other counties in the sample but none of the others have a 6th 
> comma-delimited field with a value greater than 0. Those counties should 
> not be found. 
> That said, an alternative report is perfectly workable if it is easier to 
> generate (as shown below but without the descriptive comments). It is hard 
> for me to see how this alternative report
> would be easier to generate, but grep has plenty of other qualities i 
> would not imagine.
>
> 2020-01-25,Snohomish,Washington,0        <- most recent date reported for 
> county where last comma-delimited field = 0
> 2020-01-26,Cook,Illinois,0                       <- most recent date 
> reported for county where last comma-delimited field = 0
> 2020-01-26,Los Angeles,California,0        <- most recent date reported 
> for county where last comma-delimited field = 0
> 2020-01-26,Maricopa,Arizona,0            <- most recent date reported for 
> county where last comma-delimited field = 0
> 2020-01-26,Orange,California,0            <- most recent date reported for 
> county where last comma-delimited field = 0
> 2020-03-23,Essex,Massachusetts,1       <- number of deaths first reported 
> now included in report (almost always 1)
> 2020-03-24,Union,New Jersey,2         <- number of deaths first reported 
> now included in report (almost always 1, 2 shown here to cover the case of 
> more than 1)       
> 2020-04-15,Wake,North Carolina,1       <- number of deaths first reported 
> now included in report (almost always 1)
>
> I can build grep patterns for really easy stuff but do not know how to 
> approach this. Is it even possible? 
>
> Any help you provide will be appreciated.  Thank you for taking the time 
> to think about this question.
>
> Lewis
>
>

-- 
This is the BBEdit Talk public discussion group. If you have a feature request 
or need technical support, please email "[email protected]" rather than 
posting here. Follow @bbedit on Twitter: <https://twitter.com/bbedit>
--- 
You received this message because you are subscribed to the Google Groups 
"BBEdit Talk" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to [email protected].
To view this discussion on the web visit 
https://groups.google.com/d/msgid/bbedit/5133d069-a8c5-4e64-a4a9-f98850ffcba4o%40googlegroups.com.

Re: grep pattern to find special combinations of fields in a large csv file

Reply via email to