Re: grep pattern to find special combinations of fields in a large csv file

Neil Faiman Wed, 29 Jul 2020 07:16:13 -0700

AppleScript is wonderful as a glue language. As a data processing language, not 
so much.


BBEdit is happy to run any sort of script, not just AppleScript. This looks 
like it should be trivial in Perl.
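
For illustration, here is a minimal sketch of that kind of script, written in Python rather than Perl. It assumes the column names shown in the sample header (date, county, state, fips, cases, deaths) and that rows appear in chronological order, as they do in the sample. The function name first_deaths and the embedded SAMPLE (a subset of the rows from the original message) are only there for the example.

```python
import csv
import io


def first_deaths(rows):
    """Yield (date, county, state) for the first row of each county+state
    whose deaths field is greater than zero.

    Assumes rows arrive in chronological order, as in the sample.
    """
    seen = set()  # county+state pairs already reported
    for row in rows:
        key = (row["county"], row["state"])
        if key in seen:
            continue
        # Assumption: a blank or non-numeric deaths field counts as "no deaths".
        deaths = (row.get("deaths") or "").strip()
        if deaths.isdigit() and int(deaths) > 0:
            seen.add(key)
            yield row["date"], row["county"], row["state"]


# A few rows taken from the sample in the original message.
SAMPLE = """\
date,county,state,fips,cases,deaths
2020-01-26,Cook,Illinois,17031,1,0
2020-03-04,Wake,North Carolina,37183,1,0
2020-03-22,Essex,Massachusetts,25009,60,0
2020-03-23,Essex,Massachusetts,25009,73,1
2020-03-24,Essex,Massachusetts,25009,118,1
2020-03-24,Union,New Jersey,34039,246,2
2020-04-09,Union,New Jersey,34039,5203,145
2020-04-15,Wake,North Carolina,37183,510,1
2020-07-01,Wake,North Carolina,37183,5379,48
"""

if __name__ == "__main__":
    for record in first_deaths(csv.DictReader(io.StringIO(SAMPLE))):
        print(",".join(record))
    # Prints:
    # 2020-03-23,Essex,Massachusetts
    # 2020-03-24,Union,New Jersey
    # 2020-04-15,Wake,North Carolina
```

To run it against the real file, replace the io.StringIO(SAMPLE) argument with a file opened via open("us-counties.csv", newline=""). The file is read one row at a time, so memory use should stay small however long the file gets.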

Regards,

        Neil Faiman

> On Jul 29, 2020, at 9:54 AM, Lewis Downey <[email protected]> wrote:
> 
> Hello!
> 
> I do not know where to start to put together a grep pattern that will parse a 
> 375,000+ line csv file in a specific way for a personal project. The file is 
> publicly available from the NY Times and contains data related to tracking 
> Covid-19 
> <https://github.com/nytimes/covid-19-data/blob/master/us-counties.csv> cases 
> and deaths. Each line consists of six fields:
> 
> * a date 
> * a county name 
> * a state name 
> * FIPS number for the county 
> * cumulative number of Covid-19 cases reported in that county as of that date
> * cumulative number of Covid-19 deaths reported in that county as of that date
> 
> I am trying to extract the date of the first reported Covid-19 death for each 
> county in the US. This would be the first time that the last field is greater 
> than 0 for unique county+state combinations. About 1/3 of all counties still 
> have no reported deaths. Those counties should not be found by the 
> pattern. In round numbers there are about 3100 counties in the report. 
> Roughly 2000 have reported at least one Covid-19 death. I am trying to find 
> those specific 2000ish counties and include the date of the first reported 
> death.
> 
> The pattern needs to find the first instance of unique combinations of county 
> & state where the last comma-delimited field is greater than zero. The 
> pattern will be used by BBEdit in an AppleScript.  The AppleScript execution 
> needs to be as brief as possible so as not to commandeer my laptop for a 
> meaningful chunk of time. The script will be run at least once each week. By 
> iterating through a list of counties I made AppleScript, BBEdit, & Numbers do 
> this, but as built the execution takes too long, is prone to running out of 
> memory, and otherwise makes my laptop close to unusable while it runs. An 
> efficient grep pattern would be incredibly helpful.
> 
> Here is a handcrafted sample of the file:
> date,county,state,fips,cases,deaths
> 2020-01-21,Snohomish,Washington,53061,1,0
> 2020-01-22,Snohomish,Washington,53061,1,0
> 2020-01-23,Snohomish,Washington,53061,1,0
> 2020-01-24,Cook,Illinois,17031,1,0
> 2020-01-24,Snohomish,Washington,53061,1,0
> 2020-01-25,Orange,California,06059,1,0
> 2020-01-25,Cook,Illinois,17031,1,0
> 2020-01-25,Snohomish,Washington,53061,1,0
> 2020-01-26,Maricopa,Arizona,04013,1,0
> 2020-01-26,Los Angeles,California,06037,1,0
> 2020-01-26,Orange,California,06059,1,0
> 2020-01-26,Cook,Illinois,17031,1,0
> 2020-03-03,Wake,North Carolina,37183,1,0
> 2020-03-04,Wake,North Carolina,37183,1,0
> 2020-03-09,Union,New Jersey,34039,1,0
> 2020-03-21,Essex,Massachusetts,25009,41,0
> 2020-03-22,Essex,Massachusetts,25009,60,0
> 2020-03-23,Essex,Massachusetts,25009,73,1
> 2020-03-24,Essex,Massachusetts,25009,118,1
> 2020-03-24,Union,New Jersey,34039,246,2
> 2020-03-25,Essex,Massachusetts,25009,177,1
> 2020-04-09,Union,New Jersey,34039,5203,145
> 2020-04-12,Essex,Massachusetts,25009,3170,102
> 2020-04-13,Essex,Massachusetts,25009,3413,114
> 2020-04-15,Wake,North Carolina,37183,510,1
> 2020-05-06,Union,New Jersey,34039,13604,800
> 2020-05-07,Union,New Jersey,34039,13781,829
> 2020-05-08,Union,New Jersey,34039,13917,844
> 2020-06-30,Wake,North Carolina,37183,5178,47
> 2020-07-01,Wake,North Carolina,37183,5379,48
> 2020-07-02,Wake,North Carolina,37183,5590,49
> 
> From that sample the grep pattern needs to find exactly these three lines:
> 2020-03-23,Essex,Massachusetts,25009,73,1    <- 1st occurrence of Essex, 
> Massachusetts w/ last field greater than zero
> 2020-03-24,Union,New Jersey,34039,246,2        <- 1st occurrence of Union, 
> New Jersey w/ last field greater than zero
> 2020-04-15,Wake,North Carolina,37183,510,1    <- 1st occurrence of Wake, 
> North Carolina w/ last field greater than zero
> 
> And report them as
> 2020-03-23,Essex,Massachusetts
> 2020-03-24,Union,New Jersey
> 2020-04-15,Wake,North Carolina
> 
> There are other counties in the sample but none of the others have a 6th 
> comma-delimited field with a value greater than 0. Those counties should not 
> be found. 
> That said, an alternative report is perfectly workable if it is easier to 
> generate (as shown below but without the descriptive comments). It is hard 
> for me to see how this alternative report
> would be easier to generate, but grep has plenty of other qualities I would 
> not imagine.
> 
> 2020-01-25,Snohomish,Washington,0        <- most recent date reported for 
> county where last comma-delimited field = 0
> 2020-01-26,Cook,Illinois,0                       <- most recent date reported 
> for county where last comma-delimited field = 0
> 2020-01-26,Los Angeles,California,0        <- most recent date reported for 
> county where last comma-delimited field = 0
> 2020-01-26,Maricopa,Arizona,0            <- most recent date reported for 
> county where last comma-delimited field = 0
> 2020-01-26,Orange,California,0            <- most recent date reported for 
> county where last comma-delimited field = 0
> 2020-03-23,Essex,Massachusetts,1       <- number of deaths first reported now 
> included in report (almost always 1)
> 2020-03-24,Union,New Jersey,2         <- number of deaths first reported now 
> included in report (almost always 1, 2 shown here to cover the case of more 
> than 1)       
> 2020-04-15,Wake,North Carolina,1       <- number of deaths first reported now 
> included in report (almost always 1)
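
The alternative report quoted above could be produced by a similar single pass over the file. This is only a sketch under the same assumptions: the column names from the sample header, rows in chronological order, and a cumulative deaths column. The function name status_report and the small SAMPLE_ROWS subset are made up for the example, and treating a blank deaths field as zero is an assumption too.

```python
import csv
import io


def status_report(rows):
    """Return sorted (date, county, state, deaths) tuples, one per county+state.

    A county that has reported a death gets the date and count of its first
    row with deaths > 0. Any other county gets its most recent date with 0.
    Assumes rows arrive in chronological order.
    """
    report = {}  # (county, state) -> (date, deaths)
    for row in rows:
        key = (row["county"], row["state"])
        previous = report.get(key)
        if previous is not None and previous[1] > 0:
            continue  # first death already recorded; ignore later rows
        # Assumption: a blank or non-numeric deaths field is treated as zero.
        deaths = (row.get("deaths") or "").strip()
        count = int(deaths) if deaths.isdigit() else 0
        report[key] = (row["date"], count)
    return sorted(
        (date, county, state, count)
        for (county, state), (date, count) in report.items()
    )


# A few rows taken from the sample in the original message.
SAMPLE_ROWS = """\
date,county,state,fips,cases,deaths
2020-01-24,Cook,Illinois,17031,1,0
2020-01-25,Cook,Illinois,17031,1,0
2020-03-22,Essex,Massachusetts,25009,60,0
2020-03-23,Essex,Massachusetts,25009,73,1
2020-03-24,Essex,Massachusetts,25009,118,1
2020-03-24,Union,New Jersey,34039,246,2
"""

if __name__ == "__main__":
    for date, county, state, count in status_report(
        csv.DictReader(io.StringIO(SAMPLE_ROWS))
    ):
        print(f"{date},{county},{state},{count}")
    # Prints:
    # 2020-01-25,Cook,Illinois,0
    # 2020-03-23,Essex,Massachusetts,1
    # 2020-03-24,Union,New Jersey,2
```

The dictionary holds one entry per county, so even with roughly 3100 counties it stays small.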
> 
> I can build grep patterns for really easy stuff but do not know how to 
> approach this. Is it even possible? 
> 
> Any help you provide will be appreciated.  Thank you for taking the time to 
> think about this question.
> 
> Lewis
> 
> 
> -- 
> This is the BBEdit Talk public discussion group. If you have a feature 
> request or need technical support, please email "[email protected]" 
> rather than posting here. Follow @bbedit on Twitter: 
> <https://twitter.com/bbedit>
> --- 
> You received this message because you are subscribed to the Google Groups 
> "BBEdit Talk" group.
> To unsubscribe from this group and stop receiving emails from it, send an 
> email to [email protected] 
> <mailto:[email protected]>.
> To view this discussion on the web visit 
> https://groups.google.com/d/msgid/bbedit/455abe22-f5d7-411c-9615-bdc7968f3256o%40googlegroups.com
>  
> <https://groups.google.com/d/msgid/bbedit/455abe22-f5d7-411c-9615-bdc7968f3256o%40googlegroups.com?utm_medium=email&utm_source=footer>.
