Hi Nick:

The primary problem with this proposal is not the establishment of new highs as 
such, with filter limits pushed ever higher, but the inability to catch 
important data-entry errors -- the ones measurable in orders of magnitude.

Example:  Red-naped Sapsucker occurs in Adams Co. mostly as single individuals 
and, for argument's sake, we'll say that there are 107 such entries (I, 
personally, never saw more than one/day in the county in my 14 years there).  
Then, someone scores two of 'em, which the filter flags.  The observer 
describes well and/or photographs both birds (perhaps they were both banded at 
the banding station) and the report is validated.  With an automated system for 
filter limits, the filter would then climb to 2.  The next fall, someone 
mistakenly hits the '2' key when s/he intended to hit the '1' key and the new 
automatic filter limit allows it to enter the data set without review, despite 
the fact that it would be only the second time in at least 109 occurrences in 
the county recorded by eBird in which more than one was reported.  Any 
statistician will tell you that that is significantly different than normal, 
yet such an entry would receive no oversight and an error would be included in 
the data set.  However, I would have maintained the filter at 1 and would have 
caught that error, requesting details from the observer who, hopefully, would 
realize the error and fix it.
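
The scenario above can be sketched in a few lines of Python.  This is only an 
illustration, assuming a hypothetical "ratcheting" rule in which a validated 
count above the limit becomes the new limit; the function name and the list of 
counts are made up for the example and are not eBird's code or data.

```python
def flagged_entries(counts, limit, ratchet):
    """Return the indices of entries that exceed the limit and so get reviewed.

    counts  -- sequence of (count, is_genuine) pairs in submission order
    limit   -- starting filter limit (largest count accepted without review)
    ratchet -- if True, a flagged entry that is genuine (validated on review)
               raises the limit to that count (the assumed automatic rule)
    """
    flagged = []
    for i, (count, is_genuine) in enumerate(counts):
        if count > limit:
            flagged.append(i)
            if ratchet and is_genuine:
                limit = count
    return flagged


# 107 single birds, one genuine report of 2, then a typo of 2 meant to be 1.
history = [(1, True)] * 107 + [(2, True), (2, False)]

fixed = flagged_entries(history, limit=1, ratchet=False)
automatic = flagged_entries(history, limit=1, ratchet=True)

print(fixed)      # both reports of 2 are reviewed
print(automatic)  # only the genuine report of 2 is reviewed; the typo is not
```

With the limit held at 1, both entries of 2 reach the reviewer; with the 
ratcheting rule, the mistaken 2 is accepted without review.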

Granted, it's probably unlikely that the filter limit would climb any further in 
that situation, but let's extend the argument anyway with a different species.

eBird has 39 checklists from Larimer Co. during the seven-day period of 15-21 
October that include Double-crested Cormorant (DCCO).  The current filter limit 
for the county in October is 600, the max from the seven-day period is 2500, 
and the average abundance is 397, which is well below the current filter limit. 
 The single highest count accounts for >16% of all of the DCCOs for that time 
period in Larimer and any statistician will tell you that that is an outlier.  
I cannot say what the second-highest tally is, since it is not possible to 
determine it without downloading all of the Larimer data or hunting for it among 
the huge number of occurrences on the eBird map for October (one cannot generate 
maps for time periods shorter than one month, and reviewers cannot easily get an 
output that provides this sort of information).  However, removing the 2500 
drops the average count/checklist to 341.  So, for argument's sake, let's use 
the arbitrary value of 700 for the second-highest tally.  This, then, might have 
been the filter limit under an automatic-filter system.
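
The arithmetic behind those figures can be checked with a short Python sketch.  
It assumes the total is simply the number of checklists times the quoted 
average, so the results are approximate (the 397 is itself a rounded value); 
the variable names are illustrative.

```python
checklists = 39      # checklists with DCCO, 15-21 October, Larimer Co.
mean_count = 397     # quoted average abundance (rounded)
max_count = 2500     # single highest count

total = checklists * mean_count                  # approximate total of DCCOs
share_of_max = max_count / total                 # fraction from one checklist
mean_without_max = (total - max_count) / (checklists - 1)

print(f"approximate total: {total}")
print(f"share of highest count: {share_of_max:.1%}")
print(f"mean without highest count: {mean_without_max:.1f}")
```

Within rounding, this agrees with the share of just over 16% and the drop in 
the average to about 341 once the single highest count is removed.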

With the validation of the 2500 count, the filter limit gets bumped, and the 
range of possible data-entry errors that are automatically accepted increases 
dramatically.  Not only do all of the potential mistakes beginning with 
the digit '1' get accepted (e.g., 1000 for 100, 1100 for 100, 1200 for 120), 
but many possible errors beginning with the digit '2' now get accepted, too 
(e.g., 2000 for 200, 2200 for 200 or even 20, etc.).  However, with the lower filter 
limit, not only will observers be more likely to catch their errors ("Why is 
that entry of 20 flagged? Oh, I accidentally hit the '2' twice as well as the 
'0' twice") before checklist submission, but the local reviewer will also have 
a chance to confirm that one actually intended 2200 rather than some other entry.
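
As a rough illustration, the example typos above can be run against the two 
candidate limits.  The sketch assumes only that an entry is reviewed when it is 
larger than the limit; the helper name is hypothetical.

```python
def passes_unreviewed(entered, limit):
    """An entry is accepted without review when it does not exceed the limit."""
    return entered <= limit


# (intended, entered) pairs taken from the examples in the text above.
typos = [(100, 1000), (100, 1100), (120, 1200),
         (200, 2000), (200, 2200), (20, 2200)]

missed_by_limit = {}
for limit in (700, 2500):
    missed = [(i, e) for i, e in typos if passes_unreviewed(e, limit)]
    missed_by_limit[limit] = missed
    print(f"limit {limit}: {len(missed)} of {len(typos)} typos accepted unreviewed")
```

Under the limit of 700, every one of these mistakes is flagged; under a limit 
of 2500, none of them is.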

The take-home message is that outliers are outliers and should not create the 
bounds between which data are considered "non-outliers."

Tony

Tony Leukering
Largo, FL

 

 

 

-----Original Message-----
From: Komar, Nick (CDC/OID/NCEZID) (CDC/OID/NCEZID) <[email protected]>
To: coloradodipper <[email protected]>; cobirds <[email protected]>
Cc: clw37 <[email protected]>; bls42 <[email protected]>; mji26 
<[email protected]>
Sent: Tue, Jan 21, 2014 3:13 pm
Subject: RE: [cobirds] Colorado eBird:  Filters and filter limits



Tony:
 
Wow. I had no idea so much time and effort goes into the filtering process for 
Colorado eBird entries. Volunteers who put in as much time and effort (and 
above all, quality time and quality effort) as you do deserve a big award, or 
even better, a big reward!
 
eBird team:
 
Regarding filters in place on the number of individuals reported for a species, 
I’ve noticed some filters recently that need adjusting. Examples for Larimer 
County (Colorado) would include Double-crested Cormorant and California Gull. I 
think the filters are currently 1000 and 400, respectively. However, it is not 
unusual at certain hotspots to surpass these filters. The problem is that these 
large congregations are site-specific, and it would be too labor-intensive to 
have filters established by human beings (even superhumans like Tony) at scales 
below county level. So, here is a thought (in case you all had not already 
thought of it). For hotspots or broader geographic areas (e.g., counties) with a 
certain threshold number of checklists, have eBird automatically generate 
filters. This is already in place for birds not on the default list for the 
location, because adding a species requires the user to confirm the addition. 
But for the number of individuals for the species already on the default list, 
an automatic variable filter could be programmed for all species that would be 
equal to each species’ previous high count for the location (and period). In 
this way, eBird would ask for confirmation for any reported datum only when a 
new high count is established for that species at that location and period, and 
these site-specific filters would automatically increase over time as 
new high counts are established at a fine geographic scale. For most (common) 
species that don’t really merit the effort to continuously manage filters even 
at broad geographic scales, this system could mitigate input errors that would 
erroneously establish new high counts reported to eBird for that location and 
period. For rare species that merit human review, a lower fixed threshold still 
makes sense. If this system were put in place, and gulls start piling up this 
winter at Horseshoe Lake in Loveland, CO, then every time I report more than 
400 Cal Gulls, I would not be required to comment (a bird log feature); 
however, if at any point I report a new high count for Horseshoe Lake, I would 
be cross-checked by eBird to ensure the input number was not an error.
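
The idea could be sketched roughly as follows in Python. Everything here is 
hypothetical: the class, its parameter names, the fallback to a county-level 
limit, and the example counts are assumptions for illustration, not anything 
eBird actually provides.

```python
from collections import defaultdict


class HighCountFilter:
    """Hypothetical automatic filter: the limit is the previous high count."""

    def __init__(self, fallback_limit, min_checklists):
        self.fallback_limit = fallback_limit  # e.g., the county-level limit
        self.min_checklists = min_checklists  # data needed before trusting a site
        self.high = defaultdict(int)          # key -> highest accepted count
        self.n = defaultdict(int)             # key -> number of checklists

    def limit(self, key):
        """key is a (location, species, period) tuple."""
        if self.n[key] < self.min_checklists:
            return self.fallback_limit
        return self.high[key]

    def needs_confirmation(self, key, count):
        return count > self.limit(key)

    def record(self, key, count):
        """Store a count that was accepted or confirmed."""
        self.n[key] += 1
        self.high[key] = max(self.high[key], count)


f = HighCountFilter(fallback_limit=400, min_checklists=3)
key = ("Horseshoe Lake", "California Gull", "January")
for count in (150, 900, 300):   # made-up counts, assumed already confirmed
    f.record(key, count)

print(f.needs_confirmation(key, 650))    # below the site's high count
print(f.needs_confirmation(key, 1200))   # would be a new high count
```

In this sketch a report of 650 passes without comment once the site has a 
confirmed high of 900, while 1200 prompts a confirmation.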
 
If this idea has already been considered, I apologize for taking up your 
valuable time, and keep up all the good work.
 
Nick Komar
Fort Collins CO
 
From: [email protected] [mailto:[email protected]]On Behalf Of 
[email protected]
Sent: Tuesday, January 21, 2014 9:15 AM
To: [email protected]
Cc: [email protected]; [email protected]; [email protected]
Subject: [cobirds] Colorado eBird: Filters and filter limits
 
Cobirders:

Since the question has come up privately a couple of times recently, I thought 
that I would respond publicly in this venue, as the information may be 
appreciated by all of Colorado's eBirders.

In the beginning, eBird was a very simple and simplistic world.  From the 
start, though, the powers-that-were deemed it important to have filters for 
input data in order to flag entries that were atypical.  These first filters 
were usually state-based, one-filter-per-state things that provided gross 
estimates of numbers acceptable for that state in each of the 12 months of the 
calendar.  Chris Wood and I constructed that first Colorado filter.  At that 
time, there were no non-species entries.  That is, no spuhs, slashes, hybrids, 
subspecies.  There were just species.

As eBird has become more refined with much more capacity and capability, 
filters have become far more complex.  First was the splitting of the 
statewide filter into regional filters; for Colorado, there were five:  
Northeast, Southeast, Mountains, Northwest, Southwest.  That, obviously, 
required some fine-tuning of each of those five filters to more closely match 
each subregion's avifauna, such as omitting Northern Bobwhite from the three 
western filters and Gunnison Sage-Grouse from the two eastern filters.  
Second was the addition of various non-species-level entries, the spuhs and 
the slashes (e.g., Semipalmated/Western Sandpiper, peep sp.).  That meant going 
through each of the five then-extant filters and adding those non-species 
entries relevant to each filter, which was done on a fairly conservative basis 
-- only the really common non-specific entries were added, such as Snow/Ross's 
Goose and Cackling/Canada Goose.  That wasn't too bad; tedious, but not too 
bad, and at that time, I was the only person working on Colorado's eBird 
filters.

With the addition of Marshall Iliff as the final member of what is familiarly 
called the eBird trinity (Chris Wood, Brian Sullivan, and Marshall) that runs 
the program, eBird's abilities expanded further, with a more-in-depth taxonomy 
that was to cover the entire planet.  Hybrids were added, and many, many more 
non-species entries were added, even in the ABA area, such that there are 
probably now more non-species-level entries available in the ABA area than 
species-level entries, some used exceedingly rarely, some widely used.

Then, eBird tackled the 'April problem.'  Those of us in the filter and 
record-review aspects of eBird (and I was, and am, doing both) had for years 
complained that the rigid monthly structure to the filters made for some major 
problems, with April being the poster child for such problems.  In much of the 
ABA area, particularly the Lower 48, filter makers/editors had to decide to 
filter all occurrences of a migrant species that arrived in the filter region 
in the last few days of April, or allow all occurrences of such species, even 
in early April when they were unknown.  In Colorado, MacGillivray's Warbler is 
an excellent case in point, with the vast bulk of migrants arriving in May and 
a very small number typically noted in the last week of April, but with the 
species unknown in the state prior to the 22nd or so.

The solution was to throw out the monthly framework, replacing it with, 
essentially, a weekly framework, but not tied to any particular idea of 'week.' 
 While there are limits in all things -- and this new system's overarching 
limit was a maximum of 13 temporal filter periods per species per filter -- the 
new system allowed chopping up, particularly, the short, intense spring 
migration of most migrant species into periods as small as five days, with each 
period allowed its own filter limit.  Each filter period has a number that is 
'permitted,' while any larger number of birds of that species in that time 
period would require review.  As an example, the in-construction Lincoln County 
filter has five filter periods covering the spring migration of Clay-colored 
Sparrow, allowing as many as 1 during 22-30 April, 9 during 1-7 May, 29 during 
8-14 May, 15 during 15-21 May, and 9 during 22-31 May.  While we could simply 
allow any number, doing so would mean that there was no way to catch numeric 
data-entry errors, such as 10 entered instead of 1 or 355 entered instead of 35 
(I have seen both of these mistakes, which are easy to make when using the 
number pad on a computer or laptop).  In essence, a filter limit is the 
result of a decision about a tenuous balance between what might occur and 
data-entry errors, and such decisions need to be made for as many as 13 
temporal periods in each of as many as 400 species and 175 non-species entries 
in each filter.
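
A minimal Python sketch of how such date-based limits might be looked up, 
using the Clay-colored Sparrow periods quoted above.  The data layout, the 
function names, and the choice to review every report outside the listed 
periods are assumptions for illustration, not how eBird stores its filters.

```python
from datetime import date

# ((start month, start day), (end month, end day), limit) for Clay-colored
# Sparrow in spring, as quoted above for the Lincoln County filter.
PERIODS = [
    ((4, 22), (4, 30), 1),
    ((5, 1), (5, 7), 9),
    ((5, 8), (5, 14), 29),
    ((5, 15), (5, 21), 15),
    ((5, 22), (5, 31), 9),
]


def limit_for(day, periods=PERIODS, default=0):
    """Return the permitted count for a date.

    Outside all listed periods this returns `default` (0 here, so that every
    report is reviewed, which is an assumption of this sketch).
    """
    key = (day.month, day.day)
    for start, end, limit in periods:
        if start <= key <= end:
            return limit
    return default


def needs_review(day, count):
    """A count larger than the permitted number requires review."""
    return count > limit_for(day)


print(needs_review(date(2014, 5, 10), 10))   # within the 8-14 May limit of 29
print(needs_review(date(2014, 4, 25), 10))   # e.g., 10 entered instead of 1
print(needs_review(date(2014, 5, 10), 355))  # e.g., 355 entered instead of 35
```

The first report passes unreviewed, while both of the typo-sized entries are 
flagged.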

While the new filter system allows an excellent amount of flexibility in 
constructing species- and location-specific filters, it is also much more 
complex and much more time-consuming to construct.  It takes me something like 
12-20 hours of tedious effort to make a new filter from scratch and not much 
less than that to use existing eBird data to fine-tune existing active filters. 
 I use the temporal spread and abundance values from existing eBird data to 
create new filters or to modify existing filters.  Depending upon the region, 
the filter includes some 300-400 species-level entries with 125-175 other taxa 
(spuhs, slashes, hybrids, etc.), and multiple temporal periods per taxon for 
nearly every taxon.

There are 28 active filters now covering eBird Colorado.  I also have 16 
filters in some stage of construction to enable better fit to particular 
counties currently covered by more-general, multi-county filters.  The 
rationale for smaller-scale filters is generally self-evident, but as an 
example, I constructed a filter for Phillips County a few years ago.  I did 
that because Phillips County eBird data were being filtered by the general 
northeast Colorado filter, which, at that point, included Weld, Morgan, 
Washington, Kit Carson, Yuma, Phillips, Sedgwick, and Logan counties.  Note 
that all of those counties but Phillips have at least part of a major water 
body in them.  Thus, 
Phillips County data were allowed to include large numbers of waterbird species 
that were actually fairly rare there.

However, just because a particular filter covers just one county does not mean 
that there aren't still difficult decisions about filtering to be made.  Common 
Raven provides an excellent example of this challenge.  The species is regular 
in small numbers in the far western part of Arapahoe County, but nowhere else 
in the county.  It is also regular at the Rocky Mountain Arsenal NWR in the 
southwestern corner of Adams County, but is virtually unknown in the vast 
majority of the rest of Adams.  Both Arapahoe and Adams are filtered by 
county-specific filters, so I have to decide whether to allow unfiltered Common 
Ravens from parts of those two counties where they do not occur or to review 
every entry of Common Raven, even those in the parts of the counties in which 
they are known to occur with regularity.


As more data are entered in eBird, the data set for a particular filter region 
becomes more robust and allows for more-precise filter limits and temporal 
periods, and I am constantly trying to incorporate fine-tuning of existing 
filters.  However, I also am endeavoring to construct new filters to get the 
review of data from counties like Lincoln out of the hands of more-general 
filters that are not particularly effective for the county (currently in the 
Southeast filter, which also includes Baca, Prowers, Kiowa, Bent, Otero, and 
Crowley counties).  Thus, you may encounter remnants of previous filter 
strategies when entering data into eBird, simply because I have not found the 
free time to completely revamp older filters.  Please bear with me on these 
minor problems while I'm still dealing with larger ones, as the 100s and 100s 
of hours that I spend on eBird filter and review tasks are a volunteer effort.  
However, feel free to drop me a line about any particularly egregious filter 
problems.

Current Colorado eBird filters:
Adams
Arapahoe
Archuleta, Dolores, La Plata, Montrose, San Miguel
Boulder
Broomfield, Denver
Chaffee
Clear Creek, Gilpin
Custer
Delta, Mesa
Douglas
El Paso
Elbert
Fremont
Huerfano
Jefferson
Larimer
Las Animas
Montezuma
Northern Mountains (Grand, Jackson, Lake, Park, Summit, Teller)
Northeast (Kit Carson, Logan, Morgan, Sedgwick, Washington, Yuma)
Northwest (Eagle, Garfield, Moffat, Pitkin, Rio Blanco, Routt)
Phillips
Pueblo
San Juan
San Luis Valley (Alamosa, Conejos, Costilla, Rio Grande, Saguache)
Southeast (Baca, Bent, Cheyenne, Crowley, Kiowa, Lincoln, Otero, Prowers)
Southwest montane (Gunnison, Hinsdale, Mineral, Ouray)
Weld

Filters in construction:
Archuleta
Baca
Bent
Cheyenne
Crowley, Otero
Dolores, Montrose, San Miguel
Grand, Jackson
Kit Carson, Yuma
La Plata
Lake
Lincoln
Ouray
Park
Prowers
Summit
Teller


Tony

 

 


Tony Leukering

Largo, FL

http://www.flickr.com/photos/tony_leukering/

http://aba.org/photoquiz/


 


-- 
You received this message because you are subscribed to the Google Groups 
"Colorado Birds" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to [email protected].
To post to this group, send email to [email protected].
To view this discussion on the web visit 
https://groups.google.com/d/msgid/cobirds/8D0E4DB1F30A9F9-11D4-2AC2%40webmail-d207.sysops.aol.com.
For more options, visit https://groups.google.com/groups/opt_out.


