Hi Nick:
The primary problem inherent in this proposal is not the establishment of new
highs itself, with filter limits pushed ever higher, but the resulting inability
to catch important data-entry errors -- the ones measurable in orders of magnitude.
Example: Red-naped Sapsucker occurs in Adams Co. mostly as single individuals
and, for argument's sake, we'll say that there are 107 such entries (I,
personally, never saw more than one/day in the county in my 14 years there).
Then, someone scores two of 'em, which the filter flags. The observer
describes well and/or photographs both birds (perhaps they were both banded at
the banding station) and the report is validated. With an automated system for
filter limits, the filter would then climb to 2. The next fall, someone
mistakenly hits the '2' key when s/he intended to hit the '1' key and the new
automatic filter limit allows it to enter the data set without review, despite
the fact that it would be only the second time in at least 109 occurrences in
the county recorded by eBird in which more than one was reported. Any
statistician will tell you that that is significantly different from normal,
yet such an entry would receive no oversight and an error would be included in
the data set. However, I would have maintained the filter at 1 and would have
caught that error, requesting details from the observer who, hopefully, would
realize the error and fix it.
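The sapsucker scenario can be sketched in a few lines of Python. This is a hypothetical illustration, not eBird code: the names needs_review and ratchet are invented here, and the rule that an automatic limit rises to each validated high is simply the assumption under discussion.

```python
# Hypothetical sketch, not eBird code: a fixed filter limit versus an
# "automatic" limit that ratchets up to the highest validated count.

def needs_review(count, limit):
    """An entry is flagged for review when it exceeds the filter limit."""
    return count > limit

def ratchet(limit, validated_count):
    """Assumed automatic rule: the limit rises to any validated new high."""
    return max(limit, validated_count)

fixed_limit = 1
auto_limit = 1

# The 107 single-bird entries pass unflagged under either scheme.
history = [1] * 107
any_flagged = any(needs_review(c, fixed_limit) for c in history)

# A well-documented report of 2 is flagged, validated, and raises the
# automatic limit; the fixed limit stays where the reviewer set it.
valid_report_flagged = needs_review(2, auto_limit)
auto_limit = ratchet(auto_limit, 2)

# Next fall: someone types 2 when 1 was intended.
typo = 2
caught_by_fixed = needs_review(typo, fixed_limit)
caught_by_auto = needs_review(typo, auto_limit)

print(caught_by_fixed, caught_by_auto)
```

Under these assumptions the fixed limit flags the typo and the ratcheted limit lets it through, which is the whole of the concern.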
Granted, it's probably unlikely that the filter limit would climb any more in
that situation, but let's extend the argument anyway with a different species.
eBird has 39 checklists from Larimer Co. during the seven-day period of 15-21
October that include Double-crested Cormorant (DCCO). The current filter limit
for the county in October is 600, the max from the seven-day period is 2500,
and the average abundance is 397, which is well below the current filter limit.
The single highest count accounts for >16% of all of the DCCOs for that time
period in Larimer and any statistician will tell you that that is an outlier.
Since it is not possible to determine what the second-highest tally is without
downloading all of the Larimer data or hunting for it among the huge number of
occurrences on the eBird map for October (one cannot generate maps for time
periods shorter than one month and reviewers cannot get an output that will
generate this sort of information, at least, not at all easily), I cannot say
what that next-highest value is, but by removing the 2500, the average
count/checklist drops to 341. So, for argument's sake, let's use the arbitrary
value of 700 for the second-highest tally. This, then, might have been the
filter limit under an automatic-filter system.
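The cormorant figures above can be checked with simple arithmetic. The sketch below uses only the numbers quoted in this message; because the average of 397 is itself rounded, the derived total and the results are approximate.

```python
# Worked arithmetic for the Larimer Co. DCCO example (15-21 October).
# Inputs are the figures quoted in the message; the mean is rounded,
# so the reconstructed total is approximate.
n_checklists = 39
mean_count = 397
max_count = 2500

total_birds = n_checklists * mean_count            # reconstructed total
share_of_max = max_count / total_birds             # fraction from one checklist
mean_without_max = (total_birds - max_count) / (n_checklists - 1)

print(f"total ~ {total_birds}")
print(f"single highest count share ~ {share_of_max:.1%}")
print(f"mean without the 2500 ~ {mean_without_max:.1f}")
```

The share comes out just over 16%, and the mean without the high count comes out between 341 and 342, matching the values given above.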
With the validation of the 2500 count, the filter limit gets bumped. Now, the
range of possible data-entry errors that are automatically accepted increases
dramatically. Now, not only do all of the potential mistakes beginning with
the digit '1' get accepted (e.g., 1000 for 100, 1100 for 100, 1200 for 120),
but now many possible errors beginning with the digit '2' get accepted (e.g.,
2000 for 200, 2200 for 200 or even 20, etc.). However, with the lower filter
limit, not only will observers be more likely to catch their errors ("why is
that entry of 20 flagged? oh, I accidentally hit the '2' twice as well as the
'0' twice") before checklist submission, but the local reviewer will also have a
chance to confirm that one actually intended 2200 rather than some other entry.
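To make the comparison concrete, here is a minimal Python sketch that runs the example typos from the paragraph above against the two limits. The function name is hypothetical, and it assumes an entry is accepted without review whenever it does not exceed the limit; 700 is the arbitrary second-highest tally chosen earlier.

```python
# (intended, entered) pairs taken from the examples in the text.
TYPOS = [(100, 1000), (100, 1100), (120, 1200),
         (200, 2000), (200, 2200), (20, 2200)]

def accepted_unreviewed(limit, typos=TYPOS):
    """Return the erroneous entries that would NOT be flagged at this limit.

    Assumption: an entry is flagged only when it exceeds the limit.
    """
    return [(intended, entered)
            for intended, entered in typos
            if entered <= limit]

lower = accepted_unreviewed(700)     # the arbitrary lower limit
higher = accepted_unreviewed(2500)   # the limit after validating the 2500 count

print(len(lower), "errors slip through at 700")
print(len(higher), "errors slip through at 2500")
```

With the limit at 700, every one of these mistakes is flagged; with the limit at 2500, none of them is.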
The take-home message is that outliers are outliers and should not create the
bounds between which data are considered "non-outliers."
Tony
Tony Leukering
Largo, FL
-----Original Message-----
From: Komar, Nick (CDC/OID/NCEZID) <[email protected]>
To: coloradodipper <[email protected]>; cobirds <[email protected]>
Cc: clw37 <[email protected]>; bls42 <[email protected]>; mji26
<[email protected]>
Sent: Tue, Jan 21, 2014 3:13 pm
Subject: RE: [cobirds] Colorado eBird: Filters and filter limits
Tony:
Wow. I had no idea so much time and effort goes into the filtering process for
Colorado ebird entries. Volunteers who put in as much time and effort (and
above all, quality time and quality effort) as you do deserve a big award, or
even better, a big reward!
Ebird team:
Regarding filters in place on the number of individuals reported for a species,
I’ve noticed some filters recently that need adjusting. Examples for Larimer
County (Colorado) would include Double-crested Cormorant and California Gull. I
think the filters are currently 1000 and 400, respectively. However, it is not
unusual at certain hotspots to surpass these filters. The problem is that these
large congregations are site specific, and it would be too labor intensive to
have filters established by human beings (even superhumans like Tony) at scales
below county level. So, here is a thought (in case you all had not already
thought of it). For hotspots or broader geographic areas (e.g. counties) with a
certain threshold number of checklists, have ebird automatically generate
filters. This is already in place for birds not on the default list for the
location, because adding a species requires the user to confirm the addition.
But for the number of individuals for the species already on the default list,
an automatic variable filter could be programmed for all species that would be
equal to each species’ previous high count for the location (and period). In
this way, ebird would ask for confirmation for any reported datum only when a
new high count is established for that species at that location and period. In
this way, these site-specific filters would automatically increase over time as
new high counts are established at a fine geographic scale. For most (common)
species that don’t really merit the effort to continuously manage filters even
at broad geographic scales, this system could mitigate input errors that would
erroneously establish new high counts reported to ebird for that location and
period. For rare species that merit human review, a lower fixed threshold still
makes sense. If this system were put in place, and gulls start piling up this
winter at Horseshoe Lake in Loveland, CO, then every time I report more than
400 Cal Gulls, I would not be required to comment (a bird log feature);
however, if at any point I report a new high count for Horseshoe Lake, I would
be cross-checked by ebird to ensure the input number was not an error.
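As a rough sketch of the proposal in the paragraph above, in Python: the class and method names are hypothetical, this is not how eBird is implemented, and the seed value of 650 is invented purely for illustration.

```python
class HighCountFilter:
    """Hypothetical per-site filter: the limit is the previous confirmed high."""

    def __init__(self, seed_highs=None):
        # Maps (location, species, period) -> highest confirmed count so far.
        self.highs = dict(seed_highs or {})

    def needs_confirmation(self, location, species, period, count):
        # Assumption: with no prior record, any count is treated as a new high.
        return count > self.highs.get((location, species, period), 0)

    def confirm(self, location, species, period, count):
        # A confirmed report can only raise the stored high, never lower it.
        key = (location, species, period)
        self.highs[key] = max(self.highs.get(key, 0), count)


key = ("Horseshoe Lake", "California Gull", "January")
f = HighCountFilter({key: 650})              # 650 is an invented seed value

routine = f.needs_confirmation(*key, 500)    # below the previous high
new_high = f.needs_confirmation(*key, 800)   # exceeds the previous high
f.confirm(*key, 800)
repeat = f.needs_confirmation(*key, 800)     # no longer a new high

print(routine, new_high, repeat)
```

In this sketch, a count of 500 passes silently, a count of 800 prompts for confirmation once, and after confirmation the same count of 800 passes silently from then on.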
If this idea has already been considered, I apologize for taking up your
valuable time, and keep up all the good work.
Nick Komar
Fort Collins CO
From: [email protected] [mailto:[email protected]] On Behalf Of
[email protected]
Sent: Tuesday, January 21, 2014 9:15 AM
To: [email protected]
Cc: [email protected]; [email protected]; [email protected]
Subject: [cobirds] Colorado eBird: Filters and filter limits
Cobirders:
Since the question has come up privately a couple of times recently, I thought
that I would respond publicly in this venue, as the information may be
appreciated by all of Colorado's eBirders.
In the beginning, eBird was a very simple and simplistic world. From the
start, though, the powers-that-were deemed it important to have filters for
input data in order to flag entries that were atypical. These first filters
were usually state-based, one-filter-per-state things that provided gross
estimates of numbers acceptable for that state in each of the 12 months of the
calendar. Chris Wood and I constructed that first Colorado filter. At that
time, there were no non-species entries. That is, no spuhs, slashes, hybrids,
subspecies. There were just species.
As eBird has become more refined with much more capacity and capability,
filters have become incredibly more complex. First was the replacement of the
statewide filter with regional filters; for Colorado there were five: Northeast,
Southeast, Mountains, Northwest, Southwest. That, obviously, required some
fine-tuning of each of those five filters to more closely match each
subregion's avifauna, such as excluding Northern Bobwhite from the three
western filters and Gunnison Sage-Grouse from the two eastern filters.
Second was the addition of various non-species-level entries, the spuhs and
the slashes (e.g., Semipalmated/Western Sandpiper, peep sp.). That meant going
through each of the five then-extant filters and adding those non-species
entries relevant to each filter, which was done on a fairly conservative basis
-- only the really common non-specific entries were added, such as Snow/Ross's
Goose and Cackling/Canada Goose. That wasn't too bad; tedious, but not too
bad, and at that time, I was the only person working on Colorado's eBird
filters.
With the addition of Marshall Iliff as the final member of what is familiarly
called the eBird trinity (Chris Wood, Brian Sullivan, and Marshall) that runs
the program, eBird's abilities expanded further, with a more-in-depth taxonomy
that was to cover the entire planet. Hybrids were added, many, many, many more
non-species entries were added, even in the ABA area, such that there are
probably now more non-species-level entries available in the ABA area than
species-level entries, some used exceedingly rarely, some widely used.
Then, eBird tackled the 'April problem.' Those of us in the filter and
record-review aspects of eBird (and I was, and am, doing both) had for years
complained that the rigid monthly structure to the filters made for some major
problems, with April being the poster child for such problems. In much of the
ABA area, particularly the Lower 48, filter makers/editors had to decide to
filter all occurrences of a migrant species that arrived in the filter region
in the last few days of April, or allow all occurrences of such species, even
in early April when they were unknown. In Colorado, MacGillivray's Warbler is
an excellent case in point, with the vast bulk of migrants arriving in May, but
with a very small number typically noted in the last week of April, but unknown
in the state prior to the 22nd or so.
The solution was to throw out the monthly framework, replacing it with,
essentially, a weekly framework, but not tied to any particular idea of 'week.'
While there are limits in all things -- and this new system's overarching
limit was a maximum of 13 temporal filter periods per species per filter -- the
new system allowed chopping up, particularly, the short, intense spring
migration of most migrant species into periods as small as five days, with each
period allowed its own filter limit. Each filter period has a number that is
'permitted,' while any larger number of birds of that species in that time
period would require review. As an example, the in-construction Lincoln County
filter has five filter periods covering the spring migration of Clay-colored
Sparrow, allowing as many as 1 during 22-30 April, 9 during 1-7 May, 29 during
8-14 May, 15 during 15-21 May, and 9 during 22-31 May. While we could simply
allow any number, doing so would mean that there was no way to catch numeric
data-entry errors, such as 10 entered instead of 1 or 355 entered instead of 35
(and I have seen both of these mistakes, which are easy to make when using the
number pad on a computer or laptop). In essence, a filter limit is the
result of a decision about a tenuous balance between what might occur and
data-entry errors, and such decisions need to be made for as many as 13
temporal periods in each of as many as 400 species and 175 non-species entries
in each filter.
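The Clay-colored Sparrow periods above can be represented as a small lookup table. The Python sketch below is illustrative only: the structure and function names are hypothetical, and treating dates outside the five listed spring periods as always flagged is an assumption, since the limits for the rest of the year are not given here.

```python
from datetime import date

# (start (month, day), end (month, day), permitted maximum), from the
# Lincoln County Clay-colored Sparrow example in the text.
SPRING_PERIODS = [
    ((4, 22), (4, 30), 1),
    ((5, 1), (5, 7), 9),
    ((5, 8), (5, 14), 29),
    ((5, 15), (5, 21), 15),
    ((5, 22), (5, 31), 9),
]

def limit_for(day, periods=SPRING_PERIODS):
    """Return the permitted maximum for this date, or None if no period matches."""
    month_day = (day.month, day.day)
    for start, end, limit in periods:
        if start <= month_day <= end:
            return limit
    return None

def needs_review(day, count, periods=SPRING_PERIODS):
    """Flag counts above the period limit.

    Assumption: dates outside the listed periods are always flagged.
    """
    limit = limit_for(day, periods)
    return True if limit is None else count > limit

# The two typos mentioned in the text: 10 for 1, and 355 for 35.
print(needs_review(date(2014, 4, 25), 10))    # limit is 1 in late April
print(needs_review(date(2014, 5, 10), 355))   # limit is 29 during 8-14 May
print(needs_review(date(2014, 5, 10), 29))    # at the limit, not above it
```

At the upper bounds stated above, 13 periods for each of 400 + 175 taxa works out to 7,475 such limits per filter.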
While the new filter system allows an excellent amount of flexibility in
constructing species- and location-specific filters, it is also much more
complex and much more time-consuming to construct. It takes me something like
12-20 hours of tedious effort to make a new filter from scratch and not much
less than that to use existing eBird data to fine-tune existing active filters.
I use the temporal spread and abundance values from existing eBird data to
create new filters or to modify existing filters. Depending upon the region,
the filter includes some 300-400 species-level entries with 125-175 other taxa
(spuhs, slashes, hybrids, etc.), and multiple temporal periods per taxon for
nearly every taxon.
There are 28 active filters now covering eBird Colorado. I also have 16
filters in some stage of construction to enable better fit to particular
counties currently covered by more-general, multi-county filters. The
rationale for smaller-scale filters is generally self-evident, but as an example,
I constructed a filter for Phillips County a few years ago. I did that because
Phillips County eBird data were being filtered by the general northeast
Colorado filter, which, at that point, included Weld, Morgan, Washington, Kit
Carson, Yuma, Phillips, Sedgwick, and Logan counties. Note that all of those
counties but Phillips have at least part of a major water body in them. Thus,
Phillips County data were allowed to include large numbers of waterbird species
that were actually fairly rare there.
However, just because a particular filter covers just one county does not mean
that there aren't still difficult decisions about filtering to be made. Common
Raven provides an excellent example of this challenge. The species is regular
in small numbers in the far western part of Arapahoe County, but nowhere else
in the county. It is also regular at the Rocky Mountain Arsenal NWR in the
southwestern corner of Adams County, but is virtually unknown in the vast
majority of the rest of Adams. Both Arapahoe and Adams are filtered by
county-specific filters, so I have to decide whether to allow unfiltered Common
Ravens from parts of those two counties where they do not occur or to have to
review every entry of Common Raven, even those in the parts of the counties in
which they are known to occur with regularity.
As more data are entered in eBird, the data set for a particular filter region
becomes more robust and allows for more-precise filter limits and temporal
periods, and I am constantly trying to incorporate fine-tuning of existing
filters. However, I also am endeavoring to construct new filters to get the
review of data from counties like Lincoln out of the hands of more-general
filters that are not particularly effective for the county (currently in the
Southeast filter, which also includes Baca, Prowers, Kiowa, Bent, Otero, and
Crowley counties). Thus, you may encounter remnants of previous filter
strategies when entering data into eBird, simply because I have not found the
free time to completely revamp older filters. Please bear with me on these
minor problems while I'm still dealing with larger ones, as the hundreds and
hundreds of hours that I spend on eBird filter and review tasks are a volunteer
effort.
However, feel free to drop me a line about any particularly egregious filter
problems.
Current Colorado eBird filters:
Adams
Arapahoe
Archuleta, Dolores, La Plata, Montrose, San Miguel
Boulder
Broomfield, Denver
Chaffee
Clear Creek, Gilpin
Custer
Delta, Mesa
Douglas
El Paso
Elbert
Fremont
Huerfano
Jefferson
Larimer
Las Animas
Montezuma
Northern Mountains (Grand, Jackson, Lake, Park, Summit, Teller)
Northeast (Kit Carson, Logan, Morgan, Sedgwick, Washington, Yuma)
Northwest (Eagle, Garfield, Moffat, Pitkin, Rio Blanco, Routt)
Phillips
Pueblo
San Juan
San Luis Valley (Alamosa, Conejos, Costilla, Rio Grande, Saguache)
Southeast (Baca, Bent, Cheyenne, Crowley, Kiowa, Lincoln, Otero, Prowers)
Southwest montane (Gunnison, Hinsdale, Mineral, Ouray)
Weld
Filters in construction
Archuleta
Baca
Bent
Cheyenne
Crowley, Otero
Dolores, Montrose, San Miguel
Grand, Jackson
Kit Carson, Yuma
La Plata
Lake
Lincoln
Ouray
Park
Prowers
Summit
Teller
Tony
Tony Leukering
Largo, FL
http://www.flickr.com/photos/tony_leukering/
http://aba.org/photoquiz/
--
You received this message because you are subscribed to the Google Groups
"Colorado Birds" group.
To unsubscribe from this group and stop receiving emails from it, send an email
to [email protected].
To post to this group, send email to [email protected].
To view this discussion on the web visit
https://groups.google.com/d/msgid/cobirds/8D0E4DB1F30A9F9-11D4-2AC2%40webmail-d207.sysops.aol.com.
For more options, visit https://groups.google.com/groups/opt_out.