[API-users] What happens to previous data after dataset/crawl?

Tim Robertson Mon, 29 Aug 2016 09:47:47 +0000

Hi David,

I hope you are well.


This stems from the community developed Darwin Core (DwC) standard building on 
top of the Dublin Core (DC) standard, rather than something GBIF have 
introduced.  I'm afraid it is a little messy, but I will try and explain.

The original thinking by the DwC authors was that dwc:occurrenceID would be 
used to identify an occurrence in nature, while dc:identifier would be a 
digital record identifier.
In practice almost everyone mapped their database foreign key to occurrenceID 
and thus they became pretty synonymous.

GBIF recommend that publishers put unique identifiers on records using 
dwc:occurrenceID.  The IPT helps enforce this.  This way the publisher can very 
easily control how many records they wish to appear in GBIF.  We pass through 
dc:identifier as you notice, but simply treat it like any other non-interpreted 
field; we have never interpreted this field, and it does not control how we 
identify records in the GBIF.org index.  However, it makes sense to provide it, 
as there could well be systems other than GBIF indexing the data, and they are 
likely to recognise Dublin Core terms.

Historically, where users have not provided occurrenceID, they must have given 
data mapped with dwc:institutionCode, dwc:collectionCode and dwc:catalogNumber, 
or the equivalent in the ABCD standard.
If a provider adds occurrenceID and leaves those 3 fields the same, we will 
recognise this and update the records.  If they were to remove one of the 
fields in that triplet, we would insert new records and the old ones would be 
deleted at some point.  Deletion is a manual process, which allows us to engage 
with publishers before running deletions.
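
To make the matching described above concrete, here is a minimal sketch in 
Python.  It is only an illustration of the idea, not GBIF's actual indexing 
code: the function names, the two lookup dictionaries and the sample values 
are all assumptions made for the example.

```python
from typing import Dict, Optional, Tuple

Record = Dict[str, str]
Triplet = Tuple[str, ...]

# The three Darwin Core fields that make up the historical "triplet".
TRIPLET_FIELDS = ("institutionCode", "collectionCode", "catalogNumber")


def triplet_of(record: Record) -> Optional[Triplet]:
    """Return the triplet, or None if any of the three fields is missing."""
    values = tuple(record.get(field, "").strip() for field in TRIPLET_FIELDS)
    return values if all(values) else None


def match_existing(
    record: Record,
    by_occurrence_id: Dict[str, int],
    by_triplet: Dict[Triplet, int],
) -> Optional[int]:
    """Return the key of the indexed record this one would update.

    None means no match was found, so the record would be inserted as new.
    """
    occurrence_id = record.get("occurrenceID", "").strip()
    if occurrence_id and occurrence_id in by_occurrence_id:
        return by_occurrence_id[occurrence_id]
    triplet = triplet_of(record)
    if triplet is not None:
        return by_triplet.get(triplet)
    return None


# Hypothetical index state: one record, known only by its triplet.
by_occurrence_id: Dict[str, int] = {}
by_triplet: Dict[Triplet, int] = {("SANT", "SANT", "00001"): 1}

# The publisher adds occurrenceID and leaves the triplet untouched.
republished = {
    "occurrenceID": "hypothetical-id-1",
    "institutionCode": "SANT",
    "collectionCode": "SANT",
    "catalogNumber": "00001",
}
print(match_existing(republished, by_occurrence_id, by_triplet))

# The same record with a changed collectionCode no longer matches.
changed = dict(republished, collectionCode="OTHER")
print(match_existing(changed, by_occurrence_id, by_triplet))
```

In this sketch the first call finds the existing record through the triplet, 
so it would be updated; the second finds nothing, so a new record would be 
inserted and the old one left for later deletion.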

I notice that the dataset you linked to was published in 2007, before 
dwc:occurrenceID existed.  It is therefore using the dwc:institutionCode, 
dwc:collectionCode and dwc:catalogNumber identifier strategy.

Please note that Darwin Core recommends concatenating the 3 fields to create a 
dwc:occurrenceID.  Please be aware that this approach means that should someone 
choose to e.g. change the collection code, the occurrence record ID will also 
change, thus removing all linkability.  If such changes are expected, then 
minting unique ids for records using e.g. UUIDs or similar would be a more 
robust longer term solution, and in general we recommend targeting this.
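
The difference between the two approaches can be sketched in a few lines of 
Python.  This is an illustration only: the ":" separator, the function names 
and the sample codes are assumptions, not a format prescribed by Darwin Core 
or GBIF.

```python
import uuid


def triplet_occurrence_id(institution_code: str,
                          collection_code: str,
                          catalog_number: str) -> str:
    """Build an occurrenceID by concatenating the three triplet fields."""
    return f"{institution_code}:{collection_code}:{catalog_number}"


def mint_occurrence_id() -> str:
    """Mint an opaque identifier that does not depend on any record field."""
    return str(uuid.uuid4())


# A concatenated id changes as soon as one of its parts changes.
before = triplet_occurrence_id("SANT", "SANT", "00001")
after = triplet_occurrence_id("SANT", "SANT-Vascular", "00001")
print(before != after)

# A UUID is minted once and stored with the record, so later edits to the
# collection code leave it untouched.
record = {"occurrenceID": mint_occurrence_id(), "collectionCode": "SANT"}
stored_id = record["occurrenceID"]
record["collectionCode"] = "SANT-Vascular"
print(record["occurrenceID"] == stored_id)
```

The important part is that the minted id is stored alongside the record in 
the source database; generating a fresh UUID at every export would change the 
id each time and defeat the purpose.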

We do recommend people strive to provide occurrenceID, even on older data.  
This simplifies things going forward.

I hope this helps, but please feel free to ask me any questions around this.

This is off topic, but while you are reading please know that we expect 
stateOrProvince to be a filter on GBIF next week along with locality, protocol, 
license, organismID, publishingOrgKey (API only), crawlID (API only).  I know 
you have interest in this functionality.

Best wishes,
Tim



From: API-users <api-users-bounces at lists.gbif.org> on behalf of 
Herbario SANT <sant.herbarium at gmail.com>
Date: Sunday 28 August 2016 at 16:00
To: "api-users at lists.gbif.org" <api-users at lists.gbif.org>
Subject: Re: [API-users] What happens to previous data after dataset/crawl?

Hi

I take the opportunity to ask about the difference between two GBIF terms:

What is the difference between "occurrenceID" and "identifier"?  Both have 
the same value in this dataset:
http://www.gbif.org/occurrence/1291766512/verbatim
http://api.gbif.org/v1/occurrence/1291766512

I see "occurrenceID" well explained here:
http://gbif.blogspot.com.es/2014/04/ipt-v21.html
http://rs.tdwg.org/dwc/terms/#occurrenceID

But I can't find the explanation for "identifier", which I think some 
institutions have been incorrectly understanding as "occurrenceID".
For example:
http://www.gbif.org/occurrence/142907792/verbatim
http://api.gbif.org/v1/occurrence/142907792

There is an "identifier" in that occurrence, but no "occurrenceID".
1) What exactly is the meaning of that "identifier"?  Why is it not explained 
on the dwc terms page?
2) What happens if the data provider keeps all data UNCHANGED, but adds the 
"occurrenceID" which was missing?
    Would the next GBIF reindex keep the same number of records and add their 
occurrenceIDs? (perhaps looking at the triplet in that "identifier"?)
    Would it later be safe to change any fields in the dataset (even 
"identifier", "catalognumber", ...) if that data provider keeps those 
occurrenceIDs stable?

Thanks


On 27 August 2016 at 08:01, Roderic Page <Roderic.Page at glasgow.ac.uk> 
wrote:
Just wanted to check the consequences of the following dataset operation.

Say I have a dataset with 10 occurrences with occurrence ids 1-10. In my local 
database I now assign those 10 occurrences new identifiers a-j. If I create a 
new DwCA file for my data and crawl the new archive, my expectation is:

1. Old data with ids 1-10 is deleted from GBIF index
2. New data with ids a-j is indexed

So, the end result is that the dataset has 10 occurrences. I'm asking because 
I know that in the past some datasets have changed identifiers and this has 
resulted in records with old and new identifiers coexisting in the GBIF index, 
resulting in duplicated data.
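
The expectation above amounts to a set difference between the identifiers 
already indexed and those in the new archive.  A minimal sketch in Python, 
purely to illustrate the question (the function name and the returned 
structure are assumptions, not a description of how GBIF crawls work):

```python
from typing import Dict, Iterable, List


def plan_crawl(indexed_ids: Iterable[str],
               archive_ids: Iterable[str]) -> Dict[str, List[str]]:
    """Split identifiers into those to update, insert and delete."""
    indexed, incoming = set(indexed_ids), set(archive_ids)
    return {
        "update": sorted(indexed & incoming),  # present before and after
        "insert": sorted(incoming - indexed),  # new in this archive
        "delete": sorted(indexed - incoming),  # no longer published
    }


old_ids = [str(n) for n in range(1, 11)]  # ids 1-10
new_ids = list("abcdefghij")              # ids a-j

plan = plan_crawl(old_ids, new_ids)
print(len(plan["insert"]), len(plan["delete"]), len(plan["update"]))
```

If the "delete" step is skipped or postponed, the 10 old and 10 new records 
coexist, which is the duplication described above.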

Obviously it would be nice to have stable, unchanging identifiers for 
occurrences, but for the data set I'm working with the creators have changed 
their minds between versions of the data :(

Regards,

Rod

--
David García San León
Herbario SANT
Facultade de Farmacia - Laboratorio de Botánica
Universidade de Santiago de Compostela
15782 - Galicia (Spain)
http://www.usc.es/herbario
