Thanks very much, this is helpful feedback.

On a related note, Harvard-IQSS created a platform called Dataverse (https://dataverse.org/about) around 2007, and one interesting element is that they published a method for hashing datasets. The hash becomes part of the citation and can be used to verify that a downloaded copy is the same data that was cited. Passing it along in case it is of interest to the group:
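The Dataverse method (the Universal Numerical Fingerprint, UNF) normalizes data values before hashing so the fingerprint survives format changes; as a much simpler sketch of the same general idea, a plain cryptographic digest over an archive's raw bytes can confirm that two downloads are byte-identical. The function below is only illustrative and is not part of Dataverse or the IPT:

```python
import hashlib

def file_fingerprint(path, algorithm="sha256", chunk_size=1 << 16):
    """Compute a hex digest of a file's raw bytes, streaming so that
    large DwC archives (100 MB+) are never loaded into memory at once."""
    h = hashlib.new(algorithm)
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()
```

Publishing such a digest alongside a citation lets anyone re-downloading the archive check that it matches, although unlike UNF a raw-byte digest changes if the same data is merely re-serialized.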
http://best-practices.dataverse.org/data-citation/

Best regards,

Jonathan A. Kennedy
Director of Biodiversity Informatics
Harvard University Herbaria, Department of Organismic and Evolutionary Biology

From: Daniel Noesgaard <[email protected]>
Date: Tuesday, February 19, 2019 at 3:22 AM
To: Quentin Groom <[email protected]>, Tim Robertson <[email protected]>
Cc: "Kennedy, Jonathan" <[email protected]>, "[email protected] list" <[email protected]>, helpdesk <[email protected]>
Subject: Re: [IPT] Daily feeds and archive history

I might also add that every download from GBIF.org, be it a single dataset or an aggregate, is archived and given a unique, persistent DOI for citation. Citations of a download count toward all the datasets that contributed to that download.

--
Daniel Noesgaard
Science Communications Coordinator
GBIF | Global Biodiversity Information Facility - Secretariat
Universitetsparken 15
DK-2100 Copenhagen, Denmark
E: [email protected]
W: www.gbif.org
T: +45 35 32 08 74

From: Quentin Groom <[email protected]>
Date: Tuesday, 19 February 2019 at 08.38
To: Tim Robertson <[email protected]>
Cc: "Kennedy, Jonathan" <[email protected]>, "[email protected] list" <[email protected]>, helpdesk <[email protected]>, Daniel Noesgaard <[email protected]>
Subject: Re: [IPT] Daily feeds and archive history

While it would be great to have versioned datasets, I generally create a snapshot of the data used in a paper and archive it in Zenodo. This gives complete reproducibility without putting extra demands on the data providers. I do, however, need to cite both the source and the snapshot.

Regards,
Quentin

On Mon, 18 Feb 2019, 17:45 Tim Robertson <[email protected]> wrote:

Hi Jonathan (adding the GBIF helpdesk to the CC),

This is just a quick answer which I expect will result in follow-up questions.

In terms of citation, we use a DOI to identify the concept of a dataset, not a specific version, e.g.
https://doi.org/10.15468/cup0nk

If you start deleting copies of data (e.g. via a background housekeeping task), what will break are the links to the downloads on the IPT pages, such as https://ipt.huh.harvard.edu/ipt/resource?r=huh_all_records&v=1.3. This may or may not be a problem for you.

I think others may already have contacted you with suggestions for improving the dataset titles being used, but if not, I would suggest considering correctly formatted titles, as they are used in many places (https://www.gbif.org/dataset/4e4f97d2-4670-4b24-b982-261e0a450faf).

I hope this helps as a start,
Tim

From: IPT <[email protected]> on behalf of "Kennedy, Jonathan" <[email protected]>
Date: Monday, 18 February 2019 at 18.31
To: "[email protected]" <[email protected]>
Subject: [IPT] Daily feeds and archive history

Hi All,

I am finishing an upgrade to the Harvard University Herbaria IPT instance and have configured our feeds for daily auto-publish. The HUH has invested in a mass-digitization workflow and we are currently creating ~20,000 new vascular records per month (with minimal data), so we do have new records on a daily basis. However, our DwC archives are fairly large (100 MB+), so we cannot keep the full daily archive history. I am looking for guidance on how GBIF dataset citation will work if we do not preserve each daily archive.
It seems problematic if a version of our dataset is used and cited but cannot be reconstructed.

Best regards,

Jonathan A. Kennedy
Director of Biodiversity Informatics
Harvard University Herbaria, Department of Organismic and Evolutionary Biology

_______________________________________________
IPT mailing list
[email protected]
https://lists.gbif.org/mailman/listinfo/ipt
