Re: [IPT] [EXTERNAL] Re: How does one upload large datasets to GBIF?

Matthew Blissett Tue, 07 Jul 2020 10:19:46 -0700

Hi Annie,

With additional RAM allocated, the IPT can publish proportionally larger datasets. However, this can be inefficient (or expensive in terms of RAM), especially if the dataset has extensions.



To construct the DWCA outside an IPT you will need:

- data files. It's a good idea to check them for common errors -- incorrect number of columns, duplicate occurrenceIds and so on (the IPT does several checks like this).

- a meta.xml data description file, linking columns to Darwin Core terms. This can be written by hand, using various programming languages, or (often easiest if the process isn't to be repeated) by using an IPT to make a suitable mapping and extracting the resulting file.

- an eml.xml metadata file, describing the dataset. The same applies here -- the IPT is useful for providing a UI to write this metadata, especially if all 8 are similar.
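
As a minimal sketch of the checks on the data files, assuming a tab-delimited file with a header row and an identifier column named occurrenceID (the function name and the column name are illustrative, and this is not how the IPT itself does it):

```python
import csv
import io


def check_occurrence_file(handle, id_column="occurrenceID"):
    """Return (bad_rows, duplicate_ids) for a tab-delimited file with a header row."""
    # QUOTE_NONE treats quote characters as ordinary data, so tabs alone split columns.
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    header = next(reader)
    id_index = header.index(id_column)

    bad_rows = []      # (line number, column count) for rows of the wrong width
    seen = set()       # every identifier read so far
    duplicates = set() # identifiers that occur more than once

    for line_number, row in enumerate(reader, start=2):
        if len(row) != len(header):
            bad_rows.append((line_number, len(row)))
            continue
        identifier = row[id_index]
        if identifier in seen:
            duplicates.add(identifier)
        seen.add(identifier)

    return bad_rows, sorted(duplicates)


if __name__ == "__main__":
    sample = "occurrenceID\tscientificName\na1\tPuma concolor\na1\tPuma concolor\n"
    print(check_occurrence_file(io.StringIO(sample)))
```

For a real file, open it with open(path, encoding="utf-8", newline="") and pass the handle in. The sketch keeps every identifier in memory, so for tens of millions of records a sort-based or database-backed check may be a better fit.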



Once the DWCA exists, it should be copied to a webserver.
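
A minimal sketch of the packaging step that comes before that copy, assuming the data file, meta.xml and eml.xml already exist on disk. The function name and paths are illustrative and not part of any GBIF tool; the data file keeps its own name so that it still matches the location given in meta.xml.

```python
import zipfile
from pathlib import Path


def build_dwca(archive_path, data_file, meta_xml, eml_xml):
    """Zip a data file, meta.xml and eml.xml into a single archive."""
    members = {
        Path(data_file).name: data_file,  # same file name that meta.xml refers to
        "meta.xml": meta_xml,
        "eml.xml": eml_xml,
    }
    # All three files are stored at the top level of the zip, without directories.
    with zipfile.ZipFile(archive_path, "w", compression=zipfile.ZIP_DEFLATED) as archive:
        for name_in_archive, source in members.items():
            archive.write(source, arcname=name_in_archive)
    return sorted(members)
```

A dataset with extensions would add one more member per extension file in the same way.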

Note that using the registry API is not strictly necessary, and a publisher with a small, unchanging number of datasets outside the IPT need not use it. They can simply give the helpdesk a URL for each dataset's DWCA file, and update the DWCA files at those URLs as necessary.

Using the API is useful for adding further datasets, making changes (e.g. changing the URL) to the existing 8, or prompting GBIF to reprocess a dataset. To use the API the technical team should create a suitable username (e.g. "usgs" or "bison") on both gbif.org and gbif-uat.org. The latter is our test system. They should then contact [email protected] to ask for permission for that account to make changes under the USGS <https://www.gbif.org/publisher/c3ad790a-d426-4ac1-8e32-da61f81f0117> publisher, or whichever publishers are appropriate. This will only be on the test system at first.

It's then possible to register a new dataset under that publisher, following the example here: https://github.com/gbif/registry/tree/master/registry-examples/src/test/scripts and see the result.
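
As a rough sketch of what such a registration could look like from Python, the code below only prepares the requests and does not send anything. The base URL, the /dataset path and the JSON field names are assumptions on my part and should be checked against the example scripts linked above; the helper names are made up for illustration.

```python
import base64
import json
import urllib.request

# Assumption: base URL of the registry API on the test system.
API_BASE = "https://api.gbif-uat.org/v1"


def build_post(path, payload, username, password):
    """Prepare (but do not send) a JSON POST request with HTTP basic auth."""
    token = base64.b64encode(f"{username}:{password}".encode("utf-8")).decode("ascii")
    return urllib.request.Request(
        API_BASE + path,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": "Basic " + token,
        },
        method="POST",
    )


def dataset_payload(organization_key, installation_key, title):
    # Assumed field names; compare with the JSON used in the example scripts.
    return {
        "publishingOrganizationKey": organization_key,
        "installationKey": installation_key,
        "type": "OCCURRENCE",
        "title": title,
    }


def endpoint_payload(archive_url):
    # Assumed shape of the endpoint that points a dataset at its DWCA file.
    return {"type": "DWC_ARCHIVE", "url": archive_url}
```

Passing the prepared request to urllib.request.urlopen would send it. If the API behaves as I assume, the response contains the key of the new dataset, and the endpoint payload is then posted against that dataset.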

For general questions on this, the GBIF API mailing list is probably most appropriate: https://lists.gbif.org/mailman/listinfo/api-users

If you have problems or errors with a specific dataset, [email protected] will be the best contact. (They also read both mailing lists.)



Cheers


Matt


On 07/07/2020 17:48, Simpson, Annie wrote:

Thank you, Laura, for your replies.
The datasets have been exported from databases and cleaned. They are generally UTF-8 tab-delimited files. So it seems that the GBIF Registry API would be the correct solution.
We currently have 8 of these large datasets, only 2 of which would not be updated in the future. Do you have names of GBIF Product Team Members whom my technical team should contact to begin this process? Is there "how to" documentation you can point me to that they should read first?
Annie

------------------------------------------------------------------------
*From:* Laura Anne Russell <[email protected]>
*Sent:* Tuesday, July 7, 2020 11:17 AM
*To:* Simpson, Annie <[email protected]>; [email protected] <[email protected]>
*Subject:* [EXTERNAL] Re: [IPT] How does one upload large datasets to GBIF?
I could also mention that it is possible to script the creation of the Darwin Core Archives and then use the GBIF Registry API for the connections with GBIF. Symbiota, PlutoF and some others are successfully doing this. It does require some initial coordination with our Product Team on how to set up and coordinate the registration process and potentially with our Informatics Team.
Best,

Laura

Laura Anne Russell

Programme Officer for Participation and Engagement

Global Biodiversity Information Facility (GBIF) Secretariat

[email protected] (email)

laura.anne.russell (Skype)

@pagodarose (Twitter)

#CiteTheDOI @GBIF

https://www.gbif.org/

+45 35 33 35 51 (office, direct line)

GBIF

Universitetsparken 15

DK-2100 Copenhagen Ø

Denmark
*From: *IPT <[email protected]> on behalf of "Simpson, Annie" <[email protected]>
*Date: *Tuesday, 7 July 2020 at 16.48
*To: *"[email protected]" <[email protected]>
*Subject: *[IPT] How does one upload large datasets to GBIF?

Colleagues:
What is the easiest or most popular way to send large datasets to GBIF, ones that are too large for the IPT software (I think that is more than 100MB zipped, 10+ million records)? Does one modify their IPT instance? How? Or is there another process that is preferred?
We currently have IPT Version 2.3.6-r3985b6a installed and plan to upgrade to 2.4.0 soon.
A technical answer is what I seek (on behalf of our technical team).
Again my apologies if the answer to my question is easily found and I'm just not finding it.
Annie Simpson, BISON product owner

(she/her/hers)

BioFoundational Data Team

Science Analytics & Synthesis Program

U.S. Geological Survey

12201 Sunrise Valley Dr. Mailstop 302

Reston VA   20192

[email protected]

+1 703-648-4281

https://orcid.org/0000-0001-8338-5134

https://bison.usgs.gov



_______________________________________________
IPT mailing list
[email protected]
https://lists.gbif.org/mailman/listinfo/ipt
