I found some oddities and I am not exactly sure where to go next.

We are seeing the following when processing meta.xml in Darwin Core archives
produced by IPT (and other servers):

Schema validation failed, continuing unvalidated
XMLSyntaxError: Element '{http://rs.tdwg.org/dwc/text/}coreid': This element is 
not expected. Expected is ( {http://rs.tdwg.org/dwc/text/}coreId )


It seems that most consumers do not actually validate meta.xml against the
schema, and that producers are generating files that do not comply with it.

Most of the Darwin Core archives I have manually inspected and tried to 
validate contain meta.xml with lowercase "i" in coreid despite the Standard 
indicating capital "I" in coreId.
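
For anyone who wants to check their own archives without running a full schema
validation, here is a minimal Python sketch using only the standard library.
It assumes meta.xml uses the http://rs.tdwg.org/dwc/text/ namespace shown in
the error above. The function name find_miscased_coreid and the sample
document are made up for illustration and are not taken from IPT or the
Validator.

```python
import xml.etree.ElementTree as ET

DWC_TEXT_NS = "http://rs.tdwg.org/dwc/text/"  # namespace from the error message


def find_miscased_coreid(meta_xml: str) -> list[str]:
    """Return local names that match 'coreId' only when case is ignored."""
    root = ET.fromstring(meta_xml)
    prefix = "{" + DWC_TEXT_NS + "}"
    bad = []
    for elem in root.iter():
        # Skip anything outside the Darwin Core text namespace.
        if not isinstance(elem.tag, str) or not elem.tag.startswith(prefix):
            continue
        local = elem.tag[len(prefix):]
        if local.lower() == "coreid" and local != "coreId":
            bad.append(local)
    return bad


# Hypothetical, trimmed-down meta.xml for illustration only.
SAMPLE = """<archive xmlns="http://rs.tdwg.org/dwc/text/">
  <core rowType="http://rs.tdwg.org/dwc/terms/Occurrence">
    <files><location>occurrence.txt</location></files>
    <id index="0"/>
  </core>
  <extension rowType="http://rs.gbif.org/terms/1.0/Multimedia">
    <files><location>multimedia.txt</location></files>
    <coreid index="0"/>
  </extension>
</archive>"""

print(find_miscased_coreid(SAMPLE))
```

An empty list means no miscased element was found; it does not mean the file
is schema-valid in any other respect.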


I poked at the GBIF Darwin Core Validator 3 code repo and found this:

schema.meta=https://raw.githubusercontent.com/tdwg/dwc/master/standard/documents/text/tdwg_dwc_text.xsd,http://rs.tdwg.org/dwc/text/tdwg_dwc_text.xsd


The first link returns a 404; the second leads to an XSD that contains the
correct coreId.  So maybe the Validator is not being "strict" about validation
against the schema?


Digging around, I found some "historic" evidence of changes that occurred when
TDWG migrated from Google Code to GitHub and moved their services around, and
that GBIF might have "grabbed a copy" from somewhere while waiting for the
public site to become available again.


The current "published" schema contains coreId:

http://rs.tdwg.org/dwc/text/tdwg_dwc_text.xsd


The current "master" branch contains coreId:

https://github.com/tdwg/dwc/blob/master/docs/text/tdwg_dwc_text.xsd


The "gh-pages" branch (one would think this is being used to generate the 
website, but apparently not) includes the lowercase coreid:

https://github.com/tdwg/dwc/blob/gh-pages/text/tdwg_dwc_text.xsd
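
To compare schema copies without reading them by eye, something like the
following Python sketch could report which spelling a given XSD declares. It
takes the schema text as a string, so each copy would have to be saved locally
first (the GitHub "blob" pages are HTML, so the raw file is needed). The
function name coreid_spellings and the two inline schema fragments are
hypothetical stand-ins, not the real tdwg_dwc_text.xsd.

```python
import xml.etree.ElementTree as ET

XSD_NS = "http://www.w3.org/2001/XMLSchema"


def coreid_spellings(xsd_text: str) -> set[str]:
    """Collect xs:element names that equal 'coreid' when case is ignored."""
    root = ET.fromstring(xsd_text)
    names = set()
    for elem in root.iter("{" + XSD_NS + "}element"):
        name = elem.get("name", "")
        if name.lower() == "coreid":
            names.add(name)
    return names


# Hypothetical fragments standing in for two copies of the schema.
PUBLISHED_LIKE = """<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:complexType name="extensionType">
    <xs:sequence>
      <xs:element name="coreId" type="xs:string"/>
    </xs:sequence>
  </xs:complexType>
</xs:schema>"""
GH_PAGES_LIKE = PUBLISHED_LIKE.replace("coreId", "coreid")

print(coreid_spellings(PUBLISHED_LIKE))
print(coreid_spellings(GH_PAGES_LIKE))
```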




So I am wondering why the GBIF Validator isn't noticing this. And why is IPT 
generating meta.xml that does not agree with the schema?


Reference:

http://rs.tdwg.org/dwc/text/

2.1.2 Elements

Element: <core>
Description: An <archive> must contain exactly one <core> element,
representing the data entity (the actual file and its column header mappings
to Darwin Core terms) upon which records are based. If extensions are being
used, each record in the core data must have a unique identifier. The field
for this identifier must be specified in an explicit <id> field in order to
associate extension records with the core record.

Element: <extension>
Description: An <archive> may define zero or more <extension> elements, each
representing an individual extension entity directly related to the core. In
addition to the general file attributes described below, every extension
entity must have an explicit <coreId> field to relate the extension record to
a row in the core entity. The extension itself does not have to have a unique
ID field and many rows can point to the same core record.
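
The rules quoted above can be sketched as a small structural check in Python.
This is only my reading of the text (exactly one <core>, an <id> in the core
when extensions exist, a <coreId> in every extension), not a replacement for
validating against the XSD. The function name check_archive_structure and the
sample document are made up for illustration.

```python
import xml.etree.ElementTree as ET

NS = {"dwc": "http://rs.tdwg.org/dwc/text/"}


def check_archive_structure(meta_xml: str) -> list[str]:
    """Return human-readable problems found in a meta.xml string."""
    root = ET.fromstring(meta_xml)  # assumed to be the <archive> element
    problems = []
    cores = root.findall("dwc:core", NS)
    extensions = root.findall("dwc:extension", NS)
    if len(cores) != 1:
        problems.append(f"expected exactly one <core>, found {len(cores)}")
    if extensions:
        # The core needs an explicit <id> only when extensions are used.
        for core in cores:
            if core.find("dwc:id", NS) is None:
                problems.append("<core> has no <id> but extensions are present")
    for i, ext in enumerate(extensions, start=1):
        # Element names are case-sensitive, so <coreid> does not count.
        if ext.find("dwc:coreId", NS) is None:
            problems.append(f"extension {i} has no <coreId>")
    return problems


# Hypothetical meta.xml using the lowercase spelling seen in the wild.
SAMPLE_META = """<archive xmlns="http://rs.tdwg.org/dwc/text/">
  <core><id index="0"/></core>
  <extension><coreid index="0"/></extension>
</archive>"""

print(check_archive_structure(SAMPLE_META))
```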



Thanks!

Dan Stoner
iDigBio / ACIS Laboratory
University of Florida
_______________________________________________
IPT mailing list
IPT@lists.gbif.org
https://lists.gbif.org/mailman/listinfo/ipt
