Ben,

One last thing to add. It's also possible that the external site provides other metadata formats that you can harvest via OAI-PMH.
So, even though their "oai_dc" format may be invalid, there may be another format you can use.

To determine if other metadata formats are available for harvesting, you can query the OAI-PMH interface using the "ListMetadataFormats" command described at:
http://www.openarchives.org/OAI/openarchivesprotocol.html#ListMetadataFormats

For example, here's a list of all the metadata formats that are available for harvesting from our demo.dspace.org server:
http://demo.dspace.org/oai/request?verb=ListMetadataFormats

- Tim

On 6/22/2012 10:18 AM, Tim Donohue wrote:
> Hi Ben,
>
> It sounds like you are trying to run an OAI-PMH Harvest of another site
> (in this case it looks like an EPrints site) from the XMLUI interface.
>
> It looks like the main issue here is that the external site is giving
> you *invalid* "oai_dc" metadata. As the OAI-PMH protocol states,
> "oai_dc" is supposed to just be metadata of the format "dc.[element]":
> http://www.openarchives.org/OAI/openarchivesprotocol.html
>
> However, in this situation, there's a "dc.identifier.uri" field which is
> Qualified Dublin Core (QDC) and not a valid oai_dc metadata field.
>
> This field is misunderstood by the DSpace OAI-PMH harvester, as the
> harvester expects all fields to be valid oai_dc metadata.
>
> So, unfortunately, the main issue here is that the external site you are
> harvesting is returning invalid metadata.
>
> The only way I can think of to "hack" a fix on the DSpace side of things
> would be to modify the crosswalk that DSpace is using to transform the
> "oai_dc" metadata into its internal Qualified Dublin Core schema. The
> crosswalk DSpace uses to perform this task is:
> org.dspace.content.crosswalk.OAIDCIngestionCrosswalk
>
> It is configured by default in your dspace.cfg as the crosswalk to use
> whenever DSpace encounters "dc:" namespaced fields (which are what you
> see in your "oai_dc" metadata output below).
> That configuration is in this area of your dspace.cfg:
> https://github.com/DSpace/DSpace/blob/master/dspace/config/dspace.cfg#L484
>
> Here are a few options I can think of:
>
> * You could create a *custom* crosswalk based on the
> "OAIDCIngestionCrosswalk" that properly parses out this
> "dc.identifier.uri" field and maps it to the same field in DSpace. You'd
> want to configure this modified crosswalk as being the one used for "dc"
> metadata (see link above).
>
> * OR, it *might* be possible to just configure DSpace's QDCCrosswalk
> (which can crosswalk Qualified Dublin Core) as the "dc:" metadata
> crosswalk. You'd probably only want to do this temporarily & you'd want
> to test this on a Test/Development Server (as I've *never* tried this
> and am not sure what would happen, so it may error out). To do that,
> you'd change the "dc" crosswalk config to point at the QDCCrosswalk
> class, e.g.
>
> plugin.named.org.dspace.content.crosswalk.IngestionCrosswalk = \
>     ...
>     org.dspace.content.crosswalk.QDCCrosswalk = dc, \
>     ...
>
> Good luck!
>
> - Tim
>
>
> On 6/22/2012 12:47 AM, Ben Ryan wrote:
>> Hi,
>> I have attempted to harvest from an OAI feed and am having some
>> problems processing the dc.identifier.uri field.
>> An example record from the feed is:
>> <record>
>> <header>
>> <identifier>oai:generic.eprints.org:9</identifier>
>> <identifier>http://humbox.ac.uk/9/</identifier>
>> <datestamp>2012-06-11T18:48:56Z</datestamp>
>> <setSpec>74797065733D7265736F75726365</setSpec></header>
>> <metadata>
>> <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
>> xmlns:dc="http://purl.org/dc/elements/1.1/"
>> xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/oai_dc/
>> http://www.openarchives.org/OAI/2.0/oai_dc.xsd"
>> xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
>> <dc:title>Using EEBO to compare the quarto and Folio editions of
>> Shakespeare's Henry V</dc:title>
>> <dc:identifier.uri>http://humbox.ac.uk/id/eprint/9</dc:identifier.uri>
>>
>> <dc:creator>University, Matthew Steggle, Sheffield
>> Hallam</dc:creator>
>> <dc:description>As EEBO has images of every book printed in England
>> before 1700, it offers students studying Shakespeare the opportunity to
>> look at both the quarto and Folio editions of his plays. By using EEBO
>> to look at different editions of the same play we can start to think
>> about the decisions made by editors when confronted with this dilemma of
>> choice. Which version is best? We can also think about why these
>> differences occur.</dc:description>
>> <dc:date>2005</dc:date>
>> <dc:type>Resource</dc:type>
>> <dc:type>NonPeerReviewed</dc:type>
>> <dc:format>application/msword</dc:format>
>>
>> <dc:identifier>http://humbox.ac.uk/9/2/EEBO_Quarto___Folio_of_Henry_V.doc</dc:identifier>
>>
>> <dc:identifier>Using EEBO to compare the quarto and Folio editions of
>> Shakespeare's Henry V</dc:identifier>
>> <dc:relation>http://humbox.ac.uk/9/</dc:relation>
>>
>> <dc:rights>Creative Commons Attribution Non-commercial Share Alike
>> <http://creativecommons.org/licenses/by-nc-sa/2.5/></dc:rights></oai_dc:dc></metadata></record>
>>
>> The dc.identifier.uri field appears in the record.
>> When I view the item in the full view it shows the field as
>> dc.identifier.uri http://humbox.ac.uk/id/eprint/9
>> However when I view the METS metadata (using
>> http://localhost:8080/xmlui/metadata/handle/123456789/4216/mets.xml) it
>> shows the field as
>> <dim:field element="identifier.uri" mdschema="dc">
>> http://humbox.ac.uk/id/eprint/233
>> </dim:field>
>> In the database the metadata field is recorded in the metadatavalue table
>> with a metadata_field_id of 72 and the entry in the
>> metadatafieldregistry table shows the element name as identifier.uri as
>> the field is unknown and I currently have harvester.unknownfield set
>> to add.
>> Can anybody point me to where I look to see why DSpace is not
>> recognising the field (is it because of pattern matching for handles)?
>> Regards,
>> Ben

_______________________________________________
DSpace-tech mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/dspace-tech
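
For anyone wanting to script the ListMetadataFormats check described at the top of this thread, here is a rough Python sketch using only the standard library. The function names (build_request_url, parse_metadata_prefixes, list_metadata_formats) are made up for illustration and are not part of DSpace or any OAI-PMH toolkit. It assumes the server returns a standard OAI-PMH 2.0 response in the http://www.openarchives.org/OAI/2.0/ namespace.

```python
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET
from typing import List, Union

# Namespace of OAI-PMH 2.0 responses.
OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}


def build_request_url(base_url: str) -> str:
    """Append the ListMetadataFormats verb to an OAI-PMH base URL."""
    query = urllib.parse.urlencode({"verb": "ListMetadataFormats"})
    return base_url + "?" + query


def parse_metadata_prefixes(response_xml: Union[str, bytes]) -> List[str]:
    """Return every metadataPrefix listed in a ListMetadataFormats response."""
    root = ET.fromstring(response_xml)
    path = ".//oai:metadataFormat/oai:metadataPrefix"
    return [el.text.strip() for el in root.findall(path, OAI_NS) if el.text]


def list_metadata_formats(base_url: str, timeout: float = 30.0) -> List[str]:
    """Fetch and parse the list of formats (hypothetical helper, needs network)."""
    with urllib.request.urlopen(build_request_url(base_url), timeout=timeout) as resp:
        return parse_metadata_prefixes(resp.read())


# Example call (not run here, since it needs network access):
# list_metadata_formats("http://demo.dspace.org/oai/request")
```

If the external site lists a prefix other than "oai_dc", that prefix may be worth trying as the harvest format instead, provided DSpace has a crosswalk configured for it.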


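To spot this kind of problem before harvesting, one option is to scan a record for "dc:" elements that are not among the fifteen simple Dublin Core elements that oai_dc allows. The sketch below is only an illustration of that check: the function name find_non_simple_dc is hypothetical, and nothing here reflects how the DSpace harvester itself validates records.

```python
import xml.etree.ElementTree as ET
from typing import List, Union

# Namespace bound to the "dc:" prefix in oai_dc records.
DC_NS = "http://purl.org/dc/elements/1.1/"

# The fifteen unqualified Dublin Core elements permitted in oai_dc.
SIMPLE_DC_ELEMENTS = {
    "title", "creator", "subject", "description", "publisher",
    "contributor", "date", "type", "format", "identifier",
    "source", "language", "relation", "coverage", "rights",
}


def find_non_simple_dc(record_xml: Union[str, bytes]) -> List[str]:
    """Return local names of dc: elements that are not simple Dublin Core."""
    root = ET.fromstring(record_xml)
    prefix = "{" + DC_NS + "}"
    unexpected: List[str] = []
    for el in root.iter():
        # ElementTree stores namespaced tags as "{namespace}localname".
        if isinstance(el.tag, str) and el.tag.startswith(prefix):
            local = el.tag[len(prefix):]
            if local not in SIMPLE_DC_ELEMENTS and local not in unexpected:
                unexpected.append(local)
    return unexpected
```

Run against a record like the one quoted in this thread, the check would be expected to flag "identifier.uri", since that name is qualified rather than one of the simple elements.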