Dear list,

For years I've used OpenRefine to do basic sanity checks and cleaning of
CSV files before batch upload (either directly or with SAFBuilder).
OpenRefine makes it very easy to trim whitespace and to facet on text
values or custom patterns so you can eyeball outliers like invalid dates or
ISSNs. You can even write Python, though it's Python 2 and quite
cumbersome. This is much more powerful and methodical than using a
spreadsheet application for the same task, but it still becomes tedious when
you have dozens of metadata fields and hundreds or thousands of records.

To make a long story short, I've just written a metadata cleaning tool
geared towards working with CSVs in the DSpace ecosystem. It is basically
a series of checks and fixes applied one after another as a pipeline. The
order is roughly:

1. Strip leading, trailing, and excessive whitespace
2. Strip newlines
3. Remove "unnecessary" Unicode characters like non-breaking spaces
4. Fix invalid multi-value separators like "Kenya|Ethiopia" (DSpace
   expects "||" between values)
5. Drop duplicate metadata values
6. Validate subject terms against the AGROVOC REST API
7. Validate languages against ISO 639-2 or ISO 639-3
8. Validate ISSNs and ISBNs
9. Validate dates against ISO 8601 (and warn if date missing)
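
To give a feel for the idea, here is a minimal sketch in plain Python 3 of
how a few of these steps could be chained. It is not the code from the
repository: every function name is made up for illustration, it assumes
"||" is the multi-value separator, it only accepts dates written as YYYY,
YYYY-MM or YYYY-MM-DD, and it skips the steps that need a network call or a
language list (AGROVOC, ISO 639). It also runs the Unicode fix before the
whitespace fix so that replaced non-breaking spaces get collapsed too.

```python
import re
from datetime import datetime

SEPARATOR = "||"  # assumed multi-value separator for DSpace CSV cells


def fix_unicode(value: str) -> str:
    """Step 3: replace non-breaking spaces, drop zero-width spaces."""
    return value.replace("\u00a0", " ").replace("\u200b", "")


def fix_whitespace(value: str) -> str:
    """Steps 1-2: turn newlines into spaces, collapse runs of spaces, trim."""
    value = value.replace("\r", " ").replace("\n", " ")
    return re.sub(r" {2,}", " ", value).strip()


def fix_separators(value: str) -> str:
    """Step 4: turn a lone "|" into "||", leaving existing "||" alone."""
    return re.sub(r"(?<!\|)\|(?!\|)", SEPARATOR, value)


def drop_duplicates(value: str) -> str:
    """Step 5: trim each value and keep only its first occurrence."""
    parts = [part.strip() for part in value.split(SEPARATOR)]
    return SEPARATOR.join(dict.fromkeys(parts))


def is_valid_issn(issn: str) -> bool:
    """Step 8 (ISSN only): check the NNNN-NNNC shape and the check digit."""
    if not re.fullmatch(r"\d{4}-\d{3}[\dX]", issn):
        return False
    digits = issn.replace("-", "")
    # Weights 8 down to 2 for the first seven digits, then modulo 11.
    total = sum(int(d) * w for d, w in zip(digits[:7], range(8, 1, -1)))
    check = (11 - total % 11) % 11
    return digits[7] == ("X" if check == 10 else str(check))


def is_valid_date(date: str) -> bool:
    """Step 9: accept only YYYY, YYYY-MM or YYYY-MM-DD with real dates."""
    if not re.fullmatch(r"\d{4}(-\d{2}){0,2}", date):
        return False
    fmt = ("%Y", "%Y-%m", "%Y-%m-%d")[date.count("-")]
    try:
        datetime.strptime(date, fmt)
    except ValueError:
        return False
    return True


# The fixes are applied in order; each one takes and returns a cell value.
FIXES = [fix_unicode, fix_whitespace, fix_separators, drop_duplicates]


def clean(value: str) -> str:
    """Run one CSV cell through every fix in the pipeline."""
    for fix in FIXES:
        value = fix(value)
    return value


print(clean("  Kenya|Ethiopia||Kenya \n"))  # Kenya||Ethiopia
print(is_valid_issn("0378-5955"))           # True
print(is_valid_date("2019-02-30"))          # False
```

Keeping each fix as a small function that takes and returns a string makes
it easy to reorder steps or add new ones, and the validators only report
problems instead of changing the data.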

It is slightly geared towards our repository's workflow, but I think the
implementation is simple and powerful enough that many of you could benefit
from it. I will keep working to extend it. If you are interested in using
or improving it you can find the code on GitHub:

https://github.com/alanorth/csv-metadata-quality

Regards,
-- 
Alan Orth
alan.o...@gmail.com
https://picturingjordan.com
https://englishbulgaria.net
https://mjanja.ch
"In heaven all the interesting people are missing." ―Friedrich Nietzsche
