Dear list,

For years I've used OpenRefine to do basic sanity checks and cleaning of CSV files before batch upload (either directly or with SAFBuilder). OpenRefine makes it very easy to trim whitespace and to facet on text values or custom patterns so you can eyeball outliers like invalid dates or ISSNs. You can even write Python in it (though it's Python 2 and quite cumbersome). This is much more powerful and methodical than using a spreadsheet application for the same task, but it still becomes tedious when you have dozens of metadata fields and hundreds or thousands of records.
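To make the kind of check mentioned above concrete, here is a minimal Python 3 sketch of three of them done in a script rather than by eye: trimming whitespace, validating an ISSN check digit, and testing whether a date looks like ISO 8601. The function names are hypothetical and chosen only for illustration. They are not taken from OpenRefine or any other tool, and the date check assumes only the YYYY, YYYY-MM, and YYYY-MM-DD forms matter.

```python
import re
from datetime import date


def normalize_whitespace(value: str) -> str:
    """Trim both ends and collapse internal runs of whitespace to one space."""
    return re.sub(r"\s+", " ", value).strip()


def is_valid_issn(value: str) -> bool:
    """Check the NNNN-NNNC shape and the ISSN modulus-11 check digit."""
    match = re.fullmatch(r"([0-9]{4})-([0-9]{3})([0-9Xx])", value)
    if match is None:
        return False
    digits = match.group(1) + match.group(2)
    # The first seven digits are weighted 8, 7, ..., 2.
    total = sum(int(d) * w for d, w in zip(digits, range(8, 1, -1)))
    check = (11 - total % 11) % 11
    expected = "X" if check == 10 else str(check)
    return match.group(3).upper() == expected


def is_iso_date(value: str) -> bool:
    """Accept YYYY, YYYY-MM, or YYYY-MM-DD that names a real calendar date."""
    if not re.fullmatch(r"[0-9]{4}(-[0-9]{2}(-[0-9]{2})?)?", value):
        return False
    parts = [int(p) for p in value.split("-")]
    # Pad a missing month or day with 1 so datetime.date can validate it.
    parts += [1] * (3 - len(parts))
    try:
        date(*parts)
    except ValueError:
        return False
    return True


if __name__ == "__main__":
    for issn in ("0378-5955", "0378-5954"):
        print(issn, is_valid_issn(issn))
    for d in ("2019-07-26", "2019-02-30", "26/07/2019"):
        print(d, is_iso_date(d))
```

Each function handles a single value, so the same checks can be looped over every row and column of a CSV instead of being run one facet at a time.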
To make a long story short, I've just written a metadata cleaning pipeline geared towards working with CSVs in the DSpace ecosystem. It is basically a series of checks and fixes applied one after another. The order is roughly:

1. Strip leading, trailing, and excessive whitespace
2. Strip newlines
3. Remove "unnecessary" Unicode characters like non-breaking spaces
4. Fix invalid multi-value separators like "Kenya|Ethiopia" (a single "|" where DSpace expects "||")
5. Drop duplicate metadata values
6. Validate subject terms against the AGROVOC REST API
7. Validate languages against ISO 639-2 or ISO 639-3
8. Validate ISSNs and ISBNs
9. Validate dates against ISO 8601 (and warn if the date is missing)

It is slightly geared towards our repository's workflow, but I think the implementation is simple and powerful enough that many of you could benefit from it. I will keep working to extend it. If you are interested in using or improving it, you can find the code on GitHub:

https://github.com/alanorth/csv-metadata-quality

Regards,

Alan Orth