Beautiful! Thanks for sharing, Alan!

I hope that this kind of validation/cleaning, or a more advanced warning
system, can get into the default uploader at some point, so that we can
collectively keep each other from shooting ourselves in the foot with
these uploads.

best,

Bram

Bram Luyten
250-B Lucius Gordon Drive, Suite 3A, West Henrietta, NY 14586
Gaston Geenslaan 14, 3001 Leuven, Belgium
atmire.com
<http://atmire.com/website/?q=services&utm_source=emailfooter&utm_medium=email&utm_campaign=braml>


On Wed, 31 Jul 2019 at 17:08, Alan Orth <[email protected]> wrote:

> Dear list,
>
> For years I've used OpenRefine to do basic sanity checks and cleaning of
> CSV files before batch upload (either directly or with SAFBuilder).
> OpenRefine makes it very easy to do things like trim whitespace, facet on
> text values or custom patterns to eyeball outliers like invalid dates or
> ISSNs, and you can even write Python (though it's Python 2 and quite
> cumbersome). This is much more powerful and methodical than using a
> spreadsheet application for the same task, but still becomes tedious when
> you have dozens of metadata fields and hundreds or thousands of records.
>
> To make a long story short, I've just written a metadata cleaning pipeline
> geared towards working with CSVs in the DSpace ecosystem. Its
> implementation is basically a series of checks and fixes applied as a
> pipeline. For example, the order is roughly:
>
> 1. Strip leading, trailing, and excessive whitespace
> 2. Strip newlines
> 3. Remove "unnecessary" Unicode characters like non-breaking spaces
> 4. Fix invalid multi-value separators like "Kenya|Ethiopia"
> 5. Drop duplicate metadata values
> 6. Validate subject terms against AGROVOC REST API
> 7. Validate languages against ISO 639-2 or ISO 639-3
> 8. Validate ISSNs and ISBNs
> 9. Validate dates against ISO 8601 (and warn if date missing)
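
To make a few of the steps above concrete, here is a minimal Python sketch of
how such checks might look. It is not the code from csv-metadata-quality: the
function names (clean_value, is_valid_issn, is_valid_date) are hypothetical,
and it assumes "||" as the multi-value separator, that newlines and
non-breaking spaces are replaced by a plain space, and that dates are written
as YYYY, YYYY-MM, or YYYY-MM-DD. The AGROVOC and language lookups are left out.

```python
import re
from datetime import datetime

SEPARATOR = "||"  # assumed multi-value separator for DSpace batch CSVs


def clean_value(value: str) -> str:
    """Roughly steps 1-5: normalise one CSV cell (illustrative only)."""
    # Step 3: turn non-breaking spaces into ordinary spaces (assumption).
    value = value.replace("\u00a0", " ")
    # Step 2: replace newlines and carriage returns with a space (assumption).
    value = re.sub(r"[\r\n]+", " ", value)
    # Step 4: a lone "|" is treated as a mistyped "||" separator.
    value = re.sub(r"(?<!\|)\|(?!\|)", SEPARATOR, value)
    # Steps 1 and 5: collapse and trim whitespace in each value,
    # then drop empty and duplicate values while keeping their order.
    parts = []
    for part in value.split(SEPARATOR):
        part = re.sub(r"\s{2,}", " ", part).strip()
        if part and part not in parts:
            parts.append(part)
    return SEPARATOR.join(parts)


def is_valid_issn(issn: str) -> bool:
    """Step 8 (ISSN only): check the NNNN-NNNC format and the check digit."""
    if not re.fullmatch(r"[0-9]{4}-[0-9]{3}[0-9X]", issn):
        return False
    digits = issn.replace("-", "")
    # Weights run from 8 down to 1; "X" stands for 10 in the last position.
    total = sum(
        (8 - i) * (10 if ch == "X" else int(ch)) for i, ch in enumerate(digits)
    )
    return total % 11 == 0


def is_valid_date(value: str) -> bool:
    """Step 9: accept YYYY, YYYY-MM, or YYYY-MM-DD with a real calendar date."""
    # strptime alone would accept "2019-7-31", so check the shape first.
    if not re.fullmatch(r"[0-9]{4}(-[0-9]{2}(-[0-9]{2})?)?", value):
        return False
    for fmt in ("%Y", "%Y-%m", "%Y-%m-%d"):
        try:
            datetime.strptime(value, fmt)
            return True
        except ValueError:
            continue
    return False


if __name__ == "__main__":
    print(clean_value("  Kenya|Ethiopia  "))
    print(is_valid_issn("0378-5955"))
    print(is_valid_date("2019-02-30"))
```

Running the fixes before the validators mirrors the ordering in the list:
a stray space or a single "|" would otherwise cause a valid ISSN or date to
be flagged.
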
>
> It is slightly geared towards our repository's workflow, but I think the
> implementation is simple and powerful enough that many of you could benefit
> from it. I will keep working to extend it. If you are interested in using
> or improving it you can find the code on GitHub:
>
> https://github.com/alanorth/csv-metadata-quality
>
> Regards,
> --
> Alan Orth
> [email protected]
> https://picturingjordan.com
> https://englishbulgaria.net
> https://mjanja.ch
> "In heaven all the interesting people are missing." ―Friedrich Nietzsche
>
> --
> All messages to this mailing list should adhere to the DuraSpace Code of
> Conduct: https://duraspace.org/about/policies/code-of-conduct/
> ---
> You received this message because you are subscribed to the Google Groups
> "DSpace Technical Support" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to [email protected].
> To view this discussion on the web visit
> https://groups.google.com/d/msgid/dspace-tech/CAKKdN4VD5x%3DWFR8tMtPKbfVoYsYWPHDhKtYwSBCNp3JnHMk7Tw%40mail.gmail.com
>

