Re: [CODE4LIB] Data Lifecycle Tracking & Documentation Tools

Joe Hourcle Fri, 13 Mar 2015 09:11:14 -0700

On Wed, 11 Mar 2015, davesgonechina wrote:

Hi John,


Good question - we're taking in XLS, CSV, JSON, XML, and on a bad day PDF
of varying file sizes, each requiring different transformation and audit
strategies, on both regular and irregular schedules. New batches often
feature schema changes requiring modification to ingest procedures, which
we're trying to automate as much as possible but which obviously requires a human
chaperone.
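
As an illustration of the kind of check that could flag a schema change
before the human chaperone gets involved, here is a minimal sketch in
Python. It only compares the header row of an incoming CSV batch against
a list of expected columns. The column names, the sample batch and the
function name schema_drift are all hypothetical, made up for this
example, and not taken from any real ingest procedure.

```python
import csv
import io

# Hypothetical list of columns the ingest procedure currently expects.
EXPECTED_COLUMNS = ["isbn", "title", "publisher"]


def schema_drift(csv_text, expected):
    """Compare a CSV batch's header row against the expected columns."""
    header = next(csv.reader(io.StringIO(csv_text)), [])
    return {
        "missing": [col for col in expected if col not in header],
        "unexpected": [col for col in header if col not in expected],
    }


# Made-up batch in which 'publisher' has been renamed to 'imprint'.
batch = "isbn,title,imprint\n9780000000002,Example Title,Example Press\n"

drift = schema_drift(batch, EXPECTED_COLUMNS)
if drift["missing"] or drift["unexpected"]:
    print("Schema change detected, needs review:", drift)
```

A check like this does not fix anything by itself; it just stops the
automated ingest and tells whoever is on duty which columns moved.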

Mediawiki is our default choice at the moment, but then I would still be
looking for a good workflow management model for the structure of the wiki,
especially since in my experience wikis are often a graveyard for the best
intentions.

A few places that you might try asking this question again, to see if you
can find a solution that better answers your question:

The American Society for Information Science & Technology's Research Data
Access & Preservation group. It has a lot of librarians & archivists in
it, as well as people from various research disciplines:


        http://mail.asis.org/mailman/listinfo/rdap
        http://www.asis.org/rdap/

...

The Research Data Alliance has a number of groups that might be relevant.
Here are a few that I suspect are the best fit:


        Libraries for Research Data IG
        https://rd-alliance.org/groups/libraries-research-data.html

        Reproducibility IG
        https://rd-alliance.org/groups/reproducibility-ig.html

        Research Data Provenance IG
        https://rd-alliance.org/groups/research-data-provenance.html

        Data Citation WG
        (as this fits into their 'dynamic data' problem)
        https://rd-alliance.org/groups/data-citation-wg.html

('IG' is 'Interest Group', which are long-lived. 'WG' is 'Working Group',
which are formed to solve a specific problem and then disband.)

The group 'Publishing Data Workflows' might seem to be appropriate but
it's actually 'Workflows for Publishing Data' not 'Publishing of Data
Workflows' (which falls under 'Data Provenance' and 'Data Citation').

There was a presentation at the meeting earlier this week by Andreas
Rauber in the Data Citation group on workflows using git or SQL databases
to be able to track appending or modification for CSV and similar ASCII
files.
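
To give a rough idea of what tracking appends and modifications to a CSV
file means in practice, here is a minimal sketch in Python. It is not the
method from the presentation (which relied on git or an SQL database); it
simply compares two snapshots of a file, keyed on an identifier column,
and reports which rows were appended, modified or removed. The sample
data, the 'id' column and the function names are assumptions made up for
illustration.

```python
import csv
import io


def load_rows(csv_text, key):
    """Index the rows of a CSV document by the value in its key column."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {row[key]: row for row in reader}


def diff_snapshots(old_text, new_text, key):
    """Report which keys were appended, modified or removed between snapshots."""
    old = load_rows(old_text, key)
    new = load_rows(new_text, key)
    return {
        "appended": sorted(k for k in new if k not in old),
        "modified": sorted(k for k in new if k in old and new[k] != old[k]),
        "removed": sorted(k for k in old if k not in new),
    }


# Two made-up snapshots of the same file: row 1 is edited, row 3 is appended.
OLD = "id,title\n1,Moby Dick\n2,Emma\n"
NEW = "id,title\n1,Moby-Dick\n2,Emma\n3,Ulysses\n"

changes = diff_snapshots(OLD, NEW, "id")
print(changes)
```

Storing each snapshot in git, or each row with a timestamp in a database,
gives you the same information plus the history needed to reconstruct the
file as it was on a given date.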

...

Also, I would consider this to be on-topic for Stack Exchange's "Open
Data" site (and I'm one of the moderators for the site):


        http://opendata.stackexchange.com/

-Joe

On Tue, Mar 10, 2015 at 8:10 PM, Scancella, John <j...@loc.gov> wrote:

Dave,

How are you getting the metadata streams? Are they actual stream objects,
or files, or database dumps, etc?

As for the tools, I have used a number of the ones you listed below. I
personally prefer JIRA (and it is free for non-profit). If you are ok with
editing in wiki syntax I would recommend MediaWiki (it is what powers
Wikipedia). You could also take a look at continuous deployment
technologies like virtual machines (VirtualBox), Linux containers (Docker),
and rapid deployment tools (Ansible, Salt). Of course if you are doing lots
of code changes you will want to test all of this continually (Jenkins).

John Scancella
Library of Congress, OSI

-----Original Message-----
From: Code for Libraries [mailto:CODE4LIB@LISTSERV.ND.EDU] On Behalf Of
davesgonechina
Sent: Tuesday, March 10, 2015 6:05 AM
To: CODE4LIB@LISTSERV.ND.EDU
Subject: [CODE4LIB] Data Lifecycle Tracking & Documentation Tools

Hi all,

One of my projects involves harvesting, cleaning and transforming steady
streams of metadata from numerous publishers. It's an infinite loop but
every cycle can be a little bit or significantly different. Many issue
tracking tools are designed for a linear progression that ends in
deployment, not a circular workflow, and I've not hit upon a tool or use
strategy that really fits.

The best illustration I've found so far of the type of workflow I'm
talking about is the DCC Curation Lifecycle Model
<http://www.dcc.ac.uk/sites/default/files/documents/publications/DCCLifecycle.pdf>.

Here are some things I've tried or thought about trying:

   - Git comments
   - Github Issues
   - MySQL comments
   - Bash script logs
   - JIRA
   - Trac
   - Trello
   - Wiki
   - Unfuddle
   - Redmine
   - Zendesk
   - Request Tracker
   - Basecamp
   - Asana

Thoughts?

Dave
