Re: [CODE4LIB] Publishing large datasets

2014-07-24 Thread Julia Bauder
What everybody else has said is completely true -- the type of data makes a
huge, huge difference in how you want to present it on the Web.

If it's social-sciences-type data, though, and you're interested in making
it explorable in a regular web browser, you might take a look at SDA. SDA
stands for Survey Documentation and Analysis, but it will work on any
data that you can reasonably represent in a spreadsheet-type format (rows
of cases with columns of values for different variables), even if it's an
overwhelmingly massive number of rows and columns. It's not cheap, but I
really like the user experience from the front end. (I teach a *lot* of
students to use it when I'm wearing my data services librarian hat.)
http://sda.berkeley.edu/

IASSIST (the International Association for Social Science Information
Science and Technology) is a good resource on this topic for social
sciences data:
http://www.iassistdata.org/resources/category/data-management-and-curation
Their mailing list is closed, but I'm a member, so if you're working with
social sciences data I'd be happy to post your question there and pass on
any responses.

Julia



Julia Bauder
Social Studies and Data Services Librarian
Grinnell College Libraries
Sixth Ave.
Grinnell, IA 50112

On Wed, Jul 23, 2014 at 4:29 PM, Kyle Banerjee kyle.baner...@gmail.com
wrote:

 We've been facing increasing requests to help researchers publish datasets.
 There are many dimensions to this problem, but one of them is applying
 appropriate metadata and mounting them so they can be explored with a
 regular web browser or downloaded by expert users using specialized tools.

 Datasets often are large. One that we used for a pilot project contained
 well over 10,000 objects with a total size of about 1 TB. We've been asked
 to help with much larger and more complex datasets.

 The pilot was successful but our current process is neither scalable nor
 sustainable. We have some ideas on how to proceed, but we're mostly making
 things up. Are there methods/tools/etc you've found helpful? Also, where
 should we look for ideas? Thanks,

 kyle



Re: [CODE4LIB] Publishing large datasets

2014-07-24 Thread Fleming, Declan
Hi Kyle - 

We did a series of webinars on this last year:  
http://duraspace.org/taxonomy/term/188

Declan



[CODE4LIB] Publishing large datasets

2014-07-23 Thread Kyle Banerjee
We've been facing increasing requests to help researchers publish datasets.
There are many dimensions to this problem, but one of them is applying
appropriate metadata and mounting them so they can be explored with a
regular web browser or downloaded by expert users using specialized tools.

Datasets often are large. One that we used for a pilot project contained
well over 10,000 objects with a total size of about 1 TB. We've been asked
to help with much larger and more complex datasets.

The pilot was successful but our current process is neither scalable nor
sustainable. We have some ideas on how to proceed, but we're mostly making
things up. Are there methods/tools/etc you've found helpful? Also, where
should we look for ideas? Thanks,

kyle


Re: [CODE4LIB] Publishing large datasets

2014-07-23 Thread Mills, Yvonne Maria
There are several options, depending on the type of dataset. Can you provide
a little more info? In the meantime, have you checked out DCC and Dataverse?

http://www.dcc.ac.uk/resources/how-guides/cite-datasets

http://datascience.iq.harvard.edu/dataverse


Yvonne




Re: [CODE4LIB] Publishing large datasets

2014-07-23 Thread Joe Hourcle

The tools I use are too customized for our field to be of much use to anyone
else, so I can't help on that part of the question.


I'd really recommend trying to reach out to someone working in data informatics 
in the field that the data is from, as they would have recommendations on 
specific metadata that should be captured.


The general 'data publication' community is coalescing, but it's still a bit
all over the place.  Here are some of the groups that I know about:

JISC has a 'Data Publication' mailing list:

https://www.jiscmail.ac.uk/cgi-bin/webadmin?A0=DATA-PUBLICATION

ASIST runs a 'Research Data Access & Preservation' conference and
mailing list:

http://www.asis.org/rdap/
http://mail.asis.org/mailman/listinfo/rdap

... and they put most of the presentations up on slideshare:

http://www.slideshare.net/asist_org/

The Research Data Alliance has two working groups on the topic, 
Publishing Services and Publishing Data Workflows:

https://rd-alliance.org/group/rdawds-publishing-services-wg.html

https://rd-alliance.org/group/rdawds-publishing-data-workflows-wg.html


I'm also one of the moderators of the Open Data site on Stack Exchange, which 
has some questions that might be relevant:

Let's suppose I have potentially interesting data. How to distribute?
http://opendata.stackexchange.com/q/768/263

Benefits of using CC0 over CC-BY for data
http://opendata.stackexchange.com/q/26/263

... or just ask a new question.


I'd also recommend that when you catalog your data, you consider adding
DataCite metadata, so that we can try to make it easier for others to cite
your data.  (Specific implementation recommendations for data citation are
still evolving, but general principles have been released. If you have
questions, feel free to ask me, as I think we need to clarify what we mean
on some of the items.)

http://www.datacite.org/
https://www.force11.org/datacitation
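
As a rough illustration of what adding DataCite metadata involves, here is a
minimal Python sketch that checks a catalog record for the five properties
DataCite treated as mandatory in the 3.x version of its metadata schema
(identifier, creator, title, publisher, publication year). The field names,
the helper function, and the sample record are all hypothetical, and the
schema may have changed since, so check the current DataCite documentation
before relying on this list.

```python
# Assumption: these five properties mirror the mandatory set in the DataCite
# 3.x schema. The key names are made up for this sketch, not DataCite's own.
REQUIRED_FIELDS = ("identifier", "creators", "title", "publisher", "publication_year")


def missing_datacite_fields(record):
    """Return the required field names that are absent or empty in `record`."""
    return [name for name in REQUIRED_FIELDS if not record.get(name)]


# Hypothetical record for illustration only; none of these values are real.
example = {
    "identifier": "10.1234/example-dataset",
    "creators": ["Researcher, Example"],
    "title": "Example pilot dataset",
    "publisher": "Example University Libraries",
}

# Lists whichever required fields still need to be filled in before deposit.
print(missing_datacite_fields(example))
```

A check like this could run at deposit time, so that a dataset without enough
information to be cited gets flagged before it is published.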


As I see it, you're dealing with data that's in the problem range -- if it were
larger, the department collecting the data would have a system in place
already; if it were smaller, it would be easier to manage as a single item for
deposit.


-Joe