Re: [CODE4LIB] Publishing large datasets
What everybody else has said is completely true -- the type of data makes a huge, huge difference in how you want to present it on the Web. If it's social-sciences-type data, though, and you're interested in making it explorable in a regular web browser, you might take a look at SDA. SDA stands for Survey Documentation and Analysis, but it will work on any data that you can reasonably represent in a spreadsheet-type format (rows of cases with columns of values for different variables), even if it's an overwhelmingly massive number of rows and columns. It's not cheap, but I really like the user experience from the front end. (I teach a *lot* of students to use it when I'm wearing my data services librarian hat.)

http://sda.berkeley.edu/

IASSIST (the International Association for Social Science Information Services and Technology) is a good resource on this topic for social sciences data:

http://www.iassistdata.org/resources/category/data-management-and-curation

Their mailing list is closed, but I'm a member, so if you're working with social sciences data I'd be happy to post your question there and pass on any responses.

Julia

Julia Bauder
Social Studies and Data Services Librarian
Grinnell College Libraries
Sixth Ave.
Grinnell, IA 50112

On Wed, Jul 23, 2014 at 4:29 PM, Kyle Banerjee kyle.baner...@gmail.com wrote:

We've been facing increasing requests to help researchers publish datasets. There are many dimensions to this problem, but one of them is applying appropriate metadata and mounting them so they can be explored with a regular web browser or downloaded by expert users using specialized tools. Datasets often are large. One that we used for a pilot project contained well over 10,000 objects with a total size of about 1 TB. We've been asked to help with much larger and more complex datasets. The pilot was successful but our current process is neither scalable nor sustainable. We have some ideas on how to proceed, but we're mostly making things up. Are there methods/tools/etc you've found helpful? Also, where should we look for ideas? Thanks, kyle
Re: [CODE4LIB] Publishing large datasets
Hi Kyle -

We did a series of webinars on this last year:

http://duraspace.org/taxonomy/term/188

Declan

-----Original Message-----
From: Code for Libraries [mailto:CODE4LIB@LISTSERV.ND.EDU] On Behalf Of Kyle Banerjee
Sent: Wednesday, July 23, 2014 2:29 PM
To: CODE4LIB@LISTSERV.ND.EDU
Subject: [CODE4LIB] Publishing large datasets

We've been facing increasing requests to help researchers publish datasets. There are many dimensions to this problem, but one of them is applying appropriate metadata and mounting them so they can be explored with a regular web browser or downloaded by expert users using specialized tools. Datasets often are large. One that we used for a pilot project contained well over 10,000 objects with a total size of about 1 TB. We've been asked to help with much larger and more complex datasets. The pilot was successful but our current process is neither scalable nor sustainable. We have some ideas on how to proceed, but we're mostly making things up. Are there methods/tools/etc you've found helpful? Also, where should we look for ideas? Thanks, kyle
[CODE4LIB] Publishing large datasets
We've been facing increasing requests to help researchers publish datasets. There are many dimensions to this problem, but one of them is applying appropriate metadata and mounting them so they can be explored with a regular web browser or downloaded by expert users using specialized tools. Datasets often are large. One that we used for a pilot project contained well over 10,000 objects with a total size of about 1 TB. We've been asked to help with much larger and more complex datasets. The pilot was successful but our current process is neither scalable nor sustainable. We have some ideas on how to proceed, but we're mostly making things up. Are there methods/tools/etc you've found helpful? Also, where should we look for ideas? Thanks, kyle
Re: [CODE4LIB] Publishing large datasets
There are several options - depending on the type of datasets. Can you provide a little more info?

In the meantime - Have you checked out DCC and Dataverse?

http://www.dcc.ac.uk/resources/how-guides/cite-datasets
http://datascience.iq.harvard.edu/dataverse

Yvonne

-----Original Message-----
From: Code for Libraries [mailto:CODE4LIB@LISTSERV.ND.EDU] On Behalf Of Kyle Banerjee
Sent: Wednesday, July 23, 2014 4:29 PM
To: CODE4LIB@LISTSERV.ND.EDU
Subject: [CODE4LIB] Publishing large datasets

We've been facing increasing requests to help researchers publish datasets. There are many dimensions to this problem, but one of them is applying appropriate metadata and mounting them so they can be explored with a regular web browser or downloaded by expert users using specialized tools. Datasets often are large. One that we used for a pilot project contained well over 10,000 objects with a total size of about 1 TB. We've been asked to help with much larger and more complex datasets. The pilot was successful but our current process is neither scalable nor sustainable. We have some ideas on how to proceed, but we're mostly making things up. Are there methods/tools/etc you've found helpful? Also, where should we look for ideas? Thanks, kyle
Re: [CODE4LIB] Publishing large datasets
On Jul 23, 2014, at 5:29 PM, Kyle Banerjee wrote:

We've been facing increasing requests to help researchers publish datasets. There are many dimensions to this problem, but one of them is applying appropriate metadata and mounting them so they can be explored with a regular web browser or downloaded by expert users using specialized tools. Datasets often are large. One that we used for a pilot project contained well over 10,000 objects with a total size of about 1 TB. We've been asked to help with much larger and more complex datasets. The pilot was successful but our current process is neither scalable nor sustainable. We have some ideas on how to proceed, but we're mostly making things up. Are there methods/tools/etc you've found helpful? Also, where should we look for ideas? Thanks,

The tools I use are too customized for our field to be of much use to anyone else, so I can't help on that part of the question. I'd really recommend trying to reach out to someone working in data informatics in the field that the data is from, as they would have recommendations on specific metadata that should be captured.

The general 'data publication' community is coalescing, but it's still a bit all over the place. Here are some of the groups that I know about:

JISC has a 'Data Publication' mailing list:

https://www.jiscmail.ac.uk/cgi-bin/webadmin?A0=DATA-PUBLICATION

ASIST runs a 'Research Data Access and Preservation' conference and mailing list:

http://www.asis.org/rdap/
http://mail.asis.org/mailman/listinfo/rdap

... and they put most of the presentations up on slideshare:

http://www.slideshare.net/asist_org/

The Research Data Alliance has two working groups on the topic, Publishing Services and Publishing Data Workflows:

https://rd-alliance.org/group/rdawds-publishing-services-wg.html
https://rd-alliance.org/group/rdawds-publishing-data-workflows-wg.html

I'm also one of the moderators of the Open Data site on Stack Exchange, which has some questions that might be relevant:

Let's suppose I have potentially interesting data. How to distribute?
http://opendata.stackexchange.com/q/768/263

Benefits of using CC0 over CC-BY for data
http://opendata.stackexchange.com/q/26/263

... or just ask a new question.

I'd also recommend that when you catalog your data, you consider adding DataCite metadata, so that we can try to make it easier for others to cite your data. (Specific implementation recommendations for data citation are still evolving, but general principles have been released; if you have questions, feel free to ask me, as I think we need to add some clarification to what we mean on some of the items.)

http://www.datacite.org/
https://www.force11.org/datacitation

As I see it, you're dealing with data that's in the problem range -- if it were larger, the department collecting the data would have a system in place already; if it were smaller, it would be easier to manage as a single item for deposit.

-Joe