Hadoop Driven Digital Preservation
2-4 December,
Austrian National Library, Vienna

There is just one week left to sign up for our next hackathon: 
https://hadoop-driven-digital-preservation.eventbrite.co.uk.

This hackathon will focus on using Hadoop in two digital preservation scenarios:

Web-Archiving: File Format Identification/Characterisation
A web archive usually contains a wide range of different file types. From a 
curatorial perspective, the question is: Do I need to be worried? Is there a 
risk that requires me to take action right now? The first step is to 
reliably identify and characterise the content of a web archive. Linguistic 
analysis can help categorise the “text/plain” content into more precise content 
types. A detailed analysis of “application/pdf” content can help cluster 
properties of the files and identify characteristics that are of special 
interest. Using the Hadoop framework and prepared sample projects for 
processing web archive content, we will be able to run whatever processing or 
analysis we come up with at large scale on a Hadoop cluster. Together we will 
discuss the requirements for enabling this and find out what still needs to be 
optimised.

Digital Books: Quality Assurance, text mining (OCR Quality)
The digital objects of the Austrian National Library's digital book collection 
consist of the aggregated book object with technical and descriptive metadata, 
and the images, layout and text content for the book pages. Due to the 
massive scale of digitisation in a relatively short time period and the fact 
that the digitised books are from the 18th century and older, there are 
different types of quality issues. Using the Hadoop framework, we provide the 
means to perform any kind of large-scale book processing at the book or page 
level. Linguistic analysis and language detection, for example, can help us 
determine the quality of the OCR (Optical Character Recognition), and image 
analysis can help detect technical or content-related issues with the book 
page images.

Take a look at the full agenda here: 
http://wiki.opf-labs.org/display/SP/Agenda+-+Hadoop+Driven+Digital+Preservation.

Highlights of this hackathon include:

* Talks from our guest speaker, Jimmy Lin, University of Maryland 
* Taking part in our competition for the best idea and visualisation
* A chance to gain hands-on experience carrying out identification and 
characterisation experiments
* Practitioners and developers working together to address digital preservation 
challenges
* The opportunity to share experiences and knowledge about implementing Hadoop

Who should attend?

Practitioners (digital librarians and archivists, digital curators, repository 
managers, or anyone responsible for managing digital collections): You will 
learn how Hadoop might fit your organisation and how to write requirements to 
guide development, and you will gain some hands-on experience using the tools 
yourself and finding out how they work. To get the most out of this training 
course you will ideally have some knowledge or experience of digital 
preservation.
 
Developers of all experience levels can participate, from writing your first 
Hadoop jobs to working on scalable solutions for issues identified in the 
scenarios.

We hope to see you in Vienna!

Kind Regards,

Rebecca McGuinness
Membership and Communications Manager
