[CODE4LIB] Extracted Features Dataset Now Available for 4.8 Million Volumes/1.8 Billion Pages

Dubnicek, Ryan C Fri, 08 May 2015 09:22:19 -0700

The HathiTrust Research Center is pleased to announce the release of its 
Extracted Features Dataset (v.0.2), a dataset derived from 4.8 million public 
domain volumes, totaling over 1.8 billion pages currently available in the 
HathiTrust Digital Library collection. The dataset includes over 734 billion 
words in dozens of languages and spans multiple centuries. Features are 
informative, quantified characteristics of a text, and include:



  *   Volume-level metadata
  *   Page-level features
     *   Part-of-speech-tagged token counts
     *   Header and footer identification
     *   Sentence and line count
     *   Algorithmic language detection
  *   Line-level features
     *   Beginning and end line character count
     *   Maximum length of the sequence of capital characters starting a line


These features allow for analysis of large worksets of volumes in the 
HathiTrust public domain collection, at scales previously intractable for most 
individual researchers. For example, page-level token (word) counts can be 
used to help build topic models and classifications and to perform other text 
analytics. Similarly, features can be used to evaluate the readability of a 
given volume or workset.
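
As a rough illustration of the token-count use case, the sketch below collapses 
part-of-speech-tagged counts into plain per-page word counts and sums them into 
a volume-level bag of words, the kind of input a topic model or classifier 
typically expects. The page layout (a token mapped to its per-tag counts), the 
sample data, and the function names are assumptions made for this example; 
they are not the dataset's documented schema or an HTRC tool.

```python
from collections import Counter

# Hypothetical page records: each page maps a token to its counts per
# part-of-speech tag. This layout is assumed for illustration only.
pages = [
    {"whale": {"NN": 3}, "the": {"DT": 7}, "sail": {"NN": 1, "VB": 2}},
    {"whale": {"NN": 1}, "sea": {"NN": 4}, "the": {"DT": 5}},
]


def page_token_counts(page):
    """Collapse POS-tagged counts into plain token counts for one page."""
    return Counter(
        {token: sum(pos_counts.values()) for token, pos_counts in page.items()}
    )


def volume_token_counts(pages):
    """Sum page-level counts into a single bag of words for the volume."""
    total = Counter()
    for page in pages:
        total.update(page_token_counts(page))
    return total


volume_counts = volume_token_counts(pages)
print(volume_counts.most_common(3))
```

Keeping the per-page step separate means the same counts can be fed to a 
page-level analysis, or aggregated across many volumes to describe a workset.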


How to get the data:

The entire dataset, as well as sample subsets and custom worksets, is 
available at: https://sharc.hathitrust.org/features


How to cite:

Boris Capitanu, Ted Underwood, Peter Organisciak, Sayan Bhattacharyya, Loretta 
Auvil, Colleen Fallaw, J. Stephen Downie (2015). Extracted Feature Dataset from 
4.8 Million HathiTrust Digital Library Public Domain Volumes (v0.2). [Dataset]. 
HathiTrust Research Center, doi:10.13012/j8td9v7m.


This feature dataset is provided under a Creative Commons Attribution 4.0 
International License.


About the HathiTrust Research Center:

The HTRC is a collaborative research center launched jointly by Indiana 
University and the University of Illinois, along with the HathiTrust Digital 
Library. It helps researchers meet the technical challenges of working with 
massive amounts of digital text by developing cutting-edge software tools and 
cyberinfrastructure that enable advanced computational access to the growing 
digital record of human knowledge.


For more information about the HathiTrust Research Center, visit 
http://www.hathitrust.org/htrc
