On Nov 21, 2022, at 5:12 PM, RD B <rdbea...@gmail.com> wrote:

> We (Kelvin Smith Library, Case Western Reserve University) are considering
> the ProQuest TDM Studio:
> 
>   https://about.proquest.com/en/products-services/TDM-Studio/
> 
> I was curious if anyone here had any direct experience with the system they
> could share, or if there were alternatives that the community recommends
> and why.
> 
> -- 
> R. David Beales - rdbea...@gmail.com - 732-299-0390
> Library, Earth, Sol System, Orion-Cygnus Arm of the Milky Way Galaxy,
> Laniakea Supercluster


A couple of years ago I experimented with TDM Studio, and I can report that it 
worked as advertised. 

More specifically, Studio worked like the handful of similar services: one from 
LexisNexis, one from JSTOR, and one from the HathiTrust. What does that 
mean? It means a person:

  1. searches the given collection
  2. subsets the results to a secure location
  3. analyzes the results using tools and APIs
     provided by the vendor
  4. exports the output of the analysis
  5. repeats until done

Many times the tools and APIs require a working knowledge of the Python 
programming language, and then there is the curve of learning the specific 
tools. The tools usually include a number of modeling techniques: bibliography 
creation, ngram analysis, topic modeling, and full text searching. After 
working in this area for more than a few years now, I consider these techniques 
rudimentary, and I believe additional techniques such as the application of 
grammars, semantic indexing, and collocations ought to be included.
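
To make "ngram analysis" concrete, here is a minimal sketch in plain Python. It 
is not taken from any vendor's toolkit; the function names are made up for 
illustration, and the tokenizer is deliberately crude (lowercase ASCII words 
only). Counting frequent bigrams this way is only a rough first approximation 
of collocations, which are normally ranked with an association measure.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and return its alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def ngrams(tokens, n=2):
    """Return the list of n-grams (as tuples) found in a list of tokens."""
    return list(zip(*(tokens[i:] for i in range(n))))


def count_ngrams(text, n=2):
    """Count how often each n-gram occurs in the given text."""
    return Counter(ngrams(tokenize(text), n))


if __name__ == "__main__":
    sample = "the cat sat on the mat the cat ran"
    # Print the three most frequent bigrams and their counts.
    for gram, count in count_ngrams(sample, n=2).most_common(3):
        print(" ".join(gram), count)
```

The same functions work for trigrams and beyond by changing n.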

All of the vendors have their hands tied by contract and copyright. Each vendor 
has made agreements with publishers not to freely share content, but it is not 
possible to do text mining, natural language processing, or data science with 
words sans the content. Consequently each vendor implements a variation on Step 
#2, above. The process would be a h3ll of a lot easier if the student, 
researcher, or scholar could:

  1. search content
  2. select items of interest
  3. download selected items sans click,
     save, click, save, click, save, etc.
  4. use a wide variety of GUI tools,
     command-line tools, or programming
     languages to do the analysis

Here licensing is probably the limiting factor, not copyright. 

Do I know of open source alternatives? No, not really, but I hope my Reader 
addresses some of these problems. Given a set of files of just about any number 
and just about any ilk and saved in a local folder/directory, the Reader:

  * converts the files to plain text
  * does all sorts of feature extraction against the result
  * distills the features into a data set (a "study carrel")
  * provides the means to compute against the data set, and
    the computing could be done with GUI tools (like OpenRefine
    or AntConc), command-line tools (like grep or jq), or
    programming libraries (like Python's NLTK or spaCy)
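
As a rough illustration of what "computing against the data set" can look 
like, the sketch below counts word frequencies across a folder of plain text 
files using only Python's standard library. The folder layout, the "*.txt" 
pattern, and the function name are assumptions made for the example; they do 
not describe the Reader's actual file structure or API.

```python
import re
from collections import Counter
from pathlib import Path


def word_frequencies(directory, pattern="*.txt"):
    """Count word occurrences across all files matching pattern in directory."""
    counts = Counter()
    for path in sorted(Path(directory).glob(pattern)):
        # Read each file as UTF-8, skipping any bytes that fail to decode.
        text = path.read_text(encoding="utf-8", errors="ignore")
        counts.update(re.findall(r"[a-z]+", text.lower()))
    return counts


if __name__ == "__main__":
    # "txt" is a hypothetical folder of plain text files.
    for word, count in word_frequencies("txt").most_common(10):
        print(word, count)
```

The resulting counts could just as easily be written to a CSV file and opened 
in a GUI tool, or fed to a library such as NLTK or spaCy for further work.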

In the end the Reader supports all of the modeling techniques alluded to above 
as well as a few others. Consequently, a person can search any vendor for 
content of interest, download the results (through click, click, click), and do 
analysis against the result.

Like all software, the Reader is never done and ought to be considered 
beta-ware. See:

  https://distantreader.org

HTH

P.S. David, nice signature.

--
Eric Lease Morgan
Navari Family Center for Digital Scholarship
Hesburgh Libraries
University of Notre Dame

574/631-8604
https://cds.library.nd.edu
