[Tika Wiki] Update of "GrobidJournalParser" by ChrisMattmann

Apache Wiki Thu, 13 Aug 2015 23:41:37 -0700

The "GrobidJournalParser" page has been changed by ChrisMattmann:
https://wiki.apache.org/tika/GrobidJournalParser

New page:
The GrobidJournalParser uses the 
[[http://grobid.readthedocs.org/en/latest/Introduction/|GROBID (or Grobid), GeneRation Of BIbliographic Data]] 
machine learning framework to parse PDF files and to extract information such 
as title, abstract, authors, affiliations, and keywords from journal 
publications. The parser has been integrated into Tika. Follow this guide to 
get it working on your system.

== Installing GROBID ==

You should be able to install GROBID from a Git checkout, as follows:

 1. `cd $HOME/src`
 2. `git clone https://github.com/kermitt2/grobid.git`
    ''now wait a while, the download is ~600MB''
 3. Build GROBID by typing `cd grobid && mvn install`

You can verify GROBID works by running its batch runner:

 1. `cd $HOME/src/grobid`
 2. `mkdir papers && mkdir out`, then put some PDF papers in `papers`.
 3. `java -Xmx1024m -jar grobid-core/target/grobid-core-0.3.4-SNAPSHOT.one-jar.jar -gH ./grobid-home/ -gP ./grobid-home/config/grobid.properties -dIn ./papers/ -dOut out -exe processFullText`

Check the `out` directory; you should see `*.tei.xml` files there.
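
The check above can also be scripted. The following is a minimal Python sketch, not part of GROBID or Tika: the helper name `list_tei_files` is hypothetical, and the path assumes the `$HOME/src/grobid/out` directory created in the steps above.

```python
from pathlib import Path


def list_tei_files(out_dir):
    """Return the sorted names of the *.tei.xml files found in out_dir."""
    return sorted(p.name for p in Path(out_dir).glob("*.tei.xml"))


if __name__ == "__main__":
    # Assumption: GROBID was cloned to $HOME/src/grobid as in the steps above.
    out = Path.home() / "src" / "grobid" / "out"
    names = list_tei_files(out)
    print(f"{len(names)} TEI file(s) in {out}")
    for name in names:
        print(" ", name)
```

An empty list means the batch run wrote no TEI output, which is worth checking before moving on to the Tika integration.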

== Running GROBID using Tika-App ==

Grab the latest 1.11-SNAPSHOT or later version of Tika-app and run GROBID with 
the commands below.

First, create the GrobidExtractor.properties file, which points to the GROBID 
home directory and to its grobid.properties configuration file. My file looks 
like the following:

{{{
grobid.home=/Users/mattmann/git/grobid/grobid-home
grobid.properties=/Users/mattmann/git/grobid/grobid-home/config/grobid.properties
}}}

You can download 
[[https://raw.githubusercontent.com/chrismattmann/grobidparser-resources/master/org/apache/tika/parser/journal/GrobidExtractor.properties|GrobidExtractor.properties]] 
as a sample. Better yet, clone the following GitHub project and then modify 
its GrobidExtractor.properties file accordingly.

 1. `cd $HOME/src && git clone https://github.com/chrismattmann/grobidparser-resources.git`
 2. Edit `$HOME/src/grobidparser-resources/org/apache/tika/parser/journal/GrobidExtractor.properties`
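
Both entries in the properties file derive from the GROBID home directory, so the file can be generated instead of edited by hand. This is a minimal Python sketch under that assumption; the function name `render_grobid_properties` is hypothetical, and it assumes Unix-style paths like those used in this guide.

```python
from pathlib import PurePosixPath


def render_grobid_properties(grobid_home):
    """Build the two-line GrobidExtractor.properties text for a GROBID home."""
    home = PurePosixPath(grobid_home)
    # grobid.properties lives in the config directory under GROBID home.
    config = home / "config" / "grobid.properties"
    return f"grobid.home={home}\ngrobid.properties={config}\n"


if __name__ == "__main__":
    # Reproduces the sample file shown earlier for the author's checkout.
    print(render_grobid_properties("/Users/mattmann/git/grobid/grobid-home"), end="")
```

Write the returned text to the GrobidExtractor.properties file in your grobidparser-resources checkout, substituting your own GROBID home path.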

Now you can run GROBID via Tika-app on a sample PDF file with the following 
command. Note the order of the classpath: it is extremely important to keep 
this order, because it puts Tika and its jars first and GROBID (and its large 
number of jars) last.

{{{
java -classpath $HOME/src/grobidparser-resources/:$HOME/src/tika-app/target/tika-app-1.11-SNAPSHOT.jar:$HOME/src/grobid/lib/\* \
  org.apache.tika.cli.TikaCLI \
  --config=$HOME/src/grobidparser-resources/tika-config.xml \
  -J $HOME/src/grobid/papers/ICSE06.pdf
}}}
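
Because the classpath order matters, it can help to assemble the command programmatically so the order is fixed in one place. The following Python sketch only builds and prints the command; the helper name `build_tika_command` and the `$HOME/src` layout are assumptions for illustration, not part of Tika.

```python
import os
import shlex


def build_tika_command(resources_dir, tika_app_jar, grobid_lib_dir, pdf_path):
    """Return the argv list for the Tika-app run, preserving classpath order."""
    resources_dir = resources_dir.rstrip("/")
    classpath = ":".join([
        f"{resources_dir}/",                    # 1. properties file and tika-config.xml
        tika_app_jar,                           # 2. Tika and its bundled jars
        f"{grobid_lib_dir.rstrip('/')}/*",      # 3. GROBID's jars, always last
    ])
    return [
        "java", "-classpath", classpath,
        "org.apache.tika.cli.TikaCLI",
        f"--config={resources_dir}/tika-config.xml",
        "-J", pdf_path,
    ]


if __name__ == "__main__":
    # Assumption: everything was checked out under $HOME/src.
    src = os.path.expanduser("~/src")
    cmd = build_tika_command(
        f"{src}/grobidparser-resources",
        f"{src}/tika-app/target/tika-app-1.11-SNAPSHOT.jar",
        f"{src}/grobid/lib",
        f"{src}/grobid/papers/ICSE06.pdf",
    )
    print(shlex.join(cmd))
```

The `*` in the last classpath entry is expanded by Java itself, not by the shell, which is why the shell command escapes it as `\*` and why `shlex.join` quotes it here.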

This should produce output like the following (shown here piped through 
`python -m json.tool` for pretty printing):

{{{
[
    {
        "Author": "End User Computing Services",
        "Company": "ACM",
        "Content-Length": "200435",
        "Content-Type": "application/pdf",
        "Creation-Date": "2006-02-15T21:13:58Z",
        "Last-Modified": "2006-02-15T21:16:01Z",
        "Last-Save-Date": "2006-02-15T21:16:01Z",
        "SourceModified": "D:20060215211344",
        "X-Parsed-By": [
            "org.apache.tika.parser.CompositeParser",
            "org.apache.tika.parser.journal.JournalParser"
        ],
        "X-TIKA:content": "<html 
xmlns=\"http://www.w3.org/1999/xhtml\">\n<head>\n<meta 
name=\"access_permission:extract_for_accessibility\" content=\"true\" />\n<meta 
name=\"meta:save-date\" content=\"2006-02-15T21:16:01Z\" />\n<meta 
name=\"access_permission:modify_annotations\" content=\"true\" />\n<meta 
name=\"Creation-Date\" content=\"2006-02-15T21:13:58Z\" />\n<meta 
name=\"grobid:header_Address\" content=\"Pasadena, CA 91109, USA Los Angeles, 
CA 90089, USA\" />\n<meta name=\"access_permission:fill_in_form\" 
content=\"true\" />\n<meta name=\"created\" content=\"Wed Feb 15 13:13:58 PST 
2006\" />\n<meta name=\"grobid:header_FullAffiliations\" 
content=\"[Affiliation{name='null', url='null', institutions=[California 
Institute of Technology], departments=null, laboratories=[Jet Propulsion 
Laboratory], country='USA', postCode='91109', postBox='null', region='CA', 
settlement='Pasadena', addrLine='null', marker='1', addressString='null', 
affiliationString='null', failAffiliation=false}, Affiliation{name='null', 
url='null', institutions=[University of Southern California], 
departments=[Computer Science Department], laboratories=null, country='USA', 
postCode='90089', postBox='null', region='CA', settlement='Los Angeles', 
addrLine='null', marker='2', addressString='null', 
affiliationString='null',..snip..",
        "X-TIKA:parse_time_millis": "11529",
        "access_permission:assemble_document": "true",
        "access_permission:can_modify": "true",
        "access_permission:can_print": "true",
        "access_permission:can_print_degraded": "true",
        "access_permission:extract_content": "true",
        "access_permission:extract_for_accessibility": "true",
        "access_permission:fill_in_form": "true",
        "access_permission:modify_annotations": "true",
        "created": "Wed Feb 15 13:13:58 PST 2006",
        "creator": "End User Computing Services",
        "date": "2006-02-15T21:16:01Z",
        "dc:creator": "End User Computing Services",
        "dc:format": "application/pdf; version=1.4",
        "dc:title": "Proceedings Template - WORD",
        "dcterms:created": "2006-02-15T21:13:58Z",
        "dcterms:modified": "2006-02-15T21:16:01Z",
        "grobid:header_Abstract": "Modern scientific research is increasingly 
conducted by virtual communities of scientists distributed around the world. 
The data volumes created by these communities are extremely large, and growing 
rapidly. The management of the resulting highly distributed, virtual data 
systems is a complex task, characterized by a number of formidable technical 
challenges, many of which are of a software engineering nature. In this paper 
we describe our experience over the past seven years in constructing and 
deploying OODT, a software framework that supports large, distributed, virtual 
scientific communities. We outline the key software engineering challenges that 
we faced, and addressed, along the way. We argue that a major contributor to 
the success of OODT was its explicit focus on software architecture. We 
describe several large-scale, real-world deployments of OODT, and the manner in 
which OODT helped us to address the domain-specific challenges induced by each 
deployment.",
        "grobid:header_AbstractHeader": "ABSTRACT",
        "grobid:header_Address": "Pasadena, CA 91109, USA Los Angeles, CA 
90089, USA",
        "grobid:header_Affiliation": "1 Jet Propulsion Laboratory California 
Institute of Technology ; 2 Computer Science Department University of Southern 
California",
        "grobid:header_Authors": "Chris A. Mattmann 1, 2 Daniel J. Crichton 1 
Nenad Medvidovic 2 Steve Hughes 1",
        "grobid:header_BeginPage": "-1",
        "grobid:header_Class": "class org.grobid.core.data.BiblioItem",
        "grobid:header_Email": 
"{dan.crichton,mattmann,steve.hughes}@jpl.nasa.gov ; {mattmann,neno}@usc.edu",
        "grobid:header_EndPage": "-1",
        "grobid:header_Error": "true",
        "grobid:header_FirstAuthorSurname": "Mattmann",
        "grobid:header_FullAffiliations": "[Affiliation{name='null', 
url='null', institutions=[California Institute of Technology], 
departments=null, laboratories=[Jet Propulsion Laboratory], country='USA', 
postCode='91109', postBox='null', region='CA', settlement='Pasadena', 
addrLine='null', marker='1', addressString='null', affiliationString='null', 
failAffiliation=false}, Affiliation{name='null', url='null', 
institutions=[University of Southern California], departments=[Computer Science 
Department], laboratories=null, country='USA', postCode='90089', 
postBox='null', region='CA', settlement='Los Angeles', addrLine='null', 
marker='2', addressString='null', affiliationString='null', 
failAffiliation=false}]",
        "grobid:header_FullAuthors": "[Chris A Mattmann, Daniel J Crichton, 
Nenad Medvidovic, Steve Hughes]",
        "grobid:header_Item": "-1",
        "grobid:header_Keyword": "Categories and Subject Descriptors D2 
Software Engineering, D211 Domain Specific Architectures Keywords OODT, Data 
Management, Software Architecture",
        "grobid:header_Keywords": "[D2 Software Engineering, D211 Domain 
Specific Architectures  (type:subject-headers), Keywords  
(type:subject-headers), OODT, Data Management, Software Architecture  
(type:subject-headers)]",
        "grobid:header_Language": "en",
        "grobid:header_NbPages": "-1",
        "grobid:header_OriginalAuthors": "Chris A. Mattmann 1, 2 Daniel J. 
Crichton 1 Nenad Medvidovic 2 Steve Hughes 1",
        "grobid:header_Title": "A Software Architecture-Based Framework for 
Highly Distributed and Data Intensive Scientific Applications",
        "meta:author": "End User Computing Services",
        "meta:creation-date": "2006-02-15T21:13:58Z",
        "meta:save-date": "2006-02-15T21:16:01Z",
        "modified": "2006-02-15T21:16:01Z",
        "pdf:PDFVersion": "1.4",
        "pdf:encrypted": "false",
        "producer": "Acrobat Distiller 6.0 (Windows)",
        "resourceName": "ICSE06.pdf",
        "title": "Proceedings Template - WORD",
        "xmp:CreatorTool": "Acrobat PDFMaker 6.0 for Word",
        "xmpTPg:NPages": "10"
    }
]
}}}
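
The GROBID results are the keys prefixed with `grobid:header_` in each JSON record. As a minimal sketch of how a script might pull them out, the hypothetical helper `grobid_fields` below filters on that prefix; the inline sample is a trimmed copy of the output above.

```python
import json


def grobid_fields(records):
    """Collect the grobid:header_* entries from each record, prefix removed."""
    prefix = "grobid:header_"
    return [
        {key[len(prefix):]: value
         for key, value in record.items() if key.startswith(prefix)}
        for record in records
    ]


if __name__ == "__main__":
    # Trimmed sample of the JSON list that Tika-app's -J option emits.
    sample = json.loads("""
    [
        {
            "resourceName": "ICSE06.pdf",
            "grobid:header_FirstAuthorSurname": "Mattmann",
            "grobid:header_Language": "en"
        }
    ]
    """)
    for fields in grobid_fields(sample):
        print(fields["FirstAuthorSurname"], fields["Language"])  # prints: Mattmann en
```

To use it on a real run, save the Tika-app output to a file and pass the result of `json.load` to `grobid_fields`.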
