Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Tika Wiki" for change notification.

The "GrobidJournalParser" page has been changed by ChrisMattmann:
https://wiki.apache.org/tika/GrobidJournalParser?action=diff&rev1=1&rev2=2

Comment:
- wrap up example

  Now you can run GROBID via Tika-app with the following command on a sample PDF file. Note the order of the classpath - it is extremely important to keep the order as it allows Tika and its Jars to come first, and GROBID (and its large numbers of Jars) to come last.
  {{{
- java -classpath $HOME/git/grobidparser-resources/:$HOME/src/tika-app/target/tika-app-1.11-SNAPSHOT.jar:$HOME/grobid/lib/\* org.apache.tika.cli.TikaCLI --config=$HOME/git/grobidparser-resources/tika-config.xml -J $HOME/git/grobid/papers/ICSE06.pdf
+ java -classpath $HOME/src/grobidparser-resources/:tika-app-1.11-SNAPSHOT.jar:$HOME/src/grobid/lib/\* org.apache.tika.cli.TikaCLI --config=$HOME/src/grobidparser-resources/tika-config.xml -J $HOME/src/grobid/papers/ICSE06.pdf
  }}}
  Which should produce as output (e.g., if piped to `python -mjson.tool` for pretty printing):

@@ -111, +111 @@

  ]
  }}}
+ == Will this work from Tika Server? ==
+ 
+ It sure will! Start Tika Server with the following command; as with Tika-app, the ordering of the classpath is extremely important.
+ 
+ {{{
+ java -classpath $HOME/src/grobidparser-resources/:tika-server-1.11-SNAPSHOT.jar:$HOME/src/grobid/lib/\* org.apache.tika.server.TikaServerCli --config $HOME/src/grobidparser-resources/tika-config.xml
+ }}}
+ 
+ Then, PUT a file to Tika-server like so:
+ 
+ {{{
+ curl -T $HOME/src/grobid/papers/ICSE06.pdf -H "Content-Disposition: attachment;filename=ICSE06.pdf" http://localhost:9998/rmeta
+ }}}
+ 
+ Which will output (e.g., if piped to `python -mjson.tool`):
+ 
+ {{{
+ [
+     {
+         "Author": "End User Computing Services",
+         "Company": "ACM",
+         "Content-Type": "application/pdf",
+         "Creation-Date": "2006-02-15T21:13:58Z",
+         "Last-Modified": "2006-02-15T21:16:01Z",
+         "Last-Save-Date": "2006-02-15T21:16:01Z",
+         "SourceModified": "D:20060215211344",
+         "X-Parsed-By": [
+             "org.apache.tika.parser.CompositeParser",
+             "org.apache.tika.parser.journal.JournalParser"
+         ],
+         "X-TIKA:content": "\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nProceedings Template - WORD\n\n\nA Software Architecture-Based Framework for Highly \nDistributed and Data Intensive Scientific Applications \n\n \nChris A. Mattmann1, 2 Daniel J. Crichton1 Nenad Medvidovic2 Steve Hughes1 \n\n \n1Jet Propulsion Laboratory \n\nCalifornia Institute of Technology \nPasadena, CA 91109, USA \n\n{dan.crichton,mattmann,steve.hughes}@jpl.nasa.gov \n\n2Computer Science Department \nUniversity of Southern California \n\nLos Angeles, CA 90089, USA \n{mattmann,neno}@usc.edu \n\n \nABSTRACT \nModern scientific research is increasingly conducted by virtual \ncommunities of scientists distributed around the world. The data \nvolumes created by these communities are extremely large, and \ngrowing rapidly. The management of the resulting highly \ndistributed, virtual data systems is a complex task, characterized \nby a number of formidable technical challenges, many of which \nare of a software engineering nature. In this paper we describe \nour experience over the past seven years in constructing and \ndeploying OODT, a software framework that supports large, \ndistributed, virtual scientific communities. We outline the key \nsoftware engineering challenges that we faced, and addressed, \nalong the way. We argue that a major contributor to the success of \nOODT was its explicit focus on software architecture. We \ndescribe several large-scale, real-world deployments of OODT, \nand the manner in which OODT helped us to address the domain-\nspecific challenges induced by each deployment. \n\nCategories and Subject Descriptors \nD.2 Software Engineering, D.2.11 Domain Specific Architectures \n\nKeywords \nOODT, Data Management, Software Architecture. \n\n1. INTRODUCTION \nSoftware systems of today are very large, highly complex, \n\noften widely distributed, increasingly decentralized, dynamic, and \nmobile. There are many causes behind this, spanning virtually all \nfacets of human endeavor: desired advances in education, \nentertainment, medicine, military technology, \ntelecommunications, transportation, and so on. \n\nOne major driver of software\u2019s growing complexity is \nscientific research and exploration. Today\u2019s scientists are solving \nproblems of until recently unimaginable complexity with the help \nof software. They also actively and regularly collaborate with \n\ncolleagues around the world, something that ..snip",
+         "X-TIKA:parse_time_millis": "12348",
+         "access_permission:assemble_document": "true",
+         "access_permission:can_modify": "true",
+         "access_permission:can_print": "true",
+         "access_permission:can_print_degraded": "true",
+         "access_permission:extract_content": "true",
+         "access_permission:extract_for_accessibility": "true",
+         "access_permission:fill_in_form": "true",
+         "access_permission:modify_annotations": "true",
+         "created": "Wed Feb 15 13:13:58 PST 2006",
+         "creator": "End User Computing Services",
+         "date": "2006-02-15T21:16:01Z",
+         "dc:creator": "End User Computing Services",
+         "dc:format": "application/pdf; version=1.4",
+         "dc:title": "Proceedings Template - WORD",
+         "dcterms:created": "2006-02-15T21:13:58Z",
+         "dcterms:modified": "2006-02-15T21:16:01Z",
+         "grobid:header_Abstract": "Modern scientific research is increasingly conducted by virtual communities of scientists distributed around the world. The data volumes created by these communities are extremely large, and growing rapidly. The management of the resulting highly distributed, virtual data systems is a complex task, characterized by a number of formidable technical challenges, many of which are of a software engineering nature. In this paper we describe our experience over the past seven years in constructing and deploying OODT, a software framework that supports large, distributed, virtual scientific communities. We outline the key software engineering challenges that we faced, and addressed, along the way. We argue that a major contributor to the success of OODT was its explicit focus on software architecture. We describe several large-scale, real-world deployments of OODT, and the manner in which OODT helped us to address the domain-specific challenges induced by each deployment.",
+         "grobid:header_AbstractHeader": "ABSTRACT",
+         "grobid:header_Address": "Pasadena, CA 91109, USA Los Angeles, CA 90089, USA",
+         "grobid:header_Affiliation": "1 Jet Propulsion Laboratory California Institute of Technology ; 2 Computer Science Department University of Southern California",
+         "grobid:header_Authors": "Chris A. Mattmann 1, 2 Daniel J. Crichton 1 Nenad Medvidovic 2 Steve Hughes 1",
+         "grobid:header_BeginPage": "-1",
+         "grobid:header_Class": "class org.grobid.core.data.BiblioItem",
+         "grobid:header_Email": "{dan.crichton,mattmann,steve.hughes}@jpl.nasa.gov ; {mattmann,neno}@usc.edu",
+         "grobid:header_EndPage": "-1",
+         "grobid:header_Error": "true",
+         "grobid:header_FirstAuthorSurname": "Mattmann",
+         "grobid:header_FullAffiliations": "[Affiliation{name='null', url='null', institutions=[California Institute of Technology], departments=null, laboratories=[Jet Propulsion Laboratory], country='USA', postCode='91109', postBox='null', region='CA', settlement='Pasadena', addrLine='null', marker='1', addressString='null', affiliationString='null', failAffiliation=false}, Affiliation{name='null', url='null', institutions=[University of Southern California], departments=[Computer Science Department], laboratories=null, country='USA', postCode='90089', postBox='null', region='CA', settlement='Los Angeles', addrLine='null', marker='2', addressString='null', affiliationString='null', failAffiliation=false}]",
+         "grobid:header_FullAuthors": "[Chris A Mattmann, Daniel J Crichton, Nenad Medvidovic, Steve Hughes]",
+         "grobid:header_Item": "-1",
+         "grobid:header_Keyword": "Categories and Subject Descriptors D2 Software Engineering, D211 Domain Specific Architectures Keywords OODT, Data Management, Software Architecture",
+         "grobid:header_Keywords": "[D2 Software Engineering, D211 Domain Specific Architectures (type:subject-headers), Keywords (type:subject-headers), OODT, Data Management, Software Architecture (type:subject-headers)]",
+         "grobid:header_Language": "en",
+         "grobid:header_NbPages": "-1",
+         "grobid:header_OriginalAuthors": "Chris A. Mattmann 1, 2 Daniel J. Crichton 1 Nenad Medvidovic 2 Steve Hughes 1",
+         "grobid:header_Title": "A Software Architecture-Based Framework for Highly Distributed and Data Intensive Scientific Applications",
+         "meta:author": "End User Computing Services",
+         "meta:creation-date": "2006-02-15T21:13:58Z",
+         "meta:save-date": "2006-02-15T21:16:01Z",
+         "modified": "2006-02-15T21:16:01Z",
+         "pdf:PDFVersion": "1.4",
+         "pdf:encrypted": "false",
+         "producer": "Acrobat Distiller 6.0 (Windows)",
+         "resourceName": "ICSE06.pdf",
+         "title": "Proceedings Template - WORD",
+         "xmpTPg:NPages": "10"
+     }
+ ]
+ }}}
+ 
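
The classpath ordering that the page stresses (parser resources first, then the Tika jar, then GROBID's jars last) can be made explicit by building the command programmatically. The following is only a sketch: the helper names `tika_classpath` and `tika_server_command` are hypothetical, and the directory layout under `$HOME/src` is simply the one used in the commands above.

```python
import os
import shlex


def tika_classpath(resources_dir, tika_jar, grobid_lib_dir):
    """Join classpath entries in the order the page calls for.

    Hypothetical helper: resources directory first, Tika jar second,
    GROBID's lib directory (as a JVM wildcard) last.
    """
    entries = [
        resources_dir.rstrip("/") + "/",     # GROBID parser resources directory
        tika_jar,                            # Tika and its bundled jars come before GROBID
        grobid_lib_dir.rstrip("/") + "/*",   # literal '*': the JVM expands it, not the shell
    ]
    return os.pathsep.join(entries)


def tika_server_command(src_dir, tika_jar="tika-server-1.11-SNAPSHOT.jar"):
    """Build the Tika Server argv shown on the page (layout under src_dir is assumed)."""
    src_dir = src_dir.rstrip("/")
    resources = src_dir + "/grobidparser-resources"
    classpath = tika_classpath(resources, tika_jar, src_dir + "/grobid/lib")
    return [
        "java", "-classpath", classpath,
        "org.apache.tika.server.TikaServerCli",
        "--config", resources + "/tika-config.xml",
    ]


if __name__ == "__main__":
    cmd = tika_server_command(os.path.expanduser("~/src"))
    # shlex.quote keeps the '*' from being expanded if the line is pasted into a shell
    print(" ".join(shlex.quote(arg) for arg in cmd))
```

Passing the returned list to `subprocess.run` without `shell=True` hands the `*` to the JVM unexpanded, which plays the same role as the `\*` escape in the wiki commands.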

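
The `curl -T` call above can also be made from a script that keeps only the `grobid:header_*` fields of the response. This is a minimal sketch, not part of the wiki page: `put_to_rmeta` and `grobid_headers` are hypothetical names, and the explicit `Content-Type` and `Accept` headers are assumptions added because `urllib` would otherwise send a form-encoded content type that `curl -T` does not.

```python
import json
import os
import urllib.request


def put_to_rmeta(pdf_path, url="http://localhost:9998/rmeta"):
    """PUT a file to a running Tika Server and return the parsed JSON list.

    Assumes the server was started with the GROBID config as described above.
    """
    with open(pdf_path, "rb") as handle:
        body = handle.read()
    request = urllib.request.Request(
        url,
        data=body,
        method="PUT",
        headers={
            # Same header as the curl example on the page
            "Content-Disposition": "attachment;filename=" + os.path.basename(pdf_path),
            # Assumptions, not taken from the page: a type hint and a JSON response
            "Content-Type": "application/pdf",
            "Accept": "application/json",
        },
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))


def grobid_headers(records, prefix="grobid:header_"):
    """Collect GROBID header fields from an /rmeta response, prefix stripped.

    If several records carry the same field, the first occurrence is kept.
    """
    fields = {}
    for record in records:
        for key, value in record.items():
            if key.startswith(prefix):
                fields.setdefault(key[len(prefix):], value)
    return fields
```

With the server running, `grobid_headers(put_to_rmeta(path_to_pdf))` would reduce a response like the one above to keys such as `Title`, `Authors`, and `Abstract`, leaving out the PDF and access-permission metadata.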