Hi Richard,

The code below is derived from Chris' search engine class at USC
(http://www-scf.usc.edu/~csci572/). Hopefully it will point you in the right
direction.

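        // Open the PDF directory and process each regular file in it.
        // The enclosing method must declare or handle the checked
        // exceptions that processfile throws.
        File pdfdir = new File("./some/pdf/directory");
        File[] pdfs = pdfdir.listFiles();  // null if the path is not a directory
        if (pdfs != null) {
            for (File pdf : pdfs) {
                if (pdf.isFile()) processfile(pdf);
            }
        }
    // Your process method would look something like this.
    // Imports needed: java.io.*, org.xml.sax.SAXException,
    // org.apache.tika.exception.TikaException,
    // org.apache.tika.metadata.Metadata,
    // org.apache.tika.parser.ParseContext,
    // org.apache.tika.parser.pdf.PDFParser,
    // org.apache.tika.sax.BodyContentHandler
    private void processfile(File f)
            throws IOException, SAXException, TikaException {
        PDFParser parser = new PDFParser();
        Metadata metadata = new Metadata();
        // try-with-resources (Java 7+) closes both the input stream and the
        // writer, even if parsing fails; closing the writer also flushes it
        try (FileInputStream fis = new FileInputStream(f);
             FileWriter writer = new FileWriter(f.getName() + ".content.txt")) {
            parser.parse(fis,
                    new BodyContentHandler(writer),
                    metadata,
                    new ParseContext());
        }
    }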

You can find the details of the parse call here:
http://tika.apache.org/1.5/parser.html. Let me know if you have any
questions!
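
If your directory also contains non-PDF files, or you want the output to land
in a separate folder, something like the sketch below might help. It uses only
the JDK (no Tika calls), and the names PdfBatch, listPdfs, outputFor and the
"./extracted" folder are made up for illustration, so adjust them to taste.
It only picks the files and works out the output names; the actual extraction
would still go through a Tika-based method like processfile above.

```java
import java.io.File;
import java.io.FilenameFilter;
import java.util.Arrays;
import java.util.Locale;

// Hypothetical helper: selects the PDFs in a directory and derives output names.
final class PdfBatch {

    /** Returns the regular files in dir whose names end in .pdf (any case),
     *  sorted by path; returns an empty array if dir is not a directory. */
    static File[] listPdfs(File dir) {
        File[] pdfs = dir.listFiles(new FilenameFilter() {
            @Override
            public boolean accept(File parent, String name) {
                return name.toLowerCase(Locale.ROOT).endsWith(".pdf")
                        && new File(parent, name).isFile();
            }
        });
        if (pdfs == null) {
            return new File[0];
        }
        Arrays.sort(pdfs);
        return pdfs;
    }

    /** Maps e.g. report.pdf to outDir/report.txt (or .xml, per extension). */
    static File outputFor(File pdf, File outDir, String extension) {
        String name = pdf.getName();
        int dot = name.lastIndexOf('.');
        String base = (dot > 0) ? name.substring(0, dot) : name;
        return new File(outDir, base + "." + extension);
    }

    public static void main(String[] args) {
        // Both default paths are placeholders.
        File in = new File(args.length > 0 ? args[0] : "./some/pdf/directory");
        File out = new File(args.length > 1 ? args[1] : "./extracted");
        for (File pdf : listPdfs(in)) {
            // The Tika-based extraction would be called here with this target.
            System.out.println(pdf + " -> " + outputFor(pdf, out, "txt"));
        }
    }
}
```

Run as written, it just prints which output file each PDF would map to, which
is a cheap way to check the naming before extracting anything.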

Hope that helps,
Tyler



On Thu, Jun 26, 2014 at 4:50 AM, Richard <[email protected]> wrote:

> Thanks very much Chris ... it's all working now.
> You wouldn't by chance happen to have programmatically looped through a
> directory full of PDFs and used Tika to extract the contents of each one
> into separate text or XML files? If so, what do you recommend for the
> extraction?
> Kind regards
> Richard
> > Date: Mon, 16 Jun 2014 23:03:49 -0700
> > Subject: Re: Question re installing Tika
> > From: [email protected]
> > To: [email protected]; [email protected]
> > CC: [email protected]
> >
> > Hi Richard,
> >
> > No problem at all, my attempted answers below:
> >
> >
> > -----Original Message-----
> > From: Richard <[email protected]>
> > Date: Monday, June 16, 2014 3:47 PM
> > To: Chris Mattmann <[email protected]>, "[email protected]"
> > <[email protected]>
> > Cc: "[email protected]" <[email protected]>
> > Subject: RE: Question re installing Tika
> >
> > >Thanks very much for responding to me, Chris. I hope you don't mind if I
> > >ask a few more questions about the setup process which I have done to
> > >date as follows (and by way of
> > > background I have a Windows 7 64 bit pc):
> > >
> > >
> > >1) I downloaded the tika-app-1.5.jar
> > ><http://www.apache.org/dyn/closer.cgi/tika/tika-app-1.5.jar> from
> > >http://tika.apache.org/download.html
> > >2) I was recommended by a friend to rename it to tika-app.jar, which I
> > >have done, and placed it in my c:\Users\Myusername directory
> > >3) I added the environment variable JAVA_HOME (as a system variable).
> > >4) I then brought up the cmd window, changed directory to
> > >c:\Users\Myusername and typed in  "java -jar tika-app.jar"
> > >
> > >
> > >However the gui does not appear.
> >
> > Yep, if you type java -jar tika-app.jar --help, you'll see the command
> > line output and the switches.
> > I believe to pull the GUI up you need to do:
> >
> > java -jar tika-app.jar --gui
> >
> > >
> > >
> > >
> > >I have the latest version of Java: Version 7 Update 60 but I was
> > >wondering if I needed the Java SDK to run this?
> > >
> > >
> > >Many thanks again for your help
> >
> > No problem, see above :)
> >
> > Cheers,
> > Chris
> >
> > >
> > >
> > >Richard
> > >
> > >
> > >> From: [email protected]
> > >> To: [email protected]; [email protected]
> > >> CC: [email protected]
> > >> Subject: Re: Question re installing Tika
> > >> Date: Thu, 12 Jun 2014 03:27:20 +0000
> > >>
> > >> Hi Richard,
> > >>
> > >> Hope you are well, will try and answer below:
> > >>
> > >>
> > >> -----Original Message-----
> > >>
> > >> From: Richard <[email protected]>
> > >> Date: Friday, June 6, 2014 6:07 AM
> > >> To: "[email protected]" <[email protected]>,
> > >> "[email protected]" <[email protected]>
> > >> Subject: Question re installing Tika
> > >>
> > >> >Hello
> > >> >
> > >> >I am new to the Apache suite of products and, more generally, to
> > >> >dealing with text in PDFs. In particular I am trying to install Tika
> > >> >(the tika-app-1.5.jar) as well as Solr on my Windows 7 PC.
> > >> >
> > >> >
> > >> >However I am confused about how to do the Tika installation.
> > >> >
> > >> >
> > >> >From reading various webpages (eg
> > >> >http://tika.apache.org/1.5/gettingstarted.html
> > >> ><http://tika.apache.org/1.5/gettingstarted.html>) it seems I need to
> > >> >
> > >> >1)
> > >> >Download the .jar from
> > >> >http://tika.apache.org/download.html
> > >> ><http://tika.apache.org/download.html> (do I need to put it in a
> > >> >specific Windows folder?)
> > >>
> > >> Nope you don't have to put in any specific folder, wherever you are
> > >> comfortable calling the jar from.
> > >>
> > >> >2)
> > >> >Download Maven 2 (from http://maven.apache.org/ ) and follow up the
> > >> >instructions for Windows on
> > >> >http://maven.apache.org/download.cgi#Installation
> > >>
> > >> No need to do this unless you are building from scratch.
> > >>
> > >> >3)
> > >> >Also where do I set the base directory?
> > >>
> > >> You just need to put the Apache Tika tika-app-*version*.jar file into
> > >> some folder, and then call it by doing
> > >> java -jar /path/to/tika-app-*version*.jar --help
> > >>
> > >> >
> > >> >4)
> > >> >Where do I run the command "mvn install" from? Is it the command line?
> > >>
> > >> If you are building from source, then you would run this at the
> > >> top-level directory containing files like pom.xml, tika-parent,
> > >> tika-parsers, etc.
> > >>
> > >> >
> > >> >
> > >> >Any help would be most gratefully received.
> > >>
> > >> Cheers!
> > >>
> > >> Chris
> > >>
> > >> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> > >> Chris Mattmann, Ph.D.
> > >> Chief Architect
> > >> Instrument Software and Science Data Systems Section (398)
> > >> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
> > >> Office: 168-519, Mailstop: 168-527
> > >> Email: [email protected]
> > >> WWW: http://sunset.usc.edu/~mattmann/
> > >> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> > >> Adjunct Associate Professor, Computer Science Department
> > >> University of Southern California, Los Angeles, CA 90089 USA
> > >> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> > >>
> > >>
> > >>
> > >>
> > >> >
> > >>
> > >
> > >
> > >
> > >
> >
> >
>
>
