Hi Richard,

I forgot to mention: if you're not going to be contributing to Tika, you
don't need to build it from source. You can just add the following
entry to the <dependencies> section of your pom.xml file:

<dependency>
  <groupId>org.apache.tika</groupId>
  <artifactId>tika-parsers</artifactId>
  <version>1.5</version>
</dependency>
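
And if you would rather not set up a Maven project at all, the tika-app.jar
you already have running can do the extraction, one PDF at a time. Here is a
rough sketch of how that might look. The class and method names (TikaBatch,
commandFor, outputFor, listPdfs, extract) are made up for illustration and
are not part of Tika. It assumes "java" is on your PATH and uses tika-app's
--text switch to get plain text on standard output.

```java
import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper (not part of Tika): runs tika-app.jar once per PDF.
class TikaBatch {

    // Builds the command line: java -jar <tikaJar> --text <pdf>
    static List<String> commandFor(String tikaJar, File pdf) {
        return Arrays.asList("java", "-jar", tikaJar, "--text", pdf.getPath());
    }

    // Output naming: report.pdf -> <outDir>/report.pdf.content.txt
    static File outputFor(File pdf, File outDir) {
        return new File(outDir, pdf.getName() + ".content.txt");
    }

    // Returns the regular files in dir whose names end in ".pdf" (any case).
    static List<File> listPdfs(File dir) {
        List<File> pdfs = new ArrayList<File>();
        File[] entries = dir.listFiles(); // null if dir is not a readable directory
        if (entries != null) {
            for (File f : entries) {
                if (f.isFile() && f.getName().toLowerCase().endsWith(".pdf")) {
                    pdfs.add(f);
                }
            }
        }
        return pdfs;
    }

    // Runs Tika on one PDF, redirecting its standard output into the
    // output file. Returns the exit code of the child process.
    static int extract(String tikaJar, File pdf, File outDir)
            throws IOException, InterruptedException {
        ProcessBuilder pb = new ProcessBuilder(commandFor(tikaJar, pdf));
        pb.redirectOutput(outputFor(pdf, outDir));
        pb.redirectError(ProcessBuilder.Redirect.INHERIT);
        return pb.start().waitFor();
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 3) {
            System.err.println("usage: java TikaBatch <tika-app.jar> <pdfDir> <outDir>");
            return;
        }
        File outDir = new File(args[2]);
        outDir.mkdirs(); // make sure the output directory exists
        for (File pdf : listPdfs(new File(args[1]))) {
            int exit = extract(args[0], pdf, outDir);
            System.out.println(pdf.getName() + " -> exit code " + exit);
        }
    }
}
```

You would compile it with javac and run something like
"java TikaBatch tika-app.jar pdfs out". It starts a new JVM for every file,
so it will be slower than calling the parser from your own code, but it needs
nothing beyond the jar you have already downloaded.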

Tyler


On Fri, Jun 27, 2014 at 2:44 AM, Tyler Palsulich <[email protected]>
wrote:

> Hi Richard,
>
> The code below is derived from Chris' search engine class at USC (
> http://www-scf.usc.edu/~csci572/). Hopefully it will point you in the
> right direction.
>
>         // Open all pdf files, process each one
>         File pdfdir = new File("./some/pdf/directory");
>         File[] pdfs = pdfdir.listFiles();
>         for (File pdf : pdfs) {
>             if (pdf.isFile()) processfile(pdf);
>         }
>
> // Your process method would look something like this (it needs imports
> // from java.io, org.apache.tika.metadata, org.apache.tika.parser,
> // org.apache.tika.parser.pdf, and org.apache.tika.sax):
>     private void processfile(File f) throws Exception {
>         PDFParser parser = new PDFParser();
>         Metadata metadata = new Metadata();
>         FileInputStream fis = new FileInputStream(f);
>         try {
>             // Write the extracted text to <name>.content.txt
>             FileWriter writer = new FileWriter(f.getName() + ".content.txt");
>             try {
>                 parser.parse(fis, new BodyContentHandler(writer),
>                         metadata, new ParseContext());
>                 writer.flush();
>             } finally {
>                 writer.close();
>             }
>         } finally {
>             fis.close();
>         }
>     }
>
> You can find the details of the parse call here:
> http://tika.apache.org/1.5/parser.html. Let me know if you have any
> questions!
>
> Hope that helps,
> Tyler
>
>
>
> On Thu, Jun 26, 2014 at 4:50 AM, Richard <[email protected]> wrote:
>
>> Thanks very much Chris ... it's all working now.
>> Have you by chance programmatically looped through a
>> directory full of pdfs and used Tika to extract each one's contents
>> into separate text or xml files? If so, what do you recommend for the
>> extraction?
>> Kind regards
>> Richard
>> > Date: Mon, 16 Jun 2014 23:03:49 -0700
>> > Subject: Re: Question re installing Tika
>> > From: [email protected]
>> > To: [email protected]; [email protected]
>> > CC: [email protected]
>> >
>> > Hi Richard,
>> >
>> > No problem at all, my attempted answers below:
>> >
>> >
>> > -----Original Message-----
>> > From: Richard <[email protected]>
>> > Date: Monday, June 16, 2014 3:47 PM
>> > To: Chris Mattmann <[email protected]>, "
>> [email protected]"
>> > <[email protected]>
>> > Cc: "[email protected]" <[email protected]>
>> > Subject: RE: Question re installing Tika
>> >
>> > >Thanks very much for responding to me, Chris. I hope you don't mind if I
>> > >ask a few more questions about the setup process, which I have done to
>> > >date as follows (by way of background, I have a Windows 7 64-bit PC):
>> > >
>> > >
>> > >1) I downloaded the tika-app-1.5.jar
>> > ><http://www.apache.org/dyn/closer.cgi/tika/tika-app-1.5.jar> from
>> > >http://tika.apache.org/download.html
>> > >2) I was recommended by a friend to rename it to tika-app.jar, which I
>> > >have done, and placed it in my c:\Users\Myusername directory
>> > >3) I added the environment variable JAVA_HOME (as a system variable).
>> > >4) I then brought up the cmd window, changed directory to
>> > >c:\Users\Myusername and typed in  "java -jar tika-app.jar"
>> > >
>> > >
>> > >However the gui does not appear.
>> >
>> > Yep, if you type java -jar tika-app.jar --help, you'll see the command
>> > line output and the switches.
>> > I believe to pull the GUI up you need to do:
>> >
>> > java -jar tika-app.jar --gui
>> >
>> > >
>> > >
>> > >
>> > >I have the latest version of Java: Version 7 Update 60 but I was
>> > >wondering if I needed the Java SDK to run this?
>> > >
>> > >
>> > >Many thanks again for your help
>> >
>> > No problem, see above :)
>> >
>> > Cheers,
>> > Chris
>> >
>> > >
>> > >
>> > >Richard
>> > >
>> > >
>> > >> From: [email protected]
>> > >> To: [email protected]; [email protected]
>> > >> CC: [email protected]
>> > >> Subject: Re: Question re installing Tika
>> > >> Date: Thu, 12 Jun 2014 03:27:20 +0000
>> > >>
>> > >> Hi Richard,
>> > >>
>> > >> Hope you are well, will try and answer below:
>> > >>
>> > >>
>> > >> -----Original Message-----
>> > >>
>> > >> From: Richard <[email protected]>
>> > >> Date: Friday, June 6, 2014 6:07 AM
>> > >> To: "[email protected]" <[email protected]>,
>> > >> "[email protected]" <[email protected]>
>> > >> Subject: Question re installing Tika
>> > >>
>> > >> >Hello
>> > >> >
>> > >> >I am new to the Apache suite of products and, more generally, to
>> > >> >dealing with text in pdfs. In particular I am trying to install Tika (the
>> > >> >tika-app-1.5.jar) as well as Solr on my Windows 7 pc.
>> > >> >
>> > >> >
>> > >> >However I am confused about how to do the Tika installation.
>> > >> >
>> > >> >
>> > >> >From reading various webpages (e.g.
>> > >> >http://tika.apache.org/1.5/gettingstarted.html) it seems I need to
>> > >> >
>> > >> >1)
>> > >> >Download the .jar from
>> > >> >http://tika.apache.org/download.html (do I need to put it in a
>> > >> >specific Windows folder?)
>> > >>
>> > >> Nope, you don't have to put it in any specific folder; put it wherever
>> > >> you are comfortable calling the jar from.
>> > >>
>> > >> >2)
>> > >> >Download Maven 2 (from http://maven.apache.org/ ) and follow up the
>> > >> >instructions for Windows on
>> > >> >http://maven.apache.org/download.cgi#Installation
>> > >>
>> > >> No need to do this unless you are building from scratch.
>> > >>
>> > >> >3)
>> > >> >Also where do I set the base directory?
>> > >>
>> > >> You just need to put the Apache Tika tika-app-*version*.jar file into
>> > >> some folder, and then
>> > >> call it by doing java -jar /path/to/tika-app-*version*.jar --help
>> > >>
>> > >> >
>> > >> >4)
>> > >> >Where do I run the command "mvn install" from? Is it the command line?
>> > >>
>> > >> If you are building from source, then you would run this at the
>> > >> top-level directory containing
>> > >> files like pom.xml, tika-parent, tika-parsers, etc.
>> > >>
>> > >> >
>> > >> >
>> > >> >Any help would be most gratefully received.
>> > >>
>> > >> Cheers!
>> > >>
>> > >> Chris
>> > >>
>> > >> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>> > >> Chris Mattmann, Ph.D.
>> > >> Chief Architect
>> > >> Instrument Software and Science Data Systems Section (398)
>> > >> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
>> > >> Office: 168-519, Mailstop: 168-527
>> > >> Email: [email protected]
>> > >> WWW: http://sunset.usc.edu/~mattmann/
>> > >> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>> > >> Adjunct Associate Professor, Computer Science Department
>> > >> University of Southern California, Los Angeles, CA 90089 USA
>> > >> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>> > >>
>> > >>
>> > >>
>> > >>
>> > >> >
>> > >>
>> > >
>> > >
>> > >
>> > >
>> >
>> >
>>
>>
>
>
