Hi,

 

Why not just do an os.walk or os.listdir in Python, and then call Tika for 
each file, e.g., 

 

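import json
import os

from tika import parser

src_dir = '/some/path'
out_dir = '/some/other/path'

# os.listdir returns bare names, so join them with src_dir before testing
fs = [os.path.join(src_dir, f) for f in os.listdir(src_dir)]
fs = [f for f in fs if os.path.isfile(f) and f.endswith(('.pdf', '.doc'))]

for f in fs:
    parsed = parser.from_file(f)
    # save each parsed result to its own JSON file
    out_path = os.path.join(out_dir, os.path.basename(f) + '.json')
    with open(out_path, 'w') as out:
        json.dump(parsed, out)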
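
If the files sit in nested folders, or you only want the plain text rather than the whole parsed result, a variation using os.walk might look like the following. This is only a sketch: find_documents, tika_text and extract_all are names made up for illustration, the extension list is a guess at what you need, and it assumes the 'content' key of the parsed result holds the extracted text (it can be None for files with no text).

```python
import os

# Hypothetical extension list; adjust to the file types you actually have.
EXTENSIONS = ('.pdf', '.doc', '.docx')


def find_documents(root, extensions=EXTENSIONS):
    """Yield paths of files under root whose names end in one of extensions."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in sorted(filenames):
            if name.lower().endswith(extensions):
                yield os.path.join(dirpath, name)


def tika_text(path):
    """Return the text Tika extracts from path, or '' if there is none."""
    # Imported here so the directory walking can be tried without Tika.
    from tika import parser
    return parser.from_file(path).get('content') or ''


def extract_all(src_dir, out_dir, extract=tika_text):
    """Write one .txt file per document, mirroring src_dir's layout in out_dir."""
    written = []
    for path in find_documents(src_dir):
        rel = os.path.relpath(path, src_dir)
        out_path = os.path.join(out_dir, rel + '.txt')
        # Create the matching subfolder in out_dir if it does not exist yet.
        os.makedirs(os.path.dirname(out_path), exist_ok=True)
        with open(out_path, 'w', encoding='utf-8') as out:
            out.write(extract(path))
        written.append(out_path)
    return written
```

Calling extract_all('/some/path', '/some/other/path') would then leave one text file per input document, e.g. report.pdf becomes report.pdf.txt, which should be easy to feed to another program.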

 

Cheers,

Chris

 

 

 

From: Victor Olaiya <[email protected]>
Date: Monday, August 19, 2019 at 8:28 AM
To: "Mattmann, Chris A (US 1761)" <[email protected]>
Subject: [EXTERNAL] Urgent!!! Tika-python

 

Hello, 

I sent a mail to the mailing list with no response, so I decided to mail you 
again.

I have been trying to extract text from all the PDF, DOC, etc. files in a 
directory, but that has been impossible because Tika-python does not parse 
directories, only individual files.

I was able to compress the files into a single zip file and extract that. This 
worked, but the extracted text was saved in a single file. I need the text to 
be saved in individual files so I can use them as input to another program.

 

Please, what is the best method to go about this?

Thank you Chris Mattmann,

I await your reply.
