Hi,

I'm building the current version right now. I tried with an older one, and the ToC is there. However, it comes after the titles, i.e. only the sequence is wrong. That can happen with PDFs. If you want the correct order, you need to set the sort option in Tika.
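Since you are calling Tika from Python, here is a minimal sketch of how the sort option could be switched on. It assumes the tika-python client talking to a tika-server, and that the server maps the `X-Tika-PDFsortByPosition` request header onto the PDF parser's sort-by-position setting. The function name `extract_text_sorted` and the file name in the usage note are made up for illustration, so please check the header against the Tika version you run.

```python
# Assumption: tika-server applies this request header to the PDF parser's
# sort-by-position setting. Verify against your Tika version.
SORT_HEADERS = {"X-Tika-PDFsortByPosition": "true"}


def extract_text_sorted(pdf_path):
    """Return the text of pdf_path with position sorting requested.

    Hypothetical helper, not part of Tika itself.
    """
    # Imported lazily so this module loads even without tika-python installed.
    from tika import parser

    parsed = parser.from_file(pdf_path, headers=SORT_HEADERS)
    # "content" can be None when no text was extracted.
    return parsed.get("content") or ""
```

Usage would be something like `print(extract_text_sorted("lease.pdf"))`. Sorting by position may change the order of other parts of the document too, so compare the output with and without the header.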

Tilman

PS: Are you sure this PDF isn't confidential?


ARTICLE 1
ARTICLE 2
ARTICLE 3
ARTICLE 4
ARTICLE 5
ARTICLE 6
ARTICLE 7
TABLE OF CONTENTS
Page
BASIC LEASE PROVISIONS .......................................................................................... 1 PREMISES; TERM; RENT .............................................................................................. 4 Section 2.1 Premises; Third and Fourth Core Elevator Lobbies ............................................ 4 Section 2.2 Commencement Date ......................................................................................... 5 Section 2.3 Payment of Rent; ................................................................................................ 5
...


On 06.07.2020 at 17:46, vijaya saradhi reddy wrote:
Hi Team,

Please look into the issue below; it would be great if you could suggest a solution.

Thanks,
Saradhi

---------- Forwarded message ---------
From: vijaya saradhi reddy <dsaradhire...@gmail.com>
Date: Mon, Jul 6, 2020 at 8:25 PM
Subject: Unstructured Extraction by tika(Pdf)
To: <chris.a.mattm...@jpl.nasa.gov>


Hi Chris,
Please help me with this troublesome issue.
For PDF data extraction I found Tika to be the best library compared to pdfplumber and PyMuPDF (fitz), but I am facing a small issue while trying to extract data from the PDF below:
image.png


From the above PDF, Apache Tika extracts the data as shown in the image below:

image.png


It extracts the text as shown above, but I want my output to look like the image below, i.e. extracted exactly as it appears in the PDF. The results below were produced with pdfplumber. Can I get the same result using Apache Tika? Please help me with this, as I am spending a lot of time on it.

image.png

