Hi Shahzeb,
The short answer is: not reliably. I've written about some of the
challenges of extracting structure and text from PDFs here:
https://irsg.bcs.org/informer/wp-content/uploads/OverviewOfTextExtractionFromPDFs.pdf
If you're dealing with PDF/UA, you'll have decent luck -- and there's
more that we can do to improve that in Tika. However, if you're dealing
with PDFs in the wild, and you can't control how they're generated,
solutions are non-trivial (to say the least).
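To make the font-size idea from your question concrete, here is a minimal sketch of the naive heuristic. It is not something Tika provides. It assumes you already have (text, font_size) pairs from some other extraction step, which is not shown; the function name group_by_heading and the ratio threshold are made up for illustration.

```python
from collections import Counter


def group_by_heading(runs, ratio=1.2):
    """Group (text, font_size) runs into (heading, [paragraphs]) sections.

    Hypothetical helper: `runs` must come from a tool that reports font sizes.
    """
    if not runs:
        return []

    # Assumption: the font size covering the most characters is body text.
    sizes = Counter()
    for text, size in runs:
        sizes[size] += len(text)
    body_size = sizes.most_common(1)[0][0]

    sections = []
    current = (None, [])  # text before the first heading has no title
    for text, size in runs:
        if size >= body_size * ratio:
            # A noticeably larger run starts a new section.
            if current[0] is not None or current[1]:
                sections.append(current)
            current = (text, [])
        else:
            current[1].append(text)
    if current[0] is not None or current[1]:
        sections.append(current)
    return sections


runs = [
    ("Introduction", 18.0),
    ("PDFs store glyph positions, not structure.", 11.0),
    ("Tables", 18.0),
    ("Cells are just text at coordinates.", 11.0),
]
for heading, paragraphs in group_by_heading(runs):
    print(heading, paragraphs)
```

On real-world PDFs this kind of rule tends to break quickly: headings set in bold at body size, large pull quotes, captions, and multi-column layouts all defeat a simple size threshold, which is part of why I say "not reliably".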
Tika does have hooks for Grobid (https://github.com/kermitt2/grobid),
which works fairly well on scholarly journals, and it is tuneable, but it
is not a general purpose solution.
For tables, you might take a look at https://tabula.technology/.
And, of course, there are commercial tools which may have some luck in
extracting the info you're interested in.
Please let us know if you have any follow-up questions.
Best,
Tim
On Mon, Jun 27, 2022 at 8:08 AM Muhammad Shahzeb Ali <
[email protected]> wrote:
> Hey, I am working for an org and I am using tika-python.
> You guys did an amazing job.
>
> I am using Tika to parse PDFs, and I'd like to ask how I can get the
> results as headings with their paragraphs. Can we derive titles from font
> size in the PDF and place the text below each heading, and so on? Can this
> work for tabular data in PDFs too?
>
> I am awaiting your response!
> Thank you
>
> Shahzeb
> Data Scientist