[
https://issues.apache.org/jira/browse/TIKA-4654?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18057678#comment-18057678
]
Tim Allison edited comment on TIKA-4654 at 2/10/26 8:06 PM:
------------------------------------------------------------
I've been iterating with Claude on this a bit. We're using real-world PDFs from
[this
corpus|https://digitalcorpora.s3.amazonaws.com/s3_browser.html#corpora/files/CC-MAIN-2021-31-PDF-UNTRUNCATED/]
that have marked content as the semi-gold standard, or at least a noisy input.
Some of the PDFs have absolutely pathological markup + tags.
I'm happy to have punted on this until AI can just solve it.
All kidding aside, no amount of AI short of VLMs will solve this problem, but I
think we can do a better job. This will be an opt-in mode for the PDF
parser.
> Experiment with docstrum for clustering TextPositions for PDFs
> --------------------------------------------------------------
>
> Key: TIKA-4654
> URL: https://issues.apache.org/jira/browse/TIKA-4654
> Project: Tika
> Issue Type: Task
> Reporter: Tim Allison
> Priority: Minor
>
> We currently allow users to turn on {{detectAngles}} and/or switch on
> {{sortByPosition}}. We should experiment with other methods for clustering
> text positions... perhaps add heuristics based on the clusters to identify
> headers and footers?
> Docstrum is one (aged) approach. There are likely more modern versions.
> While VLMs are amazing, we should see if we can improve on our current
> options for rebuilding the cow from the hamburger of PDFs.
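
To make the clustering idea in the description concrete, here is a rough sketch of the nearest-neighbor step that docstrum-style methods build on: link each glyph center to nearby neighbors that lie roughly along the text direction, then take connected groups as lines. It is not Tika or PDFBox code. The function names, the (x, y) tuples standing in for TextPosition centers, and the fixed thresholds are all assumptions made for illustration. Full docstrum estimates orientation and spacing from histograms of the neighbor angles and distances rather than hard-coding them.

```python
import math
from collections import defaultdict


def nearest_neighbors(points, k=5):
    """Return (i, j, distance, angle_degrees) for each point's k nearest neighbors."""
    pairs = []
    for i, (xi, yi) in enumerate(points):
        # Brute-force search; fine for a sketch, too slow for large pages.
        dists = sorted(
            (math.hypot(xj - xi, yj - yi), j)
            for j, (xj, yj) in enumerate(points)
            if j != i
        )
        for d, j in dists[:k]:
            xj, yj = points[j]
            # Fold the angle into [0, 180) so left/right direction does not matter.
            angle = math.degrees(math.atan2(yj - yi, xj - xi)) % 180.0
            pairs.append((i, j, d, angle))
    return pairs


def cluster_lines(points, k=5, angle_tol=15.0, max_gap=20.0):
    """Group point indices into lines, assuming roughly horizontal text.

    angle_tol and max_gap are hypothetical fixed thresholds; docstrum proper
    would derive the dominant angle and the within-line gap from the data.
    """
    parent = list(range(len(points)))

    def find(a):
        # Union-find lookup with path halving.
        while parent[a] != a:
            parent[a] = parent[parent[a]]
            a = parent[a]
        return a

    for i, j, d, angle in nearest_neighbors(points, k):
        off_horizontal = min(angle, 180.0 - angle)
        if off_horizontal <= angle_tol and d <= max_gap:
            parent[find(i)] = find(j)

    groups = defaultdict(list)
    for idx in range(len(points)):
        groups[find(idx)].append(idx)

    # Order each line by x, and order the lines by ascending y.
    lines = [sorted(g, key=lambda i: points[i][0]) for g in groups.values()]
    lines.sort(key=lambda g: min(points[i][1] for i in g))
    return lines
```

Given two rows of four evenly spaced points, `cluster_lines` returns two groups of four indices, one per row. Neighbor pairs that are near-vertical or diagonal are rejected by the angle test, which is what keeps adjacent lines from merging.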