[jira] [Comment Edited] (TIKA-4654) Experiment with docstrum for clustering TextPositions for PDFs

Tim Allison (Jira) Tue, 10 Feb 2026 12:07:44 -0800


    [ 
https://issues.apache.org/jira/browse/TIKA-4654?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18057678#comment-18057678
 ]


Tim Allison edited comment on TIKA-4654 at 2/10/26 8:06 PM:
------------------------------------------------------------

I've been iterating with Claude on this a bit. We're using real-world PDFs from 
[this corpus|https://digitalcorpora.s3.amazonaws.com/s3_browser.html#corpora/files/CC-MAIN-2021-31-PDF-UNTRUNCATED/]
that have marked content as the semi-gold standard, or at least a noisy input. 
Some of the PDFs have absolutely pathological markup + tags.

I'm happy to have punted on this until AI can just solve it.

All kidding aside, no amount of AI short of VLMs will solve this problem, but I 
think we can do a better job. This will be an optional opt-in mode for the PDF 
parser.


was (Author: [email protected]):
I've been iterating with Claude on this a bit. We're using real-world PDFs from 
[this corpus|https://digitalcorpora.s3.amazonaws.com/s3_browser.html#corpora/files/CC-MAIN-2021-31-PDF-UNTRUNCATED/]
that have marked content as the semi-gold standard, or at least a noisy input.

I'm happy to have punted on this until AI can just solve it.

All kidding aside, no amount of AI short of VLMs will solve this problem, but I 
think we can do a better job. This will be an optional opt-in mode for the PDF 
parser.

> Experiment with docstrum for clustering TextPositions for PDFs
> --------------------------------------------------------------
>
>                 Key: TIKA-4654
>                 URL: https://issues.apache.org/jira/browse/TIKA-4654
>             Project: Tika
>          Issue Type: Task
>            Reporter: Tim Allison
>            Priority: Minor
>
> We currently allow users to turn on {{detectAngles}} and/or switch on 
> {{sortByPosition}}. We should experiment with other methods for clustering 
> text positions... perhaps add heuristics based on the clusters to identify 
> headers and footers?
> Docstrum is one (aged) approach. There are likely more modern versions.
> While VLMs are amazing, we should see if we can improve on our current 
> options for rebuilding the cow from the hamburger of PDFs.
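
For readers unfamiliar with docstrum, here is a rough sketch of its first steps: find each glyph's k nearest neighbors, take the peak of the neighbor-angle histogram as the text orientation, then link neighbors along that orientation into lines. This is not Tika or PDFBox code. The plain (x, y) tuples stand in for TextPosition centers, and the function names, k, bin_width, angle_tol and max_gap are all hypothetical choices for illustration. The spacing histograms and block-formation steps of the full method are omitted.

```python
import math
from collections import defaultdict


def nearest_neighbors(points, k):
    """For each point, return its k closest other points as (distance, angle_deg, index)."""
    result = []
    for i, (xi, yi) in enumerate(points):
        cands = []
        for j, (xj, yj) in enumerate(points):
            if i == j:
                continue
            dx, dy = xj - xi, yj - yi
            # Fold the angle into [0, 180) so left and right neighbors agree.
            angle = math.degrees(math.atan2(dy, dx)) % 180.0
            cands.append((math.hypot(dx, dy), angle, j))
        cands.sort()
        result.append(cands[:k])
    return result


def dominant_angle(neighbors, bin_width=5.0):
    """Return the center of the most populated bin of the neighbor-angle histogram."""
    nbins = int(180 / bin_width)
    hist = defaultdict(int)
    for cands in neighbors:
        for _, angle, _ in cands:
            # Bins are centered on multiples of bin_width; 180 wraps around to 0.
            hist[int(round(angle / bin_width)) % nbins] += 1
    return max(hist, key=hist.get) * bin_width


def cluster_lines(points, k=2, angle_tol=15.0, max_gap=None):
    """Group point indices into lines by linking neighbors that lie along the dominant angle."""
    neighbors = nearest_neighbors(points, k)
    theta = dominant_angle(neighbors)
    parent = list(range(len(points)))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, cands in enumerate(neighbors):
        for dist, angle, j in cands:
            diff = abs(angle - theta)
            diff = min(diff, 180.0 - diff)  # angular distance modulo 180 degrees
            if diff <= angle_tol and (max_gap is None or dist <= max_gap):
                parent[find(i)] = find(j)

    groups = defaultdict(list)
    for i in range(len(points)):
        groups[find(i)].append(i)
    return sorted(groups.values()), theta
```

Calling cluster_lines on two horizontal rows of evenly spaced points should return one group of indices per row plus an orientation of 0 degrees. Real PDFs would need per-document tuning of k and the tolerances, and rotated or multi-column text is where the angle and spacing histograms start to matter.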



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
