[jira] [Commented] (TIKA-3118) PDFParser: totalCharsPerPage vs. actual chars per page after parsing

Jeroen Steggink (Jira) Fri, 19 Jun 2020 15:41:15 -0700


    [ 
https://issues.apache.org/jira/browse/TIKA-3118?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17140872#comment-17140872
 ]


Jeroen Steggink commented on TIKA-3118:
---------------------------------------

That's a great suggestion, Tim! I hadn't seen that one. It solves the same 
issue without changing the metadata. However, I do think pdf:charsPerPage in 
the metadata is somewhat confusing, as it's not documented, or at least not 
anywhere I could find.

Maybe I can write something about charsPerPage, and also about how to parse 
pages using a custom ContentHandler for PDFs, on Confluence or somewhere else?
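
For illustration, here is a minimal sketch of what such a handler could look 
like. It assumes the PDF parser wraps each page in a <div class="page"> 
element in its XHTML output; the class name PageCharCountHandler and its 
getters are made up for this example and are not part of Tika. It only looks 
at characters() events, so anything delivered through ignorableWhitespace() 
is not counted.

```java
import java.util.ArrayList;
import java.util.List;

import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: collects the text emitted for each page and
// exposes the number of characters actually written per page.
class PageCharCountHandler extends DefaultHandler {

    // One buffer per page, in the order the pages were seen.
    private final List<StringBuilder> pages = new ArrayList<>();

    // Element depth relative to the current page div; 0 means "outside any page".
    private int depth = 0;

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if (depth > 0) {
            // Nested element inside a page (for example a <p>).
            depth++;
        } else if ("div".equals(localName) && "page".equals(atts.getValue("class"))) {
            // Assumption: a new page starts with <div class="page">.
            pages.add(new StringBuilder());
            depth = 1;
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (depth > 0) {
            depth--;
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // Only text inside a page div is attributed to a page.
        if (depth > 0) {
            pages.get(pages.size() - 1).append(ch, start, length);
        }
    }

    public int getPageCount() {
        return pages.size();
    }

    public String getPageText(int pageIndex) {
        return pages.get(pageIndex).toString();
    }

    public int getCharCount(int pageIndex) {
        return pages.get(pageIndex).length();
    }
}
```

The handler would be passed to the parser in place of the usual body handler, 
for example as the ContentHandler argument of Parser.parse(stream, handler, 
metadata, context). Because the counts come from the text the handler 
actually receives, they should include any word and line separators the 
parser writes as character data.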

> PDFParser: totalCharsPerPage vs. actual chars per page after parsing
> --------------------------------------------------------------------
>
>                 Key: TIKA-3118
>                 URL: https://issues.apache.org/jira/browse/TIKA-3118
>             Project: Tika
>          Issue Type: Improvement
>          Components: parser
>    Affects Versions: 1.24
>            Reporter: Jeroen Steggink
>            Priority: Minor
>
> While parsing a PDF document I'd like to know the actual number of 
> characters produced per page, not the number in the document itself. While 
> totalCharsPerPage (as defined in the class AbstractPDF2XHTML) can be 
> interesting for knowing how many characters there are, for actually using 
> the extracted text it would be more useful to know the real number. 
> Currently the only thing missing for a real count is including the added 
> word separators and line separators.
> I propose adding another attribute (parsedCharsPerPage or extracted) and 
> incrementing it in the following methods in PDF2XHTML: 
> writeCharacters, writeWordSeparator and writeLineSeparator.
> One use case would be splitting the content written to a ContentHandler by 
> page, because you would then know the true number of characters written 
> for each page.
> What do you think?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
