[ https://issues.apache.org/jira/browse/TIKA-3118?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17140872#comment-17140872 ]
Jeroen Steggink commented on TIKA-3118:
---------------------------------------

That's a great suggestion, Tim! I hadn't seen that one. It solves the same issue without changing the metadata.

However, I do think pdf:charsPerPage in the metadata is somewhat confusing, as it's not documented, or at least not anywhere I could find. Maybe I can write something about charsPerPage, and also about how to parse pages using a custom ContentHandler for PDFs, on Confluence or somewhere else?

> PDFParser: totalCharsPerPage vs. actual chars per page after parsing
> --------------------------------------------------------------------
>
>                 Key: TIKA-3118
>                 URL: https://issues.apache.org/jira/browse/TIKA-3118
>             Project: Tika
>          Issue Type: Improvement
>          Components: parser
>    Affects Versions: 1.24
>            Reporter: Jeroen Steggink
>            Priority: Minor
>
> While parsing a PDF document I'd like to know the actual characters per page
> that are produced, not which are in the document itself. While the
> totalCharsPerPage (as defined in the class AbstractPDF2HTML) could be
> interesting to know how many characters there are, for actually using
> extracted text, it could be of more use to know what the actual number is.
> Currently the only part missing to a real count, is incorporating the added
> word spacing and line separators.
> I propose to create another attribute (parsedCharsPerPage or extracted) and
> have an increment in the following methods in PDF2XHTML
> writeCharacters, writeWordSeparator and writeLineSeparator.
> One use case would be to be able to split the content written in a
> ContentHandler, because you have an actual truth about the number of
> characters written for a page.
> What do you think?

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
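
For reference, the custom ContentHandler approach mentioned in the comment could look roughly like the sketch below. It is only an illustration, not code from Tika: the class name PageCharCountHandler and its getCharsPerPage() method are made up here, and it assumes the parser wraps each page in a <div class="page"> element in its XHTML output. It counts every char passed to characters() inside such a div, so word spaces and line separators written by the parser are included, as is any other whitespace the XHTML output places inside the page.

```java
import java.util.ArrayList;
import java.util.List;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler (not part of Tika): counts the characters written
// for each page, assuming every page arrives as <div class="page">...</div>.
class PageCharCountHandler extends DefaultHandler {
    private final List<Integer> charsPerPage = new ArrayList<>();
    private int depth = 0;   // element nesting depth inside the current page, 0 = outside any page
    private int current = 0; // chars seen so far on the current page

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if (depth > 0) {
            depth++; // nested element inside a page
        } else if ("div".equals(localName) && "page".equals(atts.getValue("class"))) {
            depth = 1; // a new page starts
            current = 0;
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (depth > 0 && --depth == 0) {
            charsPerPage.add(current); // the page div has just closed
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (depth > 0) {
            current += length; // counts UTF-16 code units, whitespace included
        }
    }

    // One entry per page, in document order.
    public List<Integer> getCharsPerPage() {
        return charsPerPage;
    }
}
```

An instance would be passed as the ContentHandler argument to the parser's parse call, possibly combined with a text-collecting handler if the page text itself is also needed. After parsing, getCharsPerPage() gives the per-page counts, which could then be used as offsets to split the extracted content.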