[
https://issues.apache.org/jira/browse/PDFBOX-4883?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17147229#comment-17147229
]
Alfred commented on PDFBOX-4883:
--------------------------------
One aspect got lost.
I was keeping the original string representation _only_ if it was a valid
float.
It is possible that it is not valid, and it could be replaced with a valid one:
1. The constructor could replace the value while trying to work around some
known bugs.
2. The method checkMinMaxValues could also replace the float if it was too
small or too large.
In both cases, the new value does not correspond to the original string rep.
I suggest we only keep the original string if it was valid, otherwise leave it
null.
The method formatString will lazily build a new correct string rep later, if
needed.
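
To illustrate the idea, here is a minimal sketch. It is not the patch itself: the class name LazyFloatSketch, the regex, the "--" repair and the clamping rule are all assumptions made up for this example. Only the principle is the one described above: keep the original string if it was valid and the value was not replaced, otherwise leave it null and build it lazily.

```java
import java.math.BigDecimal;
import java.util.regex.Pattern;

/** Hypothetical stand-in for illustration only; this is NOT the PDFBox COSFloat class. */
final class LazyFloatSketch {
    // Assumed notion of "valid": optional sign, then plain decimal digits.
    // The '-' is the first character of the class, so it needs no escaping.
    private static final Pattern PLAIN_DECIMAL =
            Pattern.compile("[-+]?(\\d+\\.?\\d*|\\.\\d+)");

    private final float value;
    private String valueAsString; // null until a string form is known or built

    LazyFloatSketch(String original) {
        boolean valid = PLAIN_DECIMAL.matcher(original).matches();
        // Illustrative repair only: collapse a doubled leading minus sign.
        // Any other malformed input makes parseFloat throw NumberFormatException.
        String parseable = valid ? original : original.replaceFirst("^--", "-");
        float parsed = Float.parseFloat(parseable);
        float clamped = clamp(parsed);
        this.value = clamped;
        // Keep the source text only if it was valid and the value was not replaced.
        this.valueAsString = (valid && clamped == parsed) ? original : null;
    }

    // Assumed range check: pull infinite results back to the largest finite float.
    private static float clamp(float f) {
        if (Float.isInfinite(f)) {
            return f > 0 ? Float.MAX_VALUE : -Float.MAX_VALUE;
        }
        return f;
    }

    float floatValue() {
        return value;
    }

    // Builds the string form on first use when the original text was not kept.
    String formatString() {
        if (valueAsString == null) {
            valueAsString = new BigDecimal(Float.toString(value))
                    .stripTrailingZeros().toPlainString();
        }
        return valueAsString;
    }
}
```

With this sketch, "1.50" is returned unchanged by formatString, while "--16.33" and an over-long digit string both start with a null string that is only built on the first call. BigDecimal is touched only on that slow path.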
I have also removed the escaping of the '-' char in the regex. It does not have
to be escaped.
New patch: [^PDFBOX-4883-b.patch]
Review: https://diffy.org/diff/lgrkboxc9xtg7zgdct2t65hfr
> COSFloat is extremely slow
> --------------------------
>
> Key: PDFBOX-4883
> URL: https://issues.apache.org/jira/browse/PDFBOX-4883
> Project: PDFBox
> Issue Type: Bug
> Components: PDModel
> Affects Versions: 2.0.20, 3.0.0 PDFBox
> Reporter: Alfred
> Assignee: Andreas Lehmkühler
> Priority: Major
> Labels: display, optimization, parsing, textextraction
> Fix For: 3.0.0 PDFBox
>
> Attachments: After.png, Before.png, PDFBOX-4883-b.patch,
> PDFBOX-4883-b.patch, PDFBOX-4883.patch, extreme-values-out.pdf
>
>
> I am testing text extraction from PDF and profiling the execution.
> I found that the biggest time consumer is the COSFloat class.
>
> All other improvements I suggested so far are small compared to this.
> But this is also the most complex one.
>
> I have attached the profiler output for the same text extraction, with and
> without the COSFloat changes.
> The time to extract the same text was 4 times longer with the original COSFloat,
> because of its use of BigDecimal.
> I will try to write extra tests for all cases I see in the original COSFloat
> code, if they are not already tested.
> Then I will submit a new COSFloat version for review.
>
> I think this affects parsing and displaying PDFs too, not just text
> extraction.