[ https://issues.apache.org/jira/browse/PDFBOX-1512?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14029015#comment-14029015 ]
Andreas Lehmkühler edited comment on PDFBOX-1512 at 6/12/14 10:48 AM:
----------------------------------------------------------------------
To avoid misunderstandings: IMHO the comparison itself isn't broken. It works
well, but it breaks the contract that the sort algorithm of the Collections
framework relies on.
The issue is that PDFBox does not use only the x,y values of a text position.
In some cases, when two neighboring positions are compared, the context is
taken into account as well. As a result, there are cases where the same
combination of x,y values may lead to different results if the sorting is done
in another order.
That said, it should be possible to simply replace the Collections.sort()
call with our own sort implementation (e.g. based on quicksort) using the very
same TextPositionComparator.
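A minimal sketch of what such a replacement could look like, assuming a plain in-place quicksort is acceptable. The class name QuickSortSketch is hypothetical and not part of PDFBox. Unlike the sort used by Collections.sort() on Java 7, this code never checks the comparator contract, so it cannot throw the exception reported here; the partitioning always shrinks the range, so it also terminates with an inconsistent comparator. Note that it is not stable, and with a context-dependent comparator the result still depends on the order of comparisons.

```java
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

/** Hypothetical helper (not part of PDFBox): in-place quicksort over a List. */
final class QuickSortSketch {

    private QuickSortSketch() {
    }

    /** Sorts the list in place. Best used with a random-access list such as ArrayList. */
    static <T> void sort(List<T> list, Comparator<? super T> cmp) {
        quicksort(list, cmp, 0, list.size() - 1);
    }

    private static <T> void quicksort(List<T> list, Comparator<? super T> cmp,
                                      int low, int high) {
        while (low < high) {
            int p = partition(list, cmp, low, high);
            // Recurse into the smaller part and loop over the larger one,
            // which keeps the recursion depth logarithmic.
            if (p - low < high - p) {
                quicksort(list, cmp, low, p - 1);
                low = p + 1;
            } else {
                quicksort(list, cmp, p + 1, high);
                high = p - 1;
            }
        }
    }

    /** Lomuto partition using the middle element as pivot; returns the pivot's final index. */
    private static <T> int partition(List<T> list, Comparator<? super T> cmp,
                                     int low, int high) {
        int mid = low + (high - low) / 2;
        Collections.swap(list, mid, high);
        T pivot = list.get(high);
        int store = low;
        for (int i = low; i < high; i++) {
            if (cmp.compare(list.get(i), pivot) < 0) {
                Collections.swap(list, i, store);
                store++;
            }
        }
        Collections.swap(list, store, high);
        return store;
    }
}
```

The call site would then change from Collections.sort(list, comparator) to QuickSortSketch.sort(list, comparator), with the comparator instance left untouched.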
Maybe there is some room for improvement:
The whole text is split into text positions, one for each character, so that
we have to sort all the single characters. The information about text
chunks/whole words/lines of text is lost. We could preserve that information
within the TextPosition (number of the chunk/index within the chunk) to
simplify the comparison.
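As a rough illustration of that idea, the sketch below orders characters by chunk number first and by index within the chunk second. Everything in it is hypothetical: ChunkedPosition stands in for a TextPosition carrying the two extra values, and it assumes that chunk numbers already reflect the desired reading order, which the real code would still have to establish. The point is only that a comparison on two integers is a total order and therefore satisfies the Comparator contract.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

/** Hypothetical stand-in for a TextPosition that remembers where it came from. */
final class ChunkedPosition {
    final int chunk;        // number of the chunk this character belongs to (assumed field)
    final int indexInChunk; // index of the character within its chunk (assumed field)
    final String text;

    ChunkedPosition(int chunk, int indexInChunk, String text) {
        this.chunk = chunk;
        this.indexInChunk = indexInChunk;
        this.text = text;
    }
}

final class ChunkOrderSketch {

    private ChunkOrderSketch() {
    }

    /** Total order: chunk number first, then index within the chunk. */
    static final Comparator<ChunkedPosition> BY_CHUNK_THEN_INDEX =
            Comparator.comparingInt((ChunkedPosition p) -> p.chunk)
                      .thenComparingInt(p -> p.indexInChunk);

    /** Sorts a copy of the positions and concatenates their text. */
    static String join(List<ChunkedPosition> positions) {
        List<ChunkedPosition> sorted = new ArrayList<ChunkedPosition>(positions);
        Collections.sort(sorted, BY_CHUNK_THEN_INDEX);
        StringBuilder sb = new StringBuilder();
        for (ChunkedPosition p : sorted) {
            sb.append(p.text);
        }
        return sb.toString();
    }
}
```

Because this comparator never looks at neighboring positions, Collections.sort() can be used with it on Java 7 without risking the contract exception.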
> TextPositionComparator is not compatible with Java 7
> ----------------------------------------------------
>
> Key: PDFBOX-1512
> URL: https://issues.apache.org/jira/browse/PDFBOX-1512
> Project: PDFBox
> Issue Type: Bug
> Components: Text extraction
> Affects Versions: 1.7.1
> Environment: Java 7
> Reporter: Benjamin Papez
> Assignee: Andreas Lehmkühler
> Attachments: FOP-2252.pdf, TextPositionComparator.java, Topo.pdf,
> Topo.txt, TopoContained.pdf, TopoContained.txt, TopoOverlap.pdf,
> TopoOverlap.txt, WFI_PDFParser_TextPostionComparator.txt,
> illustration-of-inconsistent-sorting.png, immo-kurier_arsenal_93x62.pdf
>
>
> The TextPositionComparator causes the following exception when running on Java 7:
> Unexpected RuntimeException from
> org.apache.tika.parser.ParserDecorator$1@9007fa2 Original cause: Comparison
> method violates its general contract!
> I think the problem is with this check:
>     if ( yDifference < .1 ||
>         (pos2YBottom >= pos1YTop && pos2YBottom <= pos1YBottom) ||
>         (pos1YBottom >= pos2YTop && pos1YBottom <= pos2YBottom))
> as it violates the contract requirement:
> The implementor must also ensure that the relation is transitive:
> ((compare(x, y)>0) && (compare(y, z)>0)) implies compare(x, z)>0.
> Finally, the implementor must ensure that compare(x, y)==0 implies that
> sgn(compare(x, z))==sgn(compare(y, z)) for all z.
> Java 7 is now strict and throws an exception when it detects that the contract is violated.
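The transitivity problem described in the quoted report can be illustrated with a simplified stand-in for the check. This is not the actual PDFBox code: the Pos class and the comparator below are hypothetical and keep only the "yDifference < .1" part, treating two positions as being on the same line when their y values differ by less than 0.1 and comparing x in that case.

```java
import java.util.Comparator;

final class ToleranceComparatorDemo {

    private ToleranceComparatorDemo() {
    }

    /** Hypothetical, stripped-down position with only x and y. */
    static final class Pos {
        final double x;
        final double y;

        Pos(double x, double y) {
            this.x = x;
            this.y = y;
        }
    }

    /** Same line (y within 0.1): order by x. Otherwise: order by y. */
    static final Comparator<Pos> TOLERANT = new Comparator<Pos>() {
        @Override
        public int compare(Pos a, Pos b) {
            if (Math.abs(a.y - b.y) < 0.1) {
                return Double.compare(a.x, b.x);
            }
            return Double.compare(a.y, b.y);
        }
    };

    /** True if a > b and b > c hold but a > c does not. */
    static boolean violatesTransitivity(Pos a, Pos b, Pos c) {
        return TOLERANT.compare(a, b) > 0
                && TOLERANT.compare(b, c) > 0
                && !(TOLERANT.compare(a, c) > 0);
    }

    public static void main(String[] args) {
        Pos a = new Pos(3.0, 0.00);
        Pos b = new Pos(2.0, 0.08); // within 0.1 of a, smaller x: a > b
        Pos c = new Pos(1.0, 0.16); // within 0.1 of b, smaller x: b > c
        // a and c differ by 0.16 in y, so y decides and a < c.
        System.out.println(violatesTransitivity(a, b, c)); // prints "true"
    }
}
```

Each pair of neighbors looks like it is on the same line, but the first and last positions do not, so the chain a > b > c ends with a < c. This is the kind of inconsistency that the Java 7 sort can detect and report.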