[ 
https://issues.apache.org/jira/browse/PDFBOX-1512?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14029015#comment-14029015
 ] 

Andreas Lehmkühler commented on PDFBOX-1512:
--------------------------------------------

To avoid misunderstandings: IMHO the comparison itself isn't broken. It works 
well, but it breaks the contract that the sort algorithm of the Collections 
framework relies on.

The issue is that PDFBox doesn't use only the x,y values of a text position. In 
some cases, when two neighboring positions are compared, the context is taken 
into account as well. So there are cases where the same combination of x,y 
values may lead to a different result if the sorting is done in another order.
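
As a simplified illustration of how a comparison can work well pairwise and 
still break the contract, the sketch below models only a tolerance check 
similar to the "yDifference < .1" test quoted in the issue description. It is 
not the PDFBox code; the class name, the constant and the sample values are 
made up for the example.

```java
import java.util.Comparator;

// Minimal illustration, not PDFBox code: two y values closer than a
// tolerance are treated as lying on the same line.
final class ToleranceDemo {
    static final float TOLERANCE = 0.1f; // hypothetical threshold

    static final Comparator<Float> SAME_LINE = (a, b) ->
            Math.abs(a - b) < TOLERANCE ? 0 : Float.compare(a, b);

    public static void main(String[] args) {
        float x = 0.00f, y = 0.08f, z = 0.16f;
        System.out.println(SAME_LINE.compare(x, y)); // 0: x and y tie
        System.out.println(SAME_LINE.compare(y, z)); // 0: y and z tie
        System.out.println(SAME_LINE.compare(x, z)); // -1: but x sorts before z
    }
}
```

Here compare(x, y) == 0, yet sgn(compare(x, z)) differs from 
sgn(compare(y, z)), which is exactly what the Comparator contract forbids.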

So, it should be possible to replace the Collections.sort() call with our own 
sort implementation (e.g. based on quicksort) using the very same 
TextPositionComparator.
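
A rough sketch of what such a replacement could look like is shown below. It 
is only an assumption about the approach, not an existing PDFBox class: the 
name SimpleQuickSort is invented, and the comparator is passed in as a 
parameter so that any Comparator can be reused unchanged.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

// Sketch only: a plain in-place quicksort that accepts any Comparator.
// It has no internal consistency check, so an inconsistent comparator
// yields some ordering instead of an exception.
final class SimpleQuickSort {
    static <T> void sort(List<T> list, Comparator<? super T> cmp) {
        sort(list, cmp, 0, list.size() - 1);
    }

    private static <T> void sort(List<T> list, Comparator<? super T> cmp,
                                 int lo, int hi) {
        while (lo < hi) {
            int p = partition(list, cmp, lo, hi);
            // Recurse into the smaller part, loop on the larger one,
            // which keeps the recursion depth logarithmic.
            if (p - lo < hi - p) {
                sort(list, cmp, lo, p - 1);
                lo = p + 1;
            } else {
                sort(list, cmp, p + 1, hi);
                hi = p - 1;
            }
        }
    }

    // Lomuto partition using the middle element as pivot.
    private static <T> int partition(List<T> list, Comparator<? super T> cmp,
                                     int lo, int hi) {
        Collections.swap(list, lo + (hi - lo) / 2, hi);
        T pivot = list.get(hi);
        int store = lo;
        for (int i = lo; i < hi; i++) {
            if (cmp.compare(list.get(i), pivot) < 0) {
                Collections.swap(list, i, store);
                store++;
            }
        }
        Collections.swap(list, store, hi);
        return store;
    }

    public static void main(String[] args) {
        List<Integer> values = new ArrayList<>(Arrays.asList(5, 2, 8, 1, 3));
        sort(values, Comparator.<Integer>naturalOrder());
        System.out.println(values); // [1, 2, 3, 5, 8]
    }
}
```

One caveat: unlike Collections.sort(), this quicksort is not stable, so 
elements that compare as equal may change their relative order.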

Maybe there is some room for improvement: 
The whole text is split into text positions, one for each character, so that 
we have to sort all the single characters. The information about text chunks, 
whole words and lines of text is lost. We could preserve that information 
within the TextPosition (number of the chunk / index within the chunk) to 
simplify the comparison.
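
One possible reading of that idea, as a sketch only: every name below 
(ChunkedPosition, chunkNumber, indexInChunk, chunkAware) is hypothetical and 
not part of PDFBox. Characters from the same chunk are ordered by their index 
alone, and only characters from different chunks fall back to a positional 
comparison. Whether the combined comparator fulfils the contract still depends 
on the fallback.

```java
import java.util.Comparator;

// Hypothetical sketch: a text position that remembers its chunk.
final class ChunkedPosition {
    final int chunkNumber;  // number of the chunk the character came from
    final int indexInChunk; // index of the character within that chunk
    final float x;          // stand-in for the real position data

    ChunkedPosition(int chunkNumber, int indexInChunk, float x) {
        this.chunkNumber = chunkNumber;
        this.indexInChunk = indexInChunk;
        this.x = x;
    }

    // Same chunk: the index decides. Different chunks: ask the fallback.
    static Comparator<ChunkedPosition> chunkAware(
            Comparator<ChunkedPosition> fallback) {
        return (a, b) -> a.chunkNumber == b.chunkNumber
                ? Integer.compare(a.indexInChunk, b.indexInChunk)
                : fallback.compare(a, b);
    }

    public static void main(String[] args) {
        Comparator<ChunkedPosition> byX = (a, b) -> Float.compare(a.x, b.x);
        Comparator<ChunkedPosition> cmp = chunkAware(byX);
        ChunkedPosition first = new ChunkedPosition(7, 0, 10.0f);
        ChunkedPosition second = new ChunkedPosition(7, 1, 9.9f); // overlapping glyph
        System.out.println(cmp.compare(first, second) < 0); // true: the index decides
    }
}
```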


> TextPositionComparator is not compatible with Java 7
> ----------------------------------------------------
>
>                 Key: PDFBOX-1512
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-1512
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 1.7.1
>         Environment: Java 7
>            Reporter: Benjamin Papez
>            Assignee: Andreas Lehmkühler
>         Attachments: FOP-2252.pdf, TextPositionComparator.java, Topo.pdf, 
> Topo.txt, TopoContained.pdf, TopoContained.txt, TopoOverlap.pdf, 
> TopoOverlap.txt, WFI_PDFParser_TextPostionComparator.txt, 
> illustration-of-inconsistent-sorting.png, immo-kurier_arsenal_93x62.pdf
>
>
> The TextPositionComparator causes the following exception running on Java 7: 
> Unexpected RuntimeException from 
> org.apache.tika.parser.ParserDecorator$1@9007fa2 Original cause: Comparison 
> method violates its general contract!
> I think the problem is with this check:
> if ( yDifference < .1 ||
>     (pos2YBottom >= pos1YTop && pos2YBottom <= pos1YBottom) ||
>     (pos1YBottom >= pos2YTop && pos1YBottom <= pos2YBottom))
> as it violates the contract requirement:
> The implementor must also ensure that the relation is transitive: 
> ((compare(x, y)>0) && (compare(y, z)>0)) implies compare(x, z)>0.
> Finally, the implementor must ensure that compare(x, y)==0 implies that 
> sgn(compare(x, z))==sgn(compare(y, z)) for all z.
> Java 7 now is strict and throws exceptions when the contract is violated.



--
This message was sent by Atlassian JIRA
(v6.2#6252)
