[ https://issues.apache.org/jira/browse/PDFBOX-2996?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15030659#comment-15030659 ]
Lars Torunski commented on PDFBOX-2996:
---------------------------------------

I can reproduce my test results as documented in diff-delta.png. Comparing the outputs with WinDiff, I also see the odd differences and the glyphs that neither of us understands.

In my opinion, the number of deltas can be used as a measure for comparing the sorting algorithms. If the legacy merge sort, which was used until PDFBOX-1512, is taken as the baseline, then bubble sort should be used as its substitute. Otherwise, the iterative quick sort with the right index chosen for the pivot is the best choice to replace the current recursive quick sort.

This would solve PDFBOX-2996, but remember that Java 5 and 6 produce different text extraction results than Java 7+ on certain PDF files.

> StackOverflow in Quicksort
> --------------------------
>
>                 Key: PDFBOX-2996
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-2996
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 1.8.10, 2.0.0
>         Environment: Java 7
>            Reporter: Manuel Aristaran
>         Attachments: 001991.pdf, Lars-v0-PDFBOX-2996.patch,
> Lars-v1-PDFBOX-2996.patch, Lars-v2-PDFBOX-2996.patch, QuickSort.java,
> TestSortingAlgorithms.java, artikel1_20_arab.pdf-sorted-bubble.txt,
> artikel1_20_arab.pdf-sorted-diff.txt,
> artikel1_20_arab.pdf-sorted-iter-withRightPivot.txt,
> artikel1_20_arab.pdf-sorted-iter.txt,
> artikel1_20_arab.pdf-sorted-java8-legacyMergeSort.txt,
> artikel1_20_arab.pdf-sorted-java8-timsort.txt,
> artikel1_20_arab.pdf-sorted-qs-iterative-withMiddlePivot.txt,
> artikel1_20_arab.pdf-sorted-qs-iterative-withRightPivot.txt,
> artikel1_20_arab.pdf-sorted-qs-recursive.txt,
> artikel1_20_arab.pdf-sorted-rekur.txt, diff-delta.png, failing_sort.pdf,
> quicksort.patch
>
>
> Running PDFTextStripper through ExtractText triggers a StackOverflow
> exception in the QuickSort implementation for [this particular
> document|https://www.dropbox.com/s/6crie7y5gqadwa5/1.pdf?dl=0].
> To reproduce: {{java -jar pdfbox-app-1.8.11-SNAPSHOT.jar ExtractText -sort
> failing_sort.pdf}}
> (Related to PDFBOX-1512)
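
For readers following the thread, here is a minimal sketch of what an iterative quick sort looks like. It is not the code from the attached patches or from QuickSort.java. The class name IterativeQuickSort and its methods are hypothetical, and it reads "right index" as the right end of each range, which is only one interpretation (the attachments also mention a middle-pivot variant). The point it illustrates is that pending ranges are kept on an explicit heap-allocated stack, so a deep partition chain cannot overflow the call stack.

```java
import java.util.ArrayDeque;
import java.util.Collections;
import java.util.Comparator;
import java.util.Deque;
import java.util.List;

// Hypothetical sketch for illustration; not the PDFBox implementation.
final class IterativeQuickSort {

    private IterativeQuickSort() {
    }

    /** Sorts the list in place using an explicit stack of index ranges instead of recursion. */
    static <T> void sort(List<T> list, Comparator<? super T> cmp) {
        if (list.size() < 2) {
            return;
        }
        // Each entry is an inclusive {low, high} range that still needs sorting.
        Deque<int[]> ranges = new ArrayDeque<int[]>();
        ranges.push(new int[] {0, list.size() - 1});
        while (!ranges.isEmpty()) {
            int[] range = ranges.pop();
            int low = range[0];
            int high = range[1];
            if (low >= high) {
                continue; // zero or one element, nothing to do
            }
            int pivotIndex = partition(list, cmp, low, high);
            ranges.push(new int[] {low, pivotIndex - 1});
            ranges.push(new int[] {pivotIndex + 1, high});
        }
    }

    /** Lomuto partition with the pivot taken from the right end of the range (an assumption). */
    private static <T> int partition(List<T> list, Comparator<? super T> cmp, int low, int high) {
        T pivot = list.get(high);
        int store = low;
        for (int i = low; i < high; i++) {
            if (cmp.compare(list.get(i), pivot) < 0) {
                Collections.swap(list, store, i);
                store++;
            }
        }
        // Move the pivot to its final position between the two partitions.
        Collections.swap(list, store, high);
        return store;
    }
}
```

Like any quick sort, this sketch is not stable, so elements that compare as equal may come out in a different order than with a merge sort; that is consistent with the extraction differences discussed above.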