[ https://issues.apache.org/jira/browse/PDFBOX-5308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17435638#comment-17435638 ]
Alistair Oldfield edited comment on PDFBOX-5308 at 10/28/21, 7:31 PM: ---------------------------------------------------------------------- I have noticed when using documents in the wild, that QuickSort is *occasionally* on fire when running with profiling on. This led me down the sorting rabbit hole where I made a few changes and have noticed a subjective improvement. When we are talking about a single document, then the improvement is obviously negligible, however in larger apps, running in parallel batches, I believe the difference is real. In response to your question, I patched together a very quick and dirty benchmark test. Not exhaustive in any way. I am not measuring time, but rather comparisons when sorting. Extending the existing position Comparator: {code:java} public class TextPositionCountComparator extends TextPositionComparator { public int count = 0; @Override public int compare(TextPosition pos1, TextPosition pos2) { count++; return super.compare(pos1, pos2); } } {code} then modifying PDFTextStripper.writePage(): {code:java} // because the TextPositionComparator is not transitive, but // JDK7+ enforces transitivity on comparators, we need to use // a custom quicksort implementation (which is slower, unfortunately). /* if (useCustomQuickSort) { QuickSort.sort(textList, comparator); } else { Collections.sort(textList, comparator); } */ TextPositionCountComparator cmp1 = new TextPositionCountComparator(); SortUtils.sort(textList, cmp1); System.out.println("NativeSort: " + cmp1.count); TextPositionCountComparator cmp2 = new TextPositionCountComparator(); QuickSort.sort(textList, cmp2); System.out.println("QuickSort: " + cmp2.count); TextPositionCountComparator cmp3 = new TextPositionCountComparator(); MergeSort.sort(textList, cmp3); System.out.println("MergeSort: " + cmp3.count); {code} Then in the Driver: {code:java} String text = stripper.getText(doc); System.out.println("TEXT LENGTH: "+text.length()); System.out.println("------------------------");{code} I ran this on 48 PDFs I had handy. Attached is the console out. For each page of each doc you'll see something like: NativeSort: 3264 QuickSort: 35282 MergeSort: 16272 And then at the end of each PDF text extraction something like: TEXT LENGTH: 33617 ------------------------ The attached console out has about 8K lines, but they mostly roughly follow the pattern above in terms of numbers. [^sort-benchmark.txt] was (Author: alistairo): I have noticed when using documents in the wild, that QuickSort is on fire when running with profiling on. This led me down the sorting rabbit hole where I made a few changes and have noticed a subjective improvement. When we are talking about a single document, then the improvement is obviously negligible, however in larger apps, running in parallel batches, I believe the difference is real. In response to your question, I patched together a very quick and dirty benchmark test. Not exhaustive in any way. I am not measuring time, but rather comparisons when sorting. Extending the existing position Comparator: {code:java} public class TextPositionCountComparator extends TextPositionComparator { public int count = 0; @Override public int compare(TextPosition pos1, TextPosition pos2) { count++; return super.compare(pos1, pos2); } } {code} then modifying PDFTextStripper.writePage(): {code:java} // because the TextPositionComparator is not transitive, but // JDK7+ enforces transitivity on comparators, we need to use // a custom quicksort implementation (which is slower, unfortunately). /* if (useCustomQuickSort) { QuickSort.sort(textList, comparator); } else { Collections.sort(textList, comparator); } */ TextPositionCountComparator cmp1 = new TextPositionCountComparator(); SortUtils.sort(textList, cmp1); System.out.println("NativeSort: " + cmp1.count); TextPositionCountComparator cmp2 = new TextPositionCountComparator(); QuickSort.sort(textList, cmp2); System.out.println("QuickSort: " + cmp2.count); TextPositionCountComparator cmp3 = new TextPositionCountComparator(); MergeSort.sort(textList, cmp3); System.out.println("MergeSort: " + cmp3.count); {code} Then in the Driver: {code:java} String text = stripper.getText(doc); System.out.println("TEXT LENGTH: "+text.length()); System.out.println("------------------------");{code} I ran this on 48 PDFs I had handy. Attached is the console out. For each page of each doc you'll see something like: NativeSort: 3264 QuickSort: 35282 MergeSort: 16272 And then at the end of each PDF text extraction something like: TEXT LENGTH: 33617 ------------------------ The attached console out has about 8K lines, but they mostly roughly follow the pattern above in terms of numbers. [^sort-benchmark.txt] > Prefer MergeSort over QuickSort and try native TimSort first (with > explanation) > ------------------------------------------------------------------------------- > > Key: PDFBOX-5308 > URL: https://issues.apache.org/jira/browse/PDFBOX-5308 > Project: PDFBox > Issue Type: Improvement > Components: Text extraction > Affects Versions: 2.0.24 > Reporter: Alistair Oldfield > Priority: Minor > Attachments: sort-benchmark.txt > > > Hello! > I have 2 related proposals for improvement (so I am logging as one ticket, I > am happy to split them if requested, however). > 1) > Propose to use an iterative MergeSort implementation over an iterative > QuickSort implementation in TextPosition sorting during text extraction. > The reason: while QuickSort and MergeSort generally perform O(NlogN), and > while MergeSort uses slightly more space (O(N)), QuickSort's worst-case > performance is O(n2) and this case will, unfortunately, happen in (arguably) > very common cases during text extraction: when the list is already sorted (or > nearly sorted). Many PDFs will already stream text in the order of sorting. > In many use-cases, QuickSort is among the worst-performing sorting algorithms > for this particular task. > This is why java's native sort on Collections (using TimSort - which assumes > a high frequency of pre-sorted lists in real-world data) is ideal for many > PDF text extraction sorting scenarios, and is a shame it is not being used. > This brings me to the next part of the improvement proposal: > 2) > As the TextPositionComparator is not transitive, the current implementation > to avoid JDK7+ enforcing transitivity, and throwing the "Comparison method > violates its general contract" IllegalArgumentException is to check for java > version, and if lower than 7, then ALWAYS use QuickSort. > As TimSort often "approaches" O(N) for many PDF extraction use-cases, and as > TextPositionComparator will not always cause an IllegalArgumentException > during many sorting tasks, I propose to use java's native sort (which is > relatively cheap), and catch the IllegalArgumentException (in many cases will > be thrown early - not late), and then use whichever preferred "legacy" > sorting algorithm on that failure. > Here is my implementation to show what I mean: > > > {code:java} > import java.util.Collections; > import java.util.Comparator; > import java.util.List; > public class SortUtils { > public static <T> void sort(List<T> list, Comparator<T> comparator) { > try{ > Collections.sort(list, comparator); > }catch(java.lang.IllegalArgumentException e) { > MergeSort.sort(list, comparator); > } > } > } > > {code} > This approach (even when used on jdk7+) has a significant impact on sorting > costs on real-world PDFs (even with the occasional "double-attempt"). > I am also happy to share the MergeSort implementation I am using (which > follows the sort contract for generics). It is iterative (not recursive). It > is borrowed from here: > [https://github.com/vlab-cs-ucsb/cashew/blob/master/jpf-security/src/examples/benchmarks/MergeSortIterative.java] > > It has some slight tweaks (uses System.arraycopy, etc,) and is > "generic-ified". Please just let me know. > I am also curious to know any reasons (which I may have missed), why > QuickSort is the preferred sort (if the fact that many lists are mostly > pre-sorted is already known and considered)? > Thanks! > > > -- This message was sent by Atlassian Jira (v8.3.4#803005) --------------------------------------------------------------------- To unsubscribe, e-mail: dev-unsubscr...@pdfbox.apache.org For additional commands, e-mail: dev-h...@pdfbox.apache.org