[jira] [Created] (PDFBOX-5308) Prefer MergeSort over QuickSort and try native TimSort first (with explanation)

Alistair Oldfield (Jira) Wed, 27 Oct 2021 01:11:11 -0700

Alistair Oldfield created PDFBOX-5308:
-----------------------------------------


             Summary: Prefer MergeSort over QuickSort and try native TimSort 
first (with explanation)
                 Key: PDFBOX-5308
                 URL: https://issues.apache.org/jira/browse/PDFBOX-5308
             Project: PDFBox
          Issue Type: Improvement
          Components: Text extraction
    Affects Versions: 2.0.24
            Reporter: Alistair Oldfield


Hello!

I have 2 related proposals for improvement (so I am logging as one ticket, I am 
happy to split them if requested, however).

1)

Propose to use an iterative MergeSort implementation over an iterative 
QuickSort implementation in TextPosition sorting during text extraction.

The reason: while QuickSort and MergeSort generally perform O(NlogN), and while 
MergeSort uses slightly more space (O(n)), QuickSort's worst-case performance 
is O(n2) and this case will, unfortunately, happen in (arguably) very common 
cases during text extraction: when the list is already sorted (or nearly 
sorted). Many PDFs will already stream text in the order of sorting. In many 
use-cases, QuickSort is among the worst-performing sorting algorithms for this 
particular task.

This is why java's native sort on Collections (using TimSort - which assumes a 
high frequency of pre-sorted lists in real-world data) is ideal for many PDF 
text extraction sorting scenarios, and is a shame it is not being used. This 
brings me to the next part of the improvement proposal:

2) 

As the TextPositionComparator is not transitive, the current implementation to 
avoid JDK7+ enforcing transitivity, and throwing the "Comparison method 
violates its general contract" IllegalArgumentException is to check for java 
version, and if lower than 7, then ALWAYS use QuickSort.

As TimSort often "approaches" O(n) for many PDF extraction use-cases, and as 
TextPositionComparator will not always cause an IllegalArgumentException during 
many sorting tasks, I propose to use java's native sort (which is relatively 
cheap), and catch the IllegalArgumentException (in many cases will be thrown 
early - not late), and then use whichever preferred "legacy" sorting algorithm 
on that failure.

Here is my implementation to show what I mean:

 

 
{code:java}
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

public class SortUtils {
 public static <T> void sort(List<T> list, Comparator<T> comparator) {
   try{    
    Collections.sort(list, comparator);    
   }catch(java.lang.IllegalArgumentException e) {    
     MergeSort.sort(list, comparator);    
  } 
 }
}
 
{code}
This approach  (even when used on jdk7+) has a significant impact on sorting 
costs on real-world PDFs (even with the occasional "double-attempt").

I am also happy to share the MergeSort implementation I am using (which follows 
the sort contract for generics). It is iterative (not recursive).  It is 
borrowed from here:

[https://github.com/vlab-cs-ucsb/cashew/blob/master/jpf-security/src/examples/benchmarks/MergeSortIterative.java]

 

It has some slight tweaks (uses System.arraycopy, etc,) and is "generic-ified". 
Please just let me know. 

I am also curious to know any reasons (which I may have missed), why QuickSort 
is the preferred sort (if the fact that many lists are mostly pre-sorted is 
already known and considered)?

Thanks!

 

 

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[jira] [Created] (PDFBOX-5308) Prefer MergeSort over QuickSort and try native TimSort first (with explanation)

Reply via email to