Alistair Oldfield created PDFBOX-5308:
-----------------------------------------
Summary: Prefer MergeSort over QuickSort and try native TimSort
first (with explanation)
Key: PDFBOX-5308
URL: https://issues.apache.org/jira/browse/PDFBOX-5308
Project: PDFBox
Issue Type: Improvement
Components: Text extraction
Affects Versions: 2.0.24
Reporter: Alistair Oldfield
Hello!
I have 2 related proposals for improvement (so I am logging as one ticket, I am
happy to split them if requested, however).
1)
Propose to use an iterative MergeSort implementation over an iterative
QuickSort implementation in TextPosition sorting during text extraction.
The reason: while QuickSort and MergeSort generally perform O(NlogN), and while
MergeSort uses slightly more space (O(n)), QuickSort's worst-case performance
is O(n2) and this case will, unfortunately, happen in (arguably) very common
cases during text extraction: when the list is already sorted (or nearly
sorted). Many PDFs will already stream text in the order of sorting. In many
use-cases, QuickSort is among the worst-performing sorting algorithms for this
particular task.
This is why java's native sort on Collections (using TimSort - which assumes a
high frequency of pre-sorted lists in real-world data) is ideal for many PDF
text extraction sorting scenarios, and is a shame it is not being used. This
brings me to the next part of the improvement proposal:
2)
As the TextPositionComparator is not transitive, the current implementation to
avoid JDK7+ enforcing transitivity, and throwing the "Comparison method
violates its general contract" IllegalArgumentException is to check for java
version, and if lower than 7, then ALWAYS use QuickSort.
As TimSort often "approaches" O(n) for many PDF extraction use-cases, and as
TextPositionComparator will not always cause an IllegalArgumentException during
many sorting tasks, I propose to use java's native sort (which is relatively
cheap), and catch the IllegalArgumentException (in many cases will be thrown
early - not late), and then use whichever preferred "legacy" sorting algorithm
on that failure.
Here is my implementation to show what I mean:
{code:java}
import java.util.Collections;
import java.util.Comparator;
import java.util.List;
public class SortUtils {
public static <T> void sort(List<T> list, Comparator<T> comparator) {
try{
Collections.sort(list, comparator);
}catch(java.lang.IllegalArgumentException e) {
MergeSort.sort(list, comparator);
}
}
}
{code}
This approach (even when used on jdk7+) has a significant impact on sorting
costs on real-world PDFs (even with the occasional "double-attempt").
I am also happy to share the MergeSort implementation I am using (which follows
the sort contract for generics). It is iterative (not recursive). It is
borrowed from here:
[https://github.com/vlab-cs-ucsb/cashew/blob/master/jpf-security/src/examples/benchmarks/MergeSortIterative.java]
It has some slight tweaks (uses System.arraycopy, etc,) and is "generic-ified".
Please just let me know.
I am also curious to know any reasons (which I may have missed), why QuickSort
is the preferred sort (if the fact that many lists are mostly pre-sorted is
already known and considered)?
Thanks!
--
This message was sent by Atlassian Jira
(v8.3.4#803005)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]