[
https://issues.apache.org/jira/browse/PDFBOX-5308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17435638#comment-17435638
]
Alistair Oldfield commented on PDFBOX-5308:
-------------------------------------------
I have noticed when using documents in the wild, that QuickSort is on fire when
running with profiling on. This led me down the sorting rabbit hole where I
made a few changes and have noticed a subjective improvement. When we are
talking about a single document, then the improvement is obviously negligible,
however in larger apps, running in parallel batches, I believe the difference
is real.
In response to your question, I patched together a very quick and dirty
benchmark test. Not exhaustive in any way. I am not measuring time, but rather
comparisons when sorting.
Extending the existing position Comparator:
{code:java}
public class TextPositionCountComparator extends TextPositionComparator {
public int count = 0;
@Override public int compare(TextPosition pos1, TextPosition pos2) { count++;
return super.compare(pos1, pos2); }
}
{code}
then modifying PDFTextStripper.writePage():
{code:java}
// because the TextPositionComparator is not transitive, but
// JDK7+ enforces transitivity on comparators, we need to use
// a custom quicksort implementation (which is slower, unfortunately).
/*
if (useCustomQuickSort)
{
QuickSort.sort(textList, comparator);
}
else
{
Collections.sort(textList, comparator);
}
*/
TextPositionCountComparator cmp1 = new TextPositionCountComparator();
SortUtils.sort(textList, cmp1);
System.out.println("NativeSort: " + cmp1.count);
TextPositionCountComparator cmp2 = new TextPositionCountComparator();
QuickSort.sort(textList, cmp2);
System.out.println("QuickSort: " + cmp2.count);
TextPositionCountComparator cmp3 = new TextPositionCountComparator();
MergeSort.sort(textList, cmp3);
System.out.println("MergeSort: " + cmp3.count);
{code}
Then in the Driver:
{code:java}
String text = stripper.getText(doc);
System.out.println("TEXT LENGTH: "+text.length());
System.out.println("------------------------");{code}
I ran this on 48 PDFs I had handy.
Attached is the console out.
For each page of each doc you'll see something like:
NativeSort: 3264
QuickSort: 35282
MergeSort: 16272
And then at the end of each PDF text extraction something like:
TEXT LENGTH: 33617
------------------------
The attached console out has about 8K lines, but they mostly roughly follow the
pattern above in terms of numbers. [^sort-benchmark.txt]
> Prefer MergeSort over QuickSort and try native TimSort first (with
> explanation)
> -------------------------------------------------------------------------------
>
> Key: PDFBOX-5308
> URL: https://issues.apache.org/jira/browse/PDFBOX-5308
> Project: PDFBox
> Issue Type: Improvement
> Components: Text extraction
> Affects Versions: 2.0.24
> Reporter: Alistair Oldfield
> Priority: Minor
> Attachments: sort-benchmark.txt
>
>
> Hello!
> I have 2 related proposals for improvement (so I am logging as one ticket, I
> am happy to split them if requested, however).
> 1)
> Propose to use an iterative MergeSort implementation over an iterative
> QuickSort implementation in TextPosition sorting during text extraction.
> The reason: while QuickSort and MergeSort generally perform O(NlogN), and
> while MergeSort uses slightly more space (O(N)), QuickSort's worst-case
> performance is O(n2) and this case will, unfortunately, happen in (arguably)
> very common cases during text extraction: when the list is already sorted (or
> nearly sorted). Many PDFs will already stream text in the order of sorting.
> In many use-cases, QuickSort is among the worst-performing sorting algorithms
> for this particular task.
> This is why java's native sort on Collections (using TimSort - which assumes
> a high frequency of pre-sorted lists in real-world data) is ideal for many
> PDF text extraction sorting scenarios, and is a shame it is not being used.
> This brings me to the next part of the improvement proposal:
> 2)
> As the TextPositionComparator is not transitive, the current implementation
> to avoid JDK7+ enforcing transitivity, and throwing the "Comparison method
> violates its general contract" IllegalArgumentException is to check for java
> version, and if lower than 7, then ALWAYS use QuickSort.
> As TimSort often "approaches" O(N) for many PDF extraction use-cases, and as
> TextPositionComparator will not always cause an IllegalArgumentException
> during many sorting tasks, I propose to use java's native sort (which is
> relatively cheap), and catch the IllegalArgumentException (in many cases will
> be thrown early - not late), and then use whichever preferred "legacy"
> sorting algorithm on that failure.
> Here is my implementation to show what I mean:
>
>
> {code:java}
> import java.util.Collections;
> import java.util.Comparator;
> import java.util.List;
> public class SortUtils {
> public static <T> void sort(List<T> list, Comparator<T> comparator) {
> try{
> Collections.sort(list, comparator);
> }catch(java.lang.IllegalArgumentException e) {
> MergeSort.sort(list, comparator);
> }
> }
> }
>
> {code}
> This approach (even when used on jdk7+) has a significant impact on sorting
> costs on real-world PDFs (even with the occasional "double-attempt").
> I am also happy to share the MergeSort implementation I am using (which
> follows the sort contract for generics). It is iterative (not recursive). It
> is borrowed from here:
> [https://github.com/vlab-cs-ucsb/cashew/blob/master/jpf-security/src/examples/benchmarks/MergeSortIterative.java]
>
> It has some slight tweaks (uses System.arraycopy, etc,) and is
> "generic-ified". Please just let me know.
> I am also curious to know any reasons (which I may have missed), why
> QuickSort is the preferred sort (if the fact that many lists are mostly
> pre-sorted is already known and considered)?
> Thanks!
>
>
>
--
This message was sent by Atlassian Jira
(v8.3.4#803005)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]