[ 
https://issues.apache.org/jira/browse/PDFBOX-5308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17437836#comment-17437836
 ] 

Alistair Oldfield commented on PDFBOX-5308:
-------------------------------------------

I published a gist of my iterative MergeSort here:

[https://gist.github.com/alistairo/dd61a79d6962b9b12bcd900838f31a3f]

Feel free to use it.

I would recommend using the try-catch approach I outlined above, however, as 
only in rare unhappy cases will the TextPositionComparator throw an 
IllegalArgumentException when being used by the very impressive native TimSort 
implementation.

Regarding Array.sort:  Collections.sort eventually uses Array.sort at the 
bottom of all the turtles too. =)

 

> Prefer MergeSort over QuickSort and try native TimSort first (with 
> explanation)
> -------------------------------------------------------------------------------
>
>                 Key: PDFBOX-5308
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-5308
>             Project: PDFBox
>          Issue Type: Improvement
>          Components: Text extraction
>    Affects Versions: 2.0.24
>            Reporter: Alistair Oldfield
>            Priority: Minor
>         Attachments: sort-benchmark-02.txt, sort-benchmark.txt
>
>
> Hello!
> I have 2 related proposals for improvement (so I am logging as one ticket, I 
> am happy to split them if requested, however).
> 1)
> Propose to use an iterative MergeSort implementation over an iterative 
> QuickSort implementation in TextPosition sorting during text extraction.
> The reason: while QuickSort and MergeSort generally perform O(NlogN), and 
> while MergeSort uses slightly more space (O(N)), QuickSort's worst-case 
> performance is O(n2) and this case will, unfortunately, happen in (arguably) 
> very common cases during text extraction: when the list is already sorted (or 
> nearly sorted). Many PDFs will already stream text in the order of sorting. 
> In many use-cases, QuickSort is among the worst-performing sorting algorithms 
> for this particular task.
> This is why java's native sort on Collections (using TimSort - which assumes 
> a high frequency of pre-sorted lists in real-world data) is ideal for many 
> PDF text extraction sorting scenarios, and is a shame it is not being used. 
> This brings me to the next part of the improvement proposal:
> 2) 
> As the TextPositionComparator is not transitive, the current implementation 
> to avoid JDK7+ enforcing transitivity, and throwing the "Comparison method 
> violates its general contract" IllegalArgumentException is to check for java 
> version, and if lower than 7, then ALWAYS use QuickSort.
> As TimSort often "approaches" O(N) for many PDF extraction use-cases, and as 
> TextPositionComparator will not always cause an IllegalArgumentException 
> during many sorting tasks, I propose to use java's native sort (which is 
> relatively cheap), and catch the IllegalArgumentException (in many cases will 
> be thrown early - not late), and then use whichever preferred "legacy" 
> sorting algorithm on that failure.
> Here is my implementation to show what I mean:
>  
>  
> {code:java}
> import java.util.Collections;
> import java.util.Comparator;
> import java.util.List;
> public class SortUtils {
>  public static <T> void sort(List<T> list, Comparator<T> comparator) {
>    try{    
>     Collections.sort(list, comparator);    
>    }catch(java.lang.IllegalArgumentException e) {    
>      MergeSort.sort(list, comparator);    
>   } 
>  }
> }
>  
> {code}
> This approach  (even when used on jdk7+) has a significant impact on sorting 
> costs on real-world PDFs (even with the occasional "double-attempt").
> I am also happy to share the MergeSort implementation I am using (which 
> follows the sort contract for generics). It is iterative (not recursive).  It 
> is borrowed from here:
> [https://github.com/vlab-cs-ucsb/cashew/blob/master/jpf-security/src/examples/benchmarks/MergeSortIterative.java]
>  
> It has some slight tweaks (uses System.arraycopy, etc,) and is 
> "generic-ified". Please just let me know. 
> I am also curious to know any reasons (which I may have missed), why 
> QuickSort is the preferred sort (if the fact that many lists are mostly 
> pre-sorted is already known and considered)?
> Thanks!
>  
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@pdfbox.apache.org
For additional commands, e-mail: dev-h...@pdfbox.apache.org

Reply via email to