I won't comment on the patch itself, but I will make two comments.

1) Your assumptions about how Acrobat/Reader work are incorrect.
2) You should consider taking PDF structure/tagging into account when present.

Leonard

-----Original Message-----
From: [email protected] 
[mailto:[email protected]] On Behalf Of 
Daniel Garcia Moreno
Sent: Monday, September 06, 2010 4:31 AM
To: [email protected]
Subject: [poppler] New selection algorithm

Poppler does not select tables in "order". It detects tables as columns, 
because it uses the distance between pieces of text to decide what is a column, 
so tables are selected in column order when the logical order is by rows.

Another selection problem caused by that heuristic appears when you have a PDF 
with closely spaced columns or text with spaces.

I looked at acroread to see how it handles column and table selection, and I 
realized that it selects text in "order", I mean, in the order in which you put 
it in the PDF file. To check that, I created a text PDF file with Inkscape.

So the selection logic is simple: we select the word nearest to the first 
selection point and the word nearest to the last selection point, and every 
word between those two words (in text order, no matter where the words are on 
screen) is selected too.
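
The logic above can be sketched in a few lines. This is only an illustration in 
Python, not the code from the patch: the Word class, the use of the word centre 
for the distance, and the function names are all assumptions made for the 
example.

```python
from dataclasses import dataclass
from math import hypot


@dataclass
class Word:
    """Hypothetical word record: text plus the centre of its bounding box."""
    text: str
    x: float
    y: float


def nearest_index(words, point):
    """Index of the word whose centre is closest to the given (x, y) point."""
    px, py = point
    return min(range(len(words)),
               key=lambda i: hypot(words[i].x - px, words[i].y - py))


def select(words, start_point, end_point):
    """Return every word between the two nearest words, in stored text order."""
    if not words:
        return []
    first = nearest_index(words, start_point)
    last = nearest_index(words, end_point)
    lo, hi = sorted((first, last))  # the user may drag backwards
    return words[lo:hi + 1]


# A two-row table stored in row order in the file.
table = [
    Word("A1", 0, 0), Word("B1", 100, 0),
    Word("A2", 0, 20), Word("B2", 100, 20),
]
selected = select(table, (0, 0), (0, 20))
print([w.text for w in selected])
```

Dragging from the first cell of row one to the first cell of row two picks up 
the whole first row on the way, because the range follows the stored order and 
not the position on screen.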

I have implemented that logic [1] and it seems to work better than the current 
one. I made a video to show the new logic in action [2].

To implement that I use TextWordList in TextPage, and to get that list in the 
right order I create TextOutputDev with rawOrder for selection. I have changed 
that only in the glib frontend, so other frontends may not work correctly.

So the main implementation task is to find the first and the last index in the 
wordlist that delimit the selection, and that is an easy algorithm. For RTL 
documents I reverse the wordlist line by line and adjust the word selection 
index, so the algorithm works with RTL too.
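
One possible reading of the RTL step, again as a Python sketch and not the code 
from the patch: each word is assumed to carry a line number, each line is 
reversed in place, and a map from old index to new index is kept so that a 
selection index can be adjusted. The (line, text) tuples and the function name 
are hypothetical.

```python
from itertools import groupby


def reverse_lines(words):
    """Reverse the words of each line; return the new list and an index map.

    words is a list of (line_number, text) tuples in stored order.
    """
    reordered = []
    index_map = {}  # old index -> new index
    by_line = groupby(enumerate(words), key=lambda item: item[1][0])
    for _, group in by_line:
        for old_index, word in reversed(list(group)):
            index_map[old_index] = len(reordered)
            reordered.append(word)
    return reordered, index_map


words = [(0, "a"), (0, "b"), (0, "c"), (1, "d"), (1, "e")]
reordered, index_map = reverse_lines(words)
print([text for _, text in reordered])
print(index_map[0])
```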

So, what do you think about this new selection algorithm? It seems to work 
better than the current one, and it's simpler, but I don't know whether I have 
forgotten something about selection, or maybe performance...

I attach the patch. It is divided into two commits, and the commit messages 
may not be *correct*.

[1] http://github.com/danigm/poppler/commits/selection
[2] http://www.youtube.com/watch?v=9bRH1yLCs4o
_______________________________________________
poppler mailing list
[email protected]
http://lists.freedesktop.org/mailman/listinfo/poppler