2009/12/12 Albert Astals Cid <[email protected]>:
> On Wednesday 09 December 2009 23:22:09, Baz wrote:
>> 2009/12/9 Albert Astals Cid <[email protected]>:
>> > On Wednesday 09 December 2009 14:51:59, Baz wrote:
>> >> 2009/12/8 Albert Astals Cid <[email protected]>:
>> >> > What we want is something that makes text extraction/selection better,
>> >> > the definition of better is the problem here :D
>> >>
>> >> Ok. So it sounds like it would be worth adding tests in, so we can be
>> >> explicit about what we want text extraction to do.
>> >>
>> >> I could do this in two ways:
>> >> - write a test harness that calls the APIs directly (following the
>> >> example of cairo). This has the advantage that more APIs could be
>> >> tested later, but complicates writing the tests; and in any case most
>> >> other tests will be about rendering, not text extraction. Since this
>> >> would be a unit test, it's also fragile to API changes.
>> >> - extend pdftotext to allow me to specify start and end points for
>> >> text extraction (page,x,y). This would make writing tests easy - just
>> >> simple shell scripts along the lines of the git test suite. This
>> >> feature could be useful to end users too, I guess.
>> >>
>> >> I like the second plan better, since it supports building ad-hoc tests
>> >> with PDFs attached to bugs. Since we already have -f and -l (and -x,
>> >> -y do something unrelated to the selection), I'm thinking of int args
>> >> -fx, -fy, -lx, -ly, which default to (0,0) and (pageWidth, pageHeight).
>> >
>> > Why isn't x,y,W,H enough? AFAIR they define which area gets extracted.
>>
>> It's not the same area. That mechanism crops every page from start to
>> finish to the same x,y,W,H box before dumping the text. It's useful for
>> removing header/footer sections in a whole-document dump. It also
>> doesn't hit the text selection code at all.
>
> I'm lost now, you originally said pdftotext was using your new code and now
> you say it doesn't?

I am talking about how pdftotext works, whether or not you have my
changes. pdftotext does not use xyWH for *selection* (the way it would
work in evince); it uses it to *crop*. However, the text that it does
output is in the same order as it would be if you selected *all the
text on the cropped page*.

OK, it was misleading to say 'It also doesn't hit the text selection
code at all'. I should have said 'it never passes text selection
coordinates other than those that would select *everything*'. So it
tests reading order, but not the selection points in any meaningful way.

Does this make it clearer?

>
> Albert
>
>>
>> By contrast, a reading-order selection, even on a single page, may
>> include text that lies outside the rectangle from the startpoint to
>> the endpoint. Also, the xyWH mechanism applies the start/end points to
>> every page, instead of only the start/end page as you would with a
>> selection.
>>
>> -Baz
>>
>> > Albert
>> >
>> >> Does this sound useful to you?
>> >>
>> >> -Baz
_______________________________________________
poppler mailing list
[email protected]
http://lists.freedesktop.org/mailman/listinfo/poppler
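
To make the crop vs. selection distinction concrete, here is a toy sketch in Python. It is not poppler code: the `Word` type, the `crop` and `select` helpers, and the naive top-to-bottom, left-to-right reading order are all assumptions made up for illustration (poppler's real reading order handles columns and flows). It only shows why a reading-order selection between two points can pick up words that lie outside the rectangle spanned by those points, while a crop to that rectangle cannot.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Word:
    """Hypothetical stand-in for a positioned word on a page."""
    text: str
    x: float
    y: float  # y grows downwards in this toy model


def reading_order(words):
    # Naive assumption: read top-to-bottom, then left-to-right.
    return sorted(words, key=lambda w: (w.y, w.x))


def crop(words, x, y, w, h):
    """Keep only words whose origin lies inside the x,y,W,H box,
    then dump them in reading order (what a crop-style dump does)."""
    inside = [wd for wd in words if x <= wd.x < x + w and y <= wd.y < y + h]
    return [wd.text for wd in reading_order(inside)]


def select(words, start, end):
    """Reading-order selection from point `start` to point `end`,
    each given as (x, y): everything between them in reading order."""
    lo = (start[1], start[0])
    hi = (end[1], end[0])
    return [wd.text for wd in reading_order(words) if lo <= (wd.y, wd.x) <= hi]


# A three-line toy page.
PAGE = [
    Word("The", 0, 0), Word("quick", 50, 0), Word("brown", 100, 0),
    Word("fox", 0, 10), Word("jumps", 50, 10), Word("over", 100, 10),
    Word("the", 0, 20), Word("lazy", 50, 20), Word("dog", 100, 20),
]

if __name__ == "__main__":
    # Crop to the rectangle spanned by (50, 0) and (100, 20).
    print(crop(PAGE, 50, 0, 51, 21))
    # Select from (50, 0) to (100, 20) in reading order.
    print(select(PAGE, (50, 0), (100, 20)))
```

In this model the crop drops "fox" and "the" because they sit to the left of the box, whereas the selection keeps them because they fall between the start and end points in reading order.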
