On Wednesday 09 December 2009 23:22:09, Baz wrote:
> 2009/12/9 Albert Astals Cid <[email protected]>:
> > On Wednesday 09 December 2009 14:51:59, Baz wrote:
> >> 2009/12/8 Albert Astals Cid <[email protected]>:
> >> > What we want is something that makes text extraction/selection better,
> >> > the definition of better is the problem here :D
> >>
> >> Ok. So it sounds like it would be worth adding tests in, so we can be
> >> explicit about what we want text extraction to do.
> >>
> >> I could do this in two ways:
> >> - write a test harness that calls the APIs directly (following the
> >>   example of cairo). This has the advantage that more APIs could be
> >>   tested later, but complicates writing the tests; and in any case most
> >>   other tests will be about rendering, not text extraction. Since this
> >>   would be a unit test, it's also fragile to API changes.
> >> - extend pdftotext to allow me to specify start and end points for
> >>   text extraction (page,x,y). This would make writing tests easy - just
> >>   simple shell scripts along the lines of the git test suite. This
> >>   feature could be useful to end users too, I guess.
> >>
> >> I like the second plan better, since it supports building ad-hoc tests
> >> with PDFs attached to bugs. Since we already have -f and -l (and -x,
> >> -y do something unrelated to the selection), I'm thinking of int args
> >> -fx, -fy, -lx, -ly, which default to (0,0) and (pageWidth, pageHeight).
> >
> > Why isn't x,y,W,H enough? AFAIR they define which area gets extracted.
>
> It's not the same area. That mechanism crops every page from start to
> finish to the same x,y,W,H box before dumping the text. It's useful for
> removing header/footer sections in a whole-document dump. It also
> doesn't hit the text selection code at all.

I'm lost now. You originally said pdftotext was using your new code, and now
you say it doesn't?

Albert

>
> By contrast, a reading-order selection, even on a single page, may
> include text that lies outside the rectangle from the start point to
> the end point. Also, the xyWH mechanism applies the start/end points to
> every page, instead of only the start/end page as you would with a
> selection.
>
> -Baz
>
> > Albert
> >
> >> Does this sound useful to you?
> >>
> >> -Baz
_______________________________________________
poppler mailing list
[email protected]
http://lists.freedesktop.org/mailman/listinfo/poppler
