Hi,
Sorry for the long silence. After checking the source,
I found the following points.
1) poppler-qt4 page object issue
In Page::getText() method, poppler's TextOutputDev
object is created, and its getText() method is invoked.
In the creation of TextOutputDev, we can tune its
configuration to enable/disable physical layout,
enable/disable raw order mode, etc. I think that when
vertical text is re-laid out for a horizontal text
renderer, the resulting order is logically broken for
MS Office's tricky vertical text.
If I test the TextOutputDev::displayPageSlice() method,
especially with the rawOrder option, the text is not
re-laid out. For MS Office's tricky vertical text,
this is slightly better. However, the displayPageSlice()
method is designed for a FILE stream. It would be useful
if we could pass a memory buffer to be filled by
displayPageSlice(), but such a change requires many
modifications, because displayPageSlice() is a method
shared by all output devices.
# changing TextOutputDev.cc is insufficient, I
# have to change SplashOutputDev.cc, PSOutputDev.cc,
# CairoOutputDev.cc, ArthurOutputDev.cc, ABWOutputDev.cc...
# I cannot test all of them.
On the other hand, getText() is a device-specific method
that exists only in TextOutputDev.cc, so changing getText()
is easier.
2) TextOutputDev::getText() issue
Because most PDF generators do not draw spaces with font
glyphs but simply move the current point, the task of
TextOutputDev is not limited to the objects drawn with fonts.
It also watches the movement of the current point so that it
can insert a space character (U+0020) at the appropriate
position. Thus, TextOutputDev is a layout-aware device,
like the other output devices.
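As a rough illustration of this space insertion (a sketch only, not
poppler's actual code): the Glyph struct, the joinWithSpaces() name and
the 0.1 gap ratio below are all my assumptions for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical glyph record: one character plus the horizontal extent
// at which it was drawn. Not a poppler type.
struct Glyph {
    char ch;
    double xMin;
    double xMax;
};

// Join glyphs in drawing order, inserting U+0020 wherever the gap between
// two consecutive glyphs exceeds gapRatio * fontSize. The default ratio is
// an illustrative assumption, not a value taken from poppler.
std::string joinWithSpaces(const std::vector<Glyph> &glyphs, double fontSize,
                           double gapRatio = 0.1) {
    std::string out;
    for (std::size_t i = 0; i < glyphs.size(); ++i) {
        if (i > 0 && glyphs[i].xMin - glyphs[i - 1].xMax > gapRatio * fontSize) {
            out += ' ';  // the current point jumped: treat it as a word break
        }
        out += glyphs[i].ch;
    }
    return out;
}
```

With glyphs "a", "b" drawn contiguously and "c", "d" drawn after a jump
of the current point, the function yields "ab cd" for a 10pt font.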
TextOutputDev has optional switches for "force physical
layout" and "force raw order" of the internal text processing.
The results of "pdftotext -layout msword2007-vert.pdf -"
and "pdftotext -raw msword2007-vert.pdf -" show the existence
of layout-aware routines in TextOutputDev very clearly.
I think raw-ordered text from MS Office's tricky vertical
text is usable for text search, but physically laid-out
text is not.
2-a) Is re-layout in vertical writing mode required?
We can find several interesting "TODO" comments in
TextOutputDev.cc:
2342 void TextPage::coalesce(GBool physLayout, GBool doHTML) {
...
2535 //----- assemble the blocks
2536
2537 //~ add an outer loop for writing mode (vertical text)
2538
2539 // build blocks for each rotation value
2540 for (rot = 0; rot < 4; ++rot) {
...
2830 //~ need to compute the primary writing mode (horiz/vert) in
2831 //~ addition to primary rotation
...
3316 // build the flows
3317 //~ this needs to be adjusted for writing mode (vertical text)
3318 //~ this also needs to account for right-to-left column ordering
3319 flow = NULL;
3320 while (flows) {
3321 flow = flows;
3322 flows = flows->next;
3323 delete flow;
3324 }
3325 flows = lastFlow = NULL;
3326 // assume blocks are already in reading order,
3327 // and construct flows accordingly.
...
3589 GooString *TextPage::getText(double xMin, double yMin,
3590 double xMax, double yMax) {
...
3632 //~ writing mode (horiz/vert)
3633
3634 // collect the line fragments that are in the rectangle
...
4651 void TextPage::dump(void *outputStream, TextOutputFunc outputFunc,
4652 GBool physLayout) {
...
4689 //~ writing mode (horiz/vert)
4690
4691 // output the page in raw (content stream) order
4692 if (rawOrder) {
...
From the comments, the authors of TextOutputDev.cc seem to
be aware that the current layout analysis is specific to
horizontal text. I think this is homework for CJK people,
but right now I don't have enough time to work on this issue fully.
# we can also find a few comments about right-to-left scripts.
But if we restrict our scope to text search in PDF,
I think raw-ordered extraction can work in most cases.
2-b) getText() for rawOrder TextOutputDev?
As I wrote above, the default mode or the rawOrder mode
of pdftotext is useful for MS Office's tricky vertical text.
The rawOrder mode can be specified when the TextOutputDev object
is created. But when I create a TextOutputDev object in
poppler-qt4 to extract raw-ordered text, TextOutputDev::getText()
returns an empty string. Oops. This is the designed behaviour of
TextOutputDev::getText(). You can find the following lines in
TextOutputDev.cc.
3589 GooString *TextPage::getText(double xMin, double yMin,
3590 double xMax, double yMax) {
...
3605
3606 s = new GooString();
3607
3608 if (rawOrder) {
3609 return s;
3610 }
I'm still not sure why the rawOrder case is discarded. As an
experiment, I wrote rawOrder text extraction code as follows:
diff --git a/poppler/TextOutputDev.cc b/poppler/TextOutputDev.cc
index f244639..1803629 100644
--- a/poppler/TextOutputDev.cc
+++ b/poppler/TextOutputDev.cc
@@ -3702,10 +3702,6 @@ GooString *TextPage::getText(double xMin, double yMin,
s = new GooString();
- if (rawOrder) {
- return s;
- }
-
// get the output encoding
if (!(uMap = globalParams->getTextEncoding())) {
return s;
@@ -3726,6 +3722,23 @@ GooString *TextPage::getText(double xMin, double yMin,
break;
}
+ if (rawOrder) {
+ TextWord* word;
+ for (word = rawWords; word && word <= rawLastWord; word = word->next) {
+ for (j = 0; j < word->getLength(); ++j) {
+ double gXMin, gXMax, gYMin, gYMax;
+ word->getCharBBox(j, &gXMin, &gYMin, &gXMax, &gYMax);
+ if (xMin <= gXMin && gXMax <= xMax && yMin <= gYMin && gYMax <= yMax)
+ {
+ char mbc[16]; /* XXX: uMap should know the limit !*/
+ int mbc_len = uMap->mapUnicode( *(word->getChar(j)), mbc, sizeof(mbc) );
+ s->append(mbc, mbc_len);
+ }
+ }
+ }
+ return s;
+ }
+
//~ writing mode (horiz/vert)
// collect the line fragments that are in the rectangle
Now TextOutputDev::getText() can extract the text from
TextOutputDev object in rawOrdered mode.
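The core of the experiment, stripped of poppler types, can be sketched
as below. Box, RawWord and extractRaw() are hypothetical stand-ins, not
the real TextWord API. One difference from the patch: the patch bounds
its loop with "word <= rawLastWord", while this sketch simply stops at
the null next pointer, on the assumption that the raw word list is
null-terminated.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-ins for poppler's word list; not the real classes.
struct Box {
    double xMin, yMin, xMax, yMax;
};

struct RawWord {
    std::string chars;        // one byte per glyph, for simplicity
    std::vector<Box> boxes;   // boxes[i] is the bounding box of chars[i]
    RawWord *next = nullptr;  // next word in content-stream order
};

// True when glyph box g lies entirely inside the query rectangle r.
bool contains(const Box &r, const Box &g) {
    return r.xMin <= g.xMin && g.xMax <= r.xMax &&
           r.yMin <= g.yMin && g.yMax <= r.yMax;
}

// Walk the list in drawing order and keep the glyphs that fall inside r.
// The loop ends on the null next pointer, so it does not depend on the
// relative addresses of the list nodes.
std::string extractRaw(const RawWord *first, const Box &r) {
    std::string out;
    for (const RawWord *w = first; w != nullptr; w = w->next) {
        for (std::size_t i = 0; i < w->chars.size(); ++i) {
            if (contains(r, w->boxes[i])) {
                out += w->chars[i];
            }
        }
    }
    return out;
}
```

No spaces are emitted between words here, which matches the behaviour
of the inline patch above.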
2-c) Line-joining issue in TextOutputDev::getText()
The raw text in a rawOrder TextOutputDev object has no spaces
between words. Here, "word" means a group of glyphs drawn with
a font without an external shift of the current point. The
attached version of my experimental patch inserts spaces between
words. Inserting spaces between words improves English text, but
has a bad effect on MS Office's tricky vertical text. In that
text, each glyph is drawn after a vertical shift of the current
point, so every word consists of a single glyph.
At present, I have 2 ideas to prevent such bad insertion of
spaces in tricky vertical text.
idea i:
Track the current point and the distance between glyphs,
and determine whether 2 glyphs belong to the same vertical or
horizontal line.
idea ii:
Refer to the Unicode line breaking algorithm and determine
whether a space should be inserted between the glyphs.
- If the codepoints are Latin, the space is inserted.
- If the codepoints are CJK Ideographs, the space is NOT inserted.
- ...
I think idea ii is simple and a good place to start an experiment,
although I am not sure whether it would be acceptable for poppler.
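A minimal sketch of idea ii could look like the following. It does not
implement the Unicode line breaking algorithm; it only checks a small,
hand-picked set of CJK ranges. The function names are hypothetical, and
the rule for a mixed pair (one Latin, one CJK codepoint: no space) is my
assumption, since the list above leaves that case open.

```cpp
#include <cassert>

// Rough test for CJK codepoints. The ranges are a deliberately small
// subset chosen for illustration only.
bool isCjk(char32_t c) {
    return (c >= 0x4E00 && c <= 0x9FFF) ||  // CJK Unified Ideographs
           (c >= 0x3400 && c <= 0x4DBF) ||  // CJK Unified Ideographs Ext. A
           (c >= 0x3040 && c <= 0x309F) ||  // Hiragana
           (c >= 0x30A0 && c <= 0x30FF) ||  // Katakana
           (c >= 0x3000 && c <= 0x303F);    // CJK Symbols and Punctuation
}

// Decide whether U+0020 should be inserted between two adjacent words,
// given the last codepoint of the previous word and the first codepoint
// of the next one. A space is inserted only when neither side is CJK.
bool needsSpaceBetween(char32_t prevLast, char32_t nextFirst) {
    return !isCjk(prevLast) && !isCjk(nextFirst);
}
```

In the rawOrder loop, such a check would replace the unconditional
space that is appended after every word.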
Regards,
mpsuzuki
P.S.
I've attached a patch "20100801a.diff" that
1) extends TextOutputDev::getText() to support rawOrder mode,
2) extends Qt4 Page::text() to take an extra boolean flag for rawOrder,
3) adds a test program for poppler-qt's text extraction.
On Wed, 28 Jul 2010 16:32:20 +0900
[email protected] wrote:
>Hi,
>
>On Wed, 28 Jul 2010 15:04:53 +0800 (CST)
>"cobra.yu" <[email protected]> wrote:
>> Of course, such fake vertical writing mode is unacceptable.
>
>Thanks.
>
>>So, it shows that we can't only count on the wMode of the font
>>information, but must also take the real arrangement of text words on
>>pages into consideration?
>
>Yes, WMode is insufficient. As Deri analyzed, MS Office addin
>draws vertical text by repeating "draw a glyph, move current
>point vertically, draw a glyph...". So, it might be possible
>to detect the text flow direction by tracking the moving of
>current point. But, if our interest is only text search, the
>tracking of current point won't be essential, I think. Maybe
>collecting all glyphs in drawing order is sufficient for text
>search. I will check more detail in poppler-qt4 binding.
>
>Regards,
>mpsuzuki
>
>
>>-----Original message-----
>>From:suzuki toshiya <[email protected]>
>>To:[email protected]
>>Cc:poppler <[email protected]>
>>Date:Wed, 28 Jul 2010 15:18:58 +0900
>>Subject:Re: [poppler] Vertical or horizontal writing?
>>
>>
>>Hi,
>>
>>Please find attached fake vertical text produced by MS Excel
>>2007. Is it acceptable for you to exclude such fake vertical
>>text from your target?
>>
>>If you try to select the text on Adobe Reader, you can find
>>that the order of glyph drawing is horizontal, it is stupid
>>fake from the viewpoint of page rendering language.
>>
>>Regards,
>>mpsuzuki
>>
>>cobra.yu wrote:
>>> Hi,
>>>
>>> The original requirement to detect the direction of text flow is for
>>> "searching". The present "search" function of Poppler::Page is searching
>>> horizontally only. So, for CJK users, I must add one vertical search
>>> function for the vertical writing mode.
>>> I could sort out all the textboxes in every page by (x,y) of the
>>> bounding box to make a vertical-like textbox list, but I encountered a
>>> fundamental problem: If I can't know the exact direction of text flow
>>> first, how could I know when to use vertical or horizontal search?
>>> BTW, I've accomplished the vertical text selection by the same way as
>>> my vertical search right now, but it's rather simpler than searching indeed.
>>>
>>> Cobra
>>>
>>>
>>> -----Original message-----
>>> From:[email protected]
>>> To:Deri James <[email protected]>
>>> Cc:[email protected],[email protected]
>>> Date:Wed, 28 Jul 2010 01:59:40 +0900
>>> Subject:Re: [poppler] Vertical or horizontal writing?
>>>
>>> Dear Deri,
>>>
>>> On Tue, 27 Jul 2010 17:22:14 +0100
>>> Deri James <[email protected]> wrote:
>>>
>>>> When looking at the two PDFs you are using with acroread using the text
>>>> selection tool:-
>>>>
>>>> P1 of 'vert-horiz-ipa-std.pdf' selection caret is drawn horizontally.
>>>> 'msword2010-vert2.pdf' selection caret is drawn vertically.
>>>>
>>>> So, it seems acroread can't detect the vertical text in this file, i.e. it
>>>> is actually horizontal text placed one glyph at a time (apart from 'MS Word
>>>> 2010' which is horizontal text rotated 90 degrees).
>>>>
>>>> The contents of the stream confirms this:-
>>>>
>>>> stream
>>>> /P <</MCID 0/Lang (en-US)>> BDC BT
>>>> /F1 10.56 Tf
>>>> 0.000000001 -1 1 0.000000001 496.54 756.84 Tm
>>>> 0 g
>>>> 0 G
>>>> [(MS)6( )5(W)61(ord)-4( )5(20)10(10)] TJ
>>>> ET
>>>> EMC /P <</MCID 1>> BDC BT
>>>> /F2 10.56 Tf
>>>> 1 0.000000017 -0.000000017 1 495.29 673.7 Tm
>>>> <085B>Tj
>>>> ET
>>>> EMC /P <</MCID 2>> BDC BT
>>>> 1 0.000000017 -0.000000017 1 495.29 663.14 Tm
>>>> <29AA>Tj
>>>>
>>>
>>>
>>>> ...
>>>>
>>>> So this PDF does not have any true vertical text.
>>>>
>>>
>>> Yes, yes, just I've reached exactly same conclusion.
>>> Thank you for checking the content of PDF.
>>>
>>> The PDF generated by the MS Office addin uses the font object
>>> for horizontal writing mode, at least in terms of PDF design. So
>>> text flow detection at the PDF font level does not work
>>> with such PDFs. Higher-level recognition is needed.
>>>
>>> It brings a philosophical question: what is vertical text?
>>> Some people make a vertical series of CJK glyphs by using
>>> a very, very narrow text box; is this wrong vertical text?
>>> If it is not vertical text, why should we distinguish it?
>>> The invalid shape of the punctuation & arrows? Or...
>>>
>>> I have to ask Cobra about what is the original requirement
>>> why the text direction should be detected. Cobra, could
>>> you describe why you needed to detect the direction of
>>> text flow?
>>>
>>> Regards,
>>> mpsuzuki
>>>
>>
>>
>_______________________________________________
>poppler mailing list
>[email protected]
>http://lists.freedesktop.org/mailman/listinfo/poppler
diff --git a/poppler/TextOutputDev.cc b/poppler/TextOutputDev.cc
index f244639..15c0251 100644
--- a/poppler/TextOutputDev.cc
+++ b/poppler/TextOutputDev.cc
@@ -3702,10 +3702,6 @@ GooString *TextPage::getText(double xMin, double yMin,
s = new GooString();
- if (rawOrder) {
- return s;
- }
-
// get the output encoding
if (!(uMap = globalParams->getTextEncoding())) {
return s;
@@ -3726,6 +3722,27 @@ GooString *TextPage::getText(double xMin, double yMin,
break;
}
+ if (rawOrder) {
+ TextWord* word;
+ char mbc[16]; /* XXX: uMap should know the limit !*/
+ int mbc_len;
+
+ for (word = rawWords; word && word <= rawLastWord; word = word->next) {
+ for (j = 0; j < word->getLength(); ++j) {
+ double gXMin, gXMax, gYMin, gYMax;
+ word->getCharBBox(j, &gXMin, &gYMin, &gXMax, &gYMax);
+ if (xMin <= gXMin && gXMax <= xMax && yMin <= gYMin && gYMax <= yMax)
+ {
+ mbc_len = uMap->mapUnicode( *(word->getChar(j)), mbc, sizeof(mbc) );
+ s->append(mbc, mbc_len);
+ }
+ }
+ mbc_len = uMap->mapUnicode( 0x20, mbc, sizeof(mbc) ); /* space between word */
+ s->append(mbc, mbc_len);
+ }
+ return s;
+ }
+
//~ writing mode (horiz/vert)
// collect the line fragments that are in the rectangle
diff --git a/qt4/src/poppler-page.cc b/qt4/src/poppler-page.cc
index ae67b11..a32b56a 100644
--- a/qt4/src/poppler-page.cc
+++ b/qt4/src/poppler-page.cc
@@ -295,14 +295,14 @@ QImage Page::thumbnail() const
return ret;
}
-QString Page::text(const QRectF &r) const
+QString Page::text(const QRectF &r, bool rawOrder) const
{
TextOutputDev *output_dev;
GooString *s;
PDFRectangle *rect;
QString result;
- output_dev = new TextOutputDev(0, gFalse, gFalse, gFalse);
+ output_dev = new TextOutputDev(0, gFalse, rawOrder, gFalse);
m_page->parentDoc->doc->displayPageSlice(output_dev, m_page->index + 1, 72, 72,
0, false, true, false, -1, -1, -1, -1);
if (r.isNull())
@@ -322,6 +322,11 @@ QString Page::text(const QRectF &r) const
return result;
}
+QString Page::text(const QRectF &r) const
+{
+ return this->text(r, FALSE);
+}
+
bool Page::search(const QString &text, double &sLeft, double &sTop, double &sRight, double &sBottom, SearchDirection direction, SearchMode caseSensitive, Rotation rotate) const
{
const QChar * str = text.unicode();
diff --git a/qt4/src/poppler-qt4.h b/qt4/src/poppler-qt4.h
index 5464372..c491ecf 100644
--- a/qt4/src/poppler-qt4.h
+++ b/qt4/src/poppler-qt4.h
@@ -458,7 +458,8 @@ delete it;
with coordinates given in points, i.e., 1/72th of an inch.
If rect is null, all text on the page is given
**/
- QString text(const QRectF &rect) const;
+ QString text(const QRectF &rect, bool rawOrder) const;
+ QString text(const QRectF &rect) const; /* older API, always physLayout */
/**
The starting point for a search
diff --git a/qt4/tests/Makefile.am b/qt4/tests/Makefile.am
index 7bc16d7..244097c 100644
--- a/qt4/tests/Makefile.am
+++ b/qt4/tests/Makefile.am
@@ -21,7 +21,7 @@ SUFFIXES: .moc
noinst_PROGRAMS = test-poppler-qt4 stress-poppler-qt4 \
poppler-fonts test-password-qt4 stress-poppler-dir \
- poppler-attachments
+ poppler-attachments poppler-texts
test_poppler_qt4_SOURCES = \
@@ -46,6 +46,11 @@ poppler_attachments_SOURCES = \
poppler_attachments_LDADD = $(LDADDS)
+poppler_texts_SOURCES = \
+ poppler-texts.cpp
+
+poppler_texts_LDADD = $(LDADDS)
+
stress_poppler_qt4_SOURCES = \
stress-poppler-qt4.cpp
diff --git a/qt4/tests/poppler-texts.cpp b/qt4/tests/poppler-texts.cpp
new file mode 100644
index 0000000..8f81524
--- /dev/null
+++ b/qt4/tests/poppler-texts.cpp
@@ -0,0 +1,54 @@
+#include <QtCore/QCoreApplication>
+#include <QtCore/QDebug>
+
+#include <iostream>
+
+#include <poppler-qt4.h>
+
+int main( int argc, char **argv )
+{
+ QCoreApplication a( argc, argv ); // QApplication required!
+
+ if (!( argc == 2 ))
+ {
+ qWarning() << "usage: poppler-texts filename";
+ exit(1);
+ }
+
+ Poppler::Document *doc = Poppler::Document::load(argv[1]);
+ if (!doc)
+ {
+ qWarning() << "doc not loaded";
+ exit(1);
+ }
+
+ int i;
+ for ( i = 0; i < doc->numPages(); i++ )
+ {
+ int j = 0;
+ std::cout << "*** Page " << i << std::endl;
+ std::cout << std::flush;
+
+ Poppler::Page *page = doc->page(i);
+ QRectF rect = QRectF( 0,
+ 0,
+ (int)doc->page(i)->pageSizeF().width(),
+ (int)doc->page(i)->pageSizeF().height() );
+#if 0
+ QString text = page->text( rect );
+ std::cout << std::flush;
+ QVector<uint> ucs4str = text.toUcs4();
+ for ( j = 0; j < ucs4str.size(); j++ )
+ {
+ std::cout << "<" << std::hex << ucs4str[j] << std::dec << ">" << std::endl;
+ }
+#else
+ QByteArray utf8str = page->text( rect, TRUE ).toUtf8();
+ std::cout << std::flush;
+ for ( j = 0; j < utf8str.size(); j++ )
+ std::cout << utf8str[j];
+ std::cout << std::endl;
+#endif
+ }
+ delete doc;
+}