Re: [htdig] Using pdftotext to index PDF documents

Patrick Dugal Mon, 1 Mar 1999 13:55:15 -0500

Gilles Detillieux wrote:

> There's still a bit more work to be done.  Patrick mentioned that
> pdftotext changed hyphens to spaces.

I don't think I ever said that.  I mentioned that GhostScript's ps2ascii, which takes
PDF as input whenever it feels like it, translates hyphens (-) into spaces.  Xpdf's
pdftotext leaves the hyphens in, just as they are in the PDF.  This may hinder the
results of a search, but at least it's consistent.
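
To make the difference concrete, here is a minimal C++ sketch of what the two
behaviours mean for word splitting.  The names hyphensToSpaces and splitWords are
made up for illustration, and the whitespace splitter is only a stand-in; it is not
how htdig itself breaks text into words.

```cpp
#include <algorithm>
#include <sstream>
#include <string>
#include <vector>

// Mimics the ps2ascii behaviour described above: every '-' becomes a space.
// (Illustrative helper, not part of GhostScript or Xpdf.)
std::string hyphensToSpaces(std::string text) {
    std::replace(text.begin(), text.end(), '-', ' ');
    return text;
}

// Hypothetical whitespace splitter standing in for an indexer's word breaker.
std::vector<std::string> splitWords(const std::string& text) {
    std::istringstream in(text);
    std::vector<std::string> words;
    std::string word;
    while (in >> word) {
        words.push_back(word);
    }
    return words;
}
```

With the hyphens kept (pdftotext style), "re-index the PDFs" splits into three words,
and "re-index" stays as one token.  After hyphensToSpaces (ps2ascii style) the same
text splits into four words, so a search for "re-index" would have to match "re" and
"index" separately.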

> (Which raises the question: "why can't an external
> parser just pass plain text or HTML to htdig for further parsing?")

Very good question.  By intuition, I thought this was the way it should work.  This
way, it would be easier to configure, without having to get into any programming
adjustments.

> Some users may also want to extract the titles from their PDFs, as
> Sylvain's code did.

The "title" field located in a PDF is not as meaningful as one would like.  As far as
I know, there is no consistent way to extract the real title of a document.  How does
Adobe expect people to be able to index large numbers of PDFs?

> Anyway, here's Derek's fix for my concatenation problem:
>
> --- xpdf/TextOutputDev.cc.deltax        Fri Nov 27 21:42:16 1998
> +++ xpdf/TextOutputDev.cc       Thu Feb 25 09:55:28 1999
> @@ -217,6 +217,7 @@ void TextPage::addChar(GfxState *state,
>    double x1, y1, w1, h1;
>
>    state->transform(x, y, &x1, &y1);
> +  dx -= state->getCharSpace();
>    state->transformDelta(dx, dy, &w1, &h1);
>    curStr->addChar(state, x1, y1, w1, h1, c, useASCII7);
>  }

This patch worked!  I tested it on the "profile_rob_98.txt" and the output was
much better.  Kudos to Derek!
Thanks to Gilles for getting in touch.
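
As far as I understand the patch, the advance dx handed to addChar includes the PDF
character spacing, so subtracting it leaves only the glyph's own width and the extra
spacing shows up as a gap between characters.  Here is a minimal sketch of that idea.
The Glyph struct, joinGlyphs and the gap threshold are all hypothetical, and this is
not Xpdf's actual layout code.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical glyph record: start position, advance (which includes any
// character spacing), the character spacing itself, and the character.
struct Glyph {
    double x;
    double advance;
    double charSpace;
    char c;
};

// Joins glyphs into text, inserting a space wherever the gap between the end
// of one glyph and the start of the next exceeds gapThreshold.
// If subtractCharSpace is true, the character spacing is removed from the
// advance first, analogous to "dx -= state->getCharSpace()" in the patch.
std::string joinGlyphs(const std::vector<Glyph>& glyphs,
                       bool subtractCharSpace,
                       double gapThreshold) {
    std::string out;
    double prevEnd = 0.0;
    for (std::size_t i = 0; i < glyphs.size(); ++i) {
        const Glyph& g = glyphs[i];
        double width = g.advance;
        if (subtractCharSpace) {
            width -= g.charSpace;  // keep only the glyph's own width
        }
        if (i > 0 && g.x - prevEnd > gapThreshold) {
            out += ' ';  // visible gap: treat it as a word break
        }
        out += g.c;
        prevEnd = g.x + width;
    }
    return out;
}
```

Take three glyphs 'a', 'b', 'c' at x = 0, 5, 15, where 'b' has an advance of 10 of
which 5 is character spacing.  Without the subtraction 'b' appears to end exactly
where 'c' starts, and the words run together as "abc".  With the subtraction the
5-unit gap is visible and the result is "ab c".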

Pat :)
