Patrick Dugal
Mon, 1 Mar 1999 13:55:15 -0500
Gilles Detillieux wrote:

> There's still a bit more work to be done. Patrick mentioned that
> pdftotext changed hyphens to spaces.

I don't think I ever said that. What I mentioned is that Ghostscript's
ps2ascii, which accepts PDF as input only when it feels like it,
translates hyphens (-) into spaces. Xpdf's pdftotext leaves the hyphens
in, just as they are in the PDF. This may hinder the results of a
search, but at least it's consistent.

> (Which raises the question: "why can't an external
> parser just pass plain text or HTML to htdig for further parsing?")

Very good question. Intuitively, I thought this was the way it should
work. That way it would be easier to configure, without having to get
into any programming adjustments.

> Some users may also want to extract the titles from their PDFs, as
> Sylvain's code did.

The "title" field stored in a PDF is not as meaningful as one would
like. As far as I know, there is no consistent way to extract the real
title of a document. How does Adobe expect people to be able to index
large numbers of PDFs?

> Anyway, here's Derek's fix for my concatenation problem:
>
> --- xpdf/TextOutputDev.cc.deltax Fri Nov 27 21:42:16 1998
> +++ xpdf/TextOutputDev.cc Thu Feb 25 09:55:28 1999
> @@ -217,6 +217,7 @@ void TextPage::addChar(GfxState *state,
>    double x1, y1, w1, h1;
>
>    state->transform(x, y, &x1, &y1);
> +  dx -= state->getCharSpace();
>    state->transformDelta(dx, dy, &w1, &h1);
>    curStr->addChar(state, x1, y1, w1, h1, c, useASCII7);
>  }

This patch worked! I tested it on "profile_rob_98.txt" and the
output was much better.

Kudos to Derek! Thanks to Gilles for getting in touch.

Pat :)