Re: [Groff] pdfmom grep (was parallel text processing)

Steffen Nurpmeso Fri, 08 Sep 2017 11:08:44 -0700

Peter Schaffter <[email protected]> wrote:
 |On Fri, Sep 08, 2017, Ralph Corderoy wrote:
 |>> You'll notice that the top of the pdf file has a line of text spit out
 |>> by grep(1) that obviously shouldn't be there.
 |> 
 |> I couldn't come up with the groff 1.22.3-7 command line required to
 |> build the PDF correctly, nor get grep's unwanted output.  Deri suggested
 |> pdfmom's grep might be the culprit, but its stderr should end up on
 |> pdfmom's stderr?
 |
 |Problem solved.
 |
 |The superfluous line at the top of the file ["Binary file (standard
 |input) matches"] isn't stderr, it's stdout, so it becomes part of
 |the pipeline.  The grep in pdfmom is returning a binary file hit when
 |it encounters the diacritic in 
 |
 |  .ds pdf:look(pdf:bm1) L'étranger
 |
 |Since the binary file hit doesn't begin with .ds, it prints literally
 |at the top of the file.
 |
 |The solution is to pass the -a flag to grep.
 |
 |Deri: do you want me to fix this in pdfmom and push the change, or
 |would you prefer to do it yourself?
 |
 |Question: why does grep treat the presence of the diacritic as cause
 |for saying "Binary file (standard input) matches"?


Likely because, in your locale, that is true?  It is very likely
that this cannot work in general (I see that -k could possibly be
given): suppose you are in a LATIN1 locale and process UTF-8; and it
is even worse when your own locale is pickier than LATIN1.  It
strikes me that this should be split up so that perl itself performs
the grep, in charset-agnostic mode.  Even very large documents
should pose no limit here; otherwise it is no problem to create the
two pipelines concurrently ...
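
To illustrate what "charset-agnostic" could mean here, a minimal
sketch, written in Python rather than perl purely for illustration.
It assumes the filter only has to keep lines that begin with a .ds
request; the pattern, the helper names grep_bytes and is_valid_utf8,
and the sample lines are hypothetical and are not taken from pdfmom.
The point is only that matching on raw bytes never has to decide
whether the input is valid text in the current locale.

```python
import re

# Hypothetical pattern (an assumption, not pdfmom's actual one):
# keep lines that start with a ".ds" request.  It is compiled from
# a bytes literal, so matching never decodes the input.
PATTERN = re.compile(rb"^\.ds ")


def grep_bytes(lines, pattern=PATTERN):
    """Return the matching lines as bytes, without decoding them."""
    return [line for line in lines if pattern.search(line)]


def is_valid_utf8(data):
    """Report whether the bytes decode cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# The same bookmark definition in two encodings, plus one line
# that the filter should drop.
text = ".ds pdf:look(pdf:bm1) L'étranger\n"
latin1_line = text.encode("latin-1")
utf8_line = text.encode("utf-8")
other_line = b".PP\n"

matches = grep_bytes([latin1_line, utf8_line, other_line])
```

Both encoded variants are kept, although the LATIN1 one is not
valid UTF-8 (is_valid_utf8 returns False for it); that kind of
mismatch between the data and the assumed encoding is what can make
a locale-aware tool give up on treating its input as text.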

--steffen
|
|Der Kragenbaer,                The moon bear,
|der holt sich munter           he cheerfully and one by one
|einen nach dem anderen runter  wa.ks himself off
|(By Robert Gernhardt)
