Re: [MacPerl] Pdf /Titles

Alan Fry Tue, 28 May 2002 02:33:45 -0700

At 3:49 pm +0200 27/05/02, Bart Lateur wrote:
>Sorry for the late reply. Actually, no, I'm not sorry, I've been away
>for a few weeks, so it's actually not my fault. :-)

Bart's note has finally turned up here.

>>Problem 2:
>>index() will also find blocks which look like the right one
>>but are really the wrong objects ("14 0 obj", "4 0 obj").
>
>Then use a regex. No need to use pos() any more to find out where the
>match starts, $-[0] can tell you where it is. $info_block should contain a
>number, shouldn't it?
>
>       my $info_start = $str =~ /\b$info_block 0 obj\b/ && $-[0];
>
>I think this will even be plug-in compatible with your original
>solution.

In fact you do not know where the 'info_block' is -- it can be almost 
anywhere in the file. The 'classical' method is to look up its byte 
offset in the cross-reference (xref) table. However this becomes quite 
a convoluted process if the PDF file has been 'linearised'. The first 
version of the script used direct look-up, and that was abandoned 
because of the complications with linearised files.

The second try used a regex. There are three difficulties with that. 
In the first place you have to use $` to get the starting point of 
the 'info-block', which is not very nice. Secondly you don't know 
where the end of the 'info_block' is, so you have to make a guess as 
to how much to fetch in order to be sure you have included the line 
beginning '/Title'. The third difficulty is avoiding a false match 
(of the kind Axel mentions). In the ordinary run of events one would 
match to /^14 0 obj/. But here you have to be careful because the PDF 
file can have any of the three line-endings. I think a regex of the 
kind /[\012\015]14 0 obj/ would probably be water-tight, but I haven't 
tried it.
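
A minimal sketch of that regex approach, for illustration only. The 
sub name find_obj_by_regex and the sample string are made up here and 
are not from the script. It uses $` for the offset so that it does not 
depend on @-, and a character class so that CR, LF and CRLF files are 
all treated alike.

```perl
use strict;

# Hypothetical helper: return the offset of "$n 0 obj" when it starts a
# line, whatever the line-ending convention, or -1 if it is not found.
sub find_obj_by_regex {
    my ($str, $n) = @_;
    # [\012\015] means "preceded by LF or CR"; the trailing \b keeps
    # "14 0 obj" from matching a longer word such as "14 0 object".
    if ($str =~ /[\012\015]$n 0 obj\b/) {
        return length($`) + 1;    # + 1 skips the line-break character
    }
    return -1;
}

# Made-up sample with CR line-endings and a decoy object "114 0 obj".
my $pdf = "%PDF-1.3\015"
        . "114 0 obj\015<< /Length 0 >>\015endobj\015"
        . "14 0 obj\015<< /Title (Test) >>\015endobj\015";

my $pos = find_obj_by_regex($pdf, 14);
print substr($pdf, $pos, 8), "\n" if $pos >= 0;
```

Note that this still leaves the second difficulty untouched: it finds 
the start of the block but says nothing about where it ends.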

Most of the problems disappear using index() to find the positions of 
the start and end of the 'info_block'. Admittedly having found a 
candidate you have to look back to make sure the preceding character 
is a line-break of some sort, which involves a loop of the kind Axel 
suggested. However this is a very economical solution since nine 
times out of ten the first candidate will be the right one. Even if 
the first candidate fails, the second candidate is almost bound to 
succeed. The only time lost, so to speak, is that taken to look 
backwards, which is pretty negligible.
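
Roughly, the index() approach can be sketched as below. This is an 
illustration under assumptions, not the script itself: the sub names 
and the sample string are invented, and it assumes the end of the 
block is marked by the next "endobj" (the PDF object terminator), 
which would be fooled by a title that itself contains that word.

```perl
use strict;

# Hypothetical helper: find "$n 0 obj" with index(), accepting a
# candidate only if the preceding character is CR or LF (or the
# candidate sits at offset 0).
sub find_obj_by_index {
    my ($str, $n) = @_;
    my $tag = "$n 0 obj";
    my $pos = index($str, $tag);
    while ($pos >= 0) {
        my $prev = $pos > 0 ? substr($str, $pos - 1, 1) : "\012";
        return $pos if $prev eq "\012" || $prev eq "\015";
        $pos = index($str, $tag, $pos + 1);   # false candidate, try next
    }
    return -1;
}

# Hypothetical helper: return the text from the start of the object up
# to and including "endobj", or undef if either end cannot be found.
sub fetch_obj {
    my ($str, $n) = @_;
    my $start = find_obj_by_index($str, $n);
    return if $start < 0;
    my $end = index($str, "endobj", $start);
    return if $end < 0;
    return substr($str, $start, $end - $start + length("endobj"));
}

# Made-up sample: the first candidate for "14 0 obj" lies inside
# "114 0 obj" and is rejected by the look-back test.
my $pdf = "%PDF-1.3\015"
        . "114 0 obj\015<< /Length 0 >>\015endobj\015"
        . "14 0 obj\015<< /Title (Test) >>\015endobj\015";

my $block = fetch_obj($pdf, 14);
print "found info block\n" if defined $block && $block =~ m{/Title};
```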

At 4:09 pm -0400 27/05/02, Chris Nandor wrote:
>Sorry!  Space at the end of the filename.  ":file.txt " <-- space here.

Grrr. No, my fault. I looked at that for an age and missed it. Gray 
cell deficit rather than eyesight I think.

At 4:08 pm -0400 27/05/02, Ronald J Kimball wrote:
>>>The greater danger with C< open F, $f > is that the filename might begin
>>>with a ">" or somesuch.  Both three-arg open, and the method above with
>>>"\0", solve both problems; but the latter method works in any version of
>>>perl.  I am not a big fan of three-arg open, but I have to admit it looks a
>>>lot nicer.  :-)
>
>Note that this solution does not work with leading spaces, however.  You
>have to use sysopen or the new three-arg open to handle those.

So I suppose for the time being, until all the world has updated to 
5.6.x, one should use 'sysopen()' as a matter of course?
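
For reference, a sketch of what the sysopen() route might look like. 
The sub name open_literal is invented for the example. It assumes the 
Fcntl and Symbol modules are available, and it uses Symbol::gensym 
rather than a lexical filehandle so as not to rely on 5.6 features.

```perl
use strict;
use Fcntl qw(O_RDONLY);
use Symbol qw(gensym);

# Hypothetical helper: open a file read-only, passing the name through
# untouched. sysopen() does not interpret a leading ">" or "|" and does
# not strip leading or trailing spaces, unlike two-arg open().
sub open_literal {
    my ($name) = @_;
    my $fh = gensym;    # anonymous glob, works on older perls
    sysopen($fh, $name, O_RDONLY)
        or die "Cannot open '$name': $!";
    return $fh;
}
```

It would be called as my $fh = open_literal(":file.txt "); and read 
from with <$fh> in the usual way.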

Alan Fry
