On Thu, May 07, 2009 at 14:24:30 +0200, jema...@gnu.org wrote:
>
> Hi Michael.
>
> Based on your patch I just committed some early work on the API of
> the tokeniser module. You can find it in the reference manual
> (doc/gnupdf.texi).
>
> I would like to discuss it if you like.
Sure, my comments are below.

> As it is now the prefix used for the module is (pdf_token*). The
> tokeniser module would provide support for:
>
> - Read PDF tokens from a reading stream.
> - Write PDF tokens into a writing stream.
>
> The data types provided by the module would be:
>
> - pdf_token_t
>
>   Data type representing a typed PDF token. Each pdf_token_t can
>   have a number of attributes that may influence how the token is
>   written (such as the use of the hex representation for strings).
>
>   NOTE: an alternative would be to use flags to pdf_token_write().
> ...

Are hex strings the only use case for attributes? Generally there's
not much choice about how to write a token, except for deciding which
string (or name) characters to escape, when to use hex strings, etc.

I'm thinking that the "hex" attribute would make more sense as a flag
for pdf_token_write (if it's needed at all) -- it's not really an
attribute of a string, just an arbitrary decision about how to write
the string. In normal operation this should be decided automatically,
maybe based on a policy configured for the stream writer (e.g., "avoid
8-bit characters").

> The module would provide several sets of functions:
>
> - Functions to create/destroy readers and writers
>
>     pdf_token_reader_new (stm, &reader)
>     pdf_token_writer_new (stm, &writer)
>     pdf_token_reader_destroy (reader)
>     pdf_token_writer_destroy (writer)
> ...

The document misnames the _destroy functions as
pdf_tokeniser_reader_destroy and pdf_tokeniser_writer_destroy.

> - Functions for reading and writing streams
>
>     pdf_token_read (reader, &token)
>     pdf_token_write (writer, token)

These functions (and _new, _destroy) seem reasonable, but we should
determine what happens when pdf_token_write writes a partial token to
the stream (i.e., does the user have to call the function again with
the same token? And who keeps track of the position?).
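To make the partial-write question concrete, here is a minimal
self-contained sketch of one possible answer. All names here
(toy_writer_t, toy_token_write, PDF_EAGAIN, the buffer sizes) are
illustrative stand-ins, not the proposed gnupdf API: the idea is that
the writer object itself remembers how far into the current token it
got, so the caller just retries with the same token.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef enum { PDF_OK = 0, PDF_EAGAIN = 1 } toy_status_t;

typedef struct
{
  char buf[256];      /* stand-in for the underlying write stream */
  size_t buf_used;
  size_t stream_cap;  /* bytes the "stream" accepts per call */
  size_t tok_pos;     /* progress inside the token being written */
} toy_writer_t;

/* Write as much of `tok` as the stream will take.  On PDF_EAGAIN the
 * caller must call again with the SAME token; the writer keeps the
 * position, so the caller tracks nothing. */
static toy_status_t
toy_token_write (toy_writer_t *w, const char *tok)
{
  size_t len = strlen (tok);
  size_t room = w->stream_cap;

  while (w->tok_pos < len && room > 0)
    {
      w->buf[w->buf_used++] = tok[w->tok_pos++];
      room--;
    }

  if (w->tok_pos < len)
    return PDF_EAGAIN;  /* partial write; position kept in the writer */

  w->tok_pos = 0;       /* token finished; ready for the next token */
  return PDF_OK;
}
```

A caller would then loop: `while (toy_token_write (&w, "/Name") ==
PDF_EAGAIN) ;` -- i.e., "retry with the same token" becomes the whole
contract, at the cost of the writer carrying extra state.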
> - Functions to manipulate token variables
>
>     type = pdf_token_get_type (token)
>     pdf_token_set_type (token, type)
>     pdf_token_set_attribute (token, attribute)
> ...

In the example for pdf_token_get_type, the return value of
pdf_token_read should be checked (if it failed, _get_type would get an
uninitialised pointer and probably crash).

_set_type doesn't make sense. Constructors should be defined for each
type; based on the changes I made for pdf-obj.h previously, these
could be used:

  pdf_status_t pdf_token_integer_new (int value, pdf_token_t *obj);
  pdf_status_t pdf_token_real_new (pdf_real_t value, pdf_token_t *obj);
  pdf_status_t pdf_token_string_new (const pdf_char_t *value,
                                     pdf_size_t size, pdf_token_t *obj);
  pdf_status_t pdf_token_name_new (const pdf_char_t *value,
                                   pdf_size_t size, pdf_token_t *obj);

  /* _valueless_new is for {DICT,ARRAY,PROC}_{START,END} tokens */
  pdf_status_t pdf_token_valueless_new (pdf_token_type_t type,
                                        pdf_token_t *obj);

  pdf_status_t pdf_token_comment_new (const pdf_char_t *value,
                                      pdf_size_t size,
                                      pdf_bool_t continuation,
                                      pdf_token_t *obj);
  pdf_status_t pdf_token_keyword_new (const pdf_char_t *value,
                                      pdf_size_t size,
                                      pdf_token_t *obj);

  /* is _dup needed? */
  pdf_status_t pdf_token_dup (const pdf_token_t obj, pdf_token_t *new);

Attribute accessors are also needed, and can be based on pdf-obj.h
too.

There were two additional tokeniser functions in my patch, intended
for dealing with streams (and I think they'll still be needed):

  /* Advance to the first byte of a stream; see PDF32000 7.3.8.1
   * note 2 (call this after reading the "stream" keyword) */
  pdf_status_t pdf_tokeniser_end_at_stream (pdf_tokeniser_t tokr);

  /* Reset the state (e.g., after seeking past a stream) */
  pdf_status_t pdf_tokeniser_reset_state (pdf_tokeniser_t tokr);

(I can document all these extra functions if no changes are needed.)

> The idea of this module is to make it independent from the parser
> that will be implemented in the object layer.
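As an illustration of why constructors make _set_type unnecessary,
here is a toy sketch of the constructor-per-type pattern (all names --
toy_token_integer_new, TOY_OK, etc. -- are illustrative stand-ins, not
the proposed gnupdf signatures): the type is fixed at creation time
and only type-appropriate accessors exist.

```c
#include <assert.h>
#include <stdlib.h>

typedef enum { TOY_OK = 0, TOY_ENOMEM = 1 } toy_status_t;
typedef enum { TOY_TOKEN_INTEGER, TOY_TOKEN_REAL } toy_token_type_t;

typedef struct toy_token_s
{
  toy_token_type_t type;                /* fixed at creation time */
  union { int integer; double real; } value;
} *toy_token_t;

/* Status-returning constructor with an out-parameter, mirroring the
 * pdf_status_t / pdf_token_t * shape proposed above. */
static toy_status_t
toy_token_integer_new (int value, toy_token_t *obj)
{
  toy_token_t tok = malloc (sizeof *tok);
  if (tok == NULL)
    return TOY_ENOMEM;
  tok->type = TOY_TOKEN_INTEGER;
  tok->value.integer = value;
  *obj = tok;
  return TOY_OK;
}

static toy_token_type_t
toy_token_get_type (toy_token_t tok)
{
  return tok->type;
}

static int
toy_token_get_integer (toy_token_t tok)
{
  assert (tok->type == TOY_TOKEN_INTEGER);  /* type can never change */
  return tok->value.integer;
}
```

Since no code path can alter `type` after the constructor runs, every
token is well-formed by construction -- which is exactly the guarantee
a mutable _set_type would throw away.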
> Also, it will be used by the type 4 functions implementation in
> pdf-fp-func.[ch]. It would be quite useful for the user, also.
>
> At this point it is critical to identify the needed token types.

They're listed in the original patch (enum pdf_token_type_e):

  WSPACE       (not needed, but may be useful to someone)
  COMMENT      (for handling "%PDF-" headers, "%%EOF" footers, etc.)
  KEYWORD      (any alphanumeric string not matching another token
                type; includes "null", content stream ops, etc.)
  INTEGER
  REAL
  NAME         (starts with "/")
  STRING
  DICT_START   ("<<")
  DICT_END     (">>")
  ARRAY_START  ("[")
  ARRAY_END    ("]")
  PROC_START   ("{", for type 4 functions)
  PROC_END     ("}")

Your list included BOOLEAN and NULL types, but these should be
tokenised as type KEYWORD. The parser will convert them to the proper
object types.

INDIRECT was also listed. Should this be REF? That would be tokenised
as INTEGER, INTEGER, KEYWORD ("R").

--
Michael
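[The KEYWORD fallback discussed above can be made concrete with a toy
lexeme classifier. This is an illustrative sketch only, not tokeniser
code from the patch: note how "true", "false", "null" and "R" all come
out as KEYWORD, so the indirect reference "1 2 R" tokenises as
INTEGER, INTEGER, KEYWORD, leaving the upgrade to the parser.]

```c
#include <assert.h>
#include <ctype.h>
#include <string.h>

typedef enum
{
  TOK_KEYWORD, TOK_INTEGER, TOK_REAL, TOK_NAME,
  TOK_DICT_START, TOK_DICT_END, TOK_ARRAY_START, TOK_ARRAY_END,
  TOK_PROC_START, TOK_PROC_END
} toy_token_type_t;

/* Classify one already-delimited lexeme (comments, strings and
 * whitespace are assumed to be handled before this point). */
static toy_token_type_t
classify (const char *lex)
{
  if (strcmp (lex, "<<") == 0) return TOK_DICT_START;
  if (strcmp (lex, ">>") == 0) return TOK_DICT_END;
  if (strcmp (lex, "[") == 0)  return TOK_ARRAY_START;
  if (strcmp (lex, "]") == 0)  return TOK_ARRAY_END;
  if (strcmp (lex, "{") == 0)  return TOK_PROC_START;
  if (strcmp (lex, "}") == 0)  return TOK_PROC_END;

  if (lex[0] == '/')
    return TOK_NAME;

  /* Numeric: optional sign, digits, at most one '.'  */
  {
    const char *p = lex;
    int digits = 0, dots = 0;
    if (*p == '+' || *p == '-')
      p++;
    for (; *p != '\0'; p++)
      {
        if (isdigit ((unsigned char) *p)) digits++;
        else if (*p == '.') dots++;
        else { digits = 0; break; }
      }
    if (digits > 0 && dots == 0) return TOK_INTEGER;
    if (digits > 0 && dots == 1) return TOK_REAL;
  }

  /* Everything else -- including "true", "false", "null" and "R" --
   * is a KEYWORD; the parser converts these to proper objects. */
  return TOK_KEYWORD;
}
```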