On Thu, May 07, 2009 at 14:24:30 +0200, jema...@gnu.org wrote:
>
> Hi Michael.
>
> Based on your patch I just committed some early work on the API of
> the tokeniser module. You can find it in the reference manual
> (doc/gnupdf.texi).
>
> I would like to discuss it if you like.
Sure, my comments are below.

> As it is now the prefix used for the module is (pdf_token*). The
> tokeniser module would provide support for:
>
> - Read PDF tokens from a reading stream.
> - Write PDF tokens into a writing stream.
>
> The data types provided by the module would be:
>
> - pdf_token_t
>
>   Data type representing a typed PDF token. Each pdf_token_t can
>   have a number of attributes that may influence how the token is
>   written (such as the use of the hex representation for strings).
>
>   NOTE: an alternative would be to use flags to pdf_token_write().
> ...

Are hex strings the only use case for attributes? Generally there's
not much choice about how to write a token, except for deciding which
string (or name) characters to escape, when to use hex strings, etc.

I'm thinking that the "hex" attribute would make more sense as a flag
for pdf_token_write (if it's needed at all) -- it's not really an
attribute of a string, just an arbitrary decision about how to write
the string. In normal operation this should be decided automatically,
maybe based on a policy configured for the stream writer (e.g., "avoid
8-bit characters").

> The module would provide several sets of functions:
>
> - Functions to create/destroy readers and writers
>
>     pdf_token_reader_new (stm, &reader)
>     pdf_token_writer_new (stm, &writer)
>     pdf_token_reader_destroy (reader)
>     pdf_token_writer_destroy (writer)
> ...

The document misnames the _destroy functions as
pdf_tokeniser_reader_destroy and pdf_tokeniser_writer_destroy.

> - Functions for reading and writing streams
>
>     pdf_token_read (reader, &token)
>     pdf_token_write (writer, token)

These functions (and _new, _destroy) seem reasonable, but we should
determine what happens when pdf_token_write writes a partial token to
the stream (i.e., does the user have to call the function again with
the same token? And who keeps track of the position?).
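To make the partial-write question concrete, here is a minimal
self-contained sketch of one possible answer. All names here
(toy_writer_t, toy_token_write, PDF_EAGAIN, the buffer sizes) are
illustrative stand-ins, not the proposed gnupdf API: the idea is that
the writer object itself remembers how far into the current token it
got, so the caller just retries with the same token.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef enum { PDF_OK = 0, PDF_EAGAIN = 1 } toy_status_t;

typedef struct
{
  char buf[256];      /* stand-in for the underlying write stream */
  size_t buf_used;
  size_t stream_cap;  /* bytes the "stream" accepts per call */
  size_t tok_pos;     /* progress inside the token being written */
} toy_writer_t;

/* Write as much of `tok` as the stream will take.  On PDF_EAGAIN the
 * caller must call again with the SAME token; the writer keeps the
 * position, so the caller tracks nothing. */
static toy_status_t
toy_token_write (toy_writer_t *w, const char *tok)
{
  size_t len = strlen (tok);
  size_t room = w->stream_cap;

  while (w->tok_pos < len && room > 0)
    {
      w->buf[w->buf_used++] = tok[w->tok_pos++];
      room--;
    }

  if (w->tok_pos < len)
    return PDF_EAGAIN;  /* partial write; position kept in the writer */

  w->tok_pos = 0;       /* token finished; ready for the next token */
  return PDF_OK;
}
```

A caller would then loop: `while (toy_token_write (&w, "/Name") ==
PDF_EAGAIN) ;` -- i.e., "retry with the same token" becomes the whole
contract, at the cost of the writer carrying extra state.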
> - Functions to manipulate token variables
>
>     type = pdf_token_get_type (token)
>     pdf_token_set_type (token, type)
>     pdf_token_set_attribute (token, attribute)
> ...

In the example for pdf_token_get_type, the return value of
pdf_token_read should be checked (if it failed, _get_type would get an
uninitialised pointer and probably crash).

_set_type doesn't make sense. Constructors should be defined for each
type; based on the changes I made for pdf-obj.h previously, these
could be used:

  pdf_status_t pdf_token_integer_new (int value, pdf_token_t *obj);
  pdf_status_t pdf_token_real_new (pdf_real_t value, pdf_token_t *obj);
  pdf_status_t pdf_token_string_new (const pdf_char_t *value,
                                     pdf_size_t size, pdf_token_t *obj);
  pdf_status_t pdf_token_name_new (const pdf_char_t *value,
                                   pdf_size_t size, pdf_token_t *obj);

  /* _valueless_new is for {DICT,ARRAY,PROC}_{START,END} tokens */
  pdf_status_t pdf_token_valueless_new (pdf_token_type_t type,
                                        pdf_token_t *obj);

  pdf_status_t pdf_token_comment_new (const pdf_char_t *value,
                                      pdf_size_t size,
                                      pdf_bool_t continuation,
                                      pdf_token_t *obj);
  pdf_status_t pdf_token_keyword_new (const pdf_char_t *value,
                                      pdf_size_t size,
                                      pdf_token_t *obj);

  /* is _dup needed? */
  pdf_status_t pdf_token_dup (const pdf_token_t obj, pdf_token_t *new);

Attribute accessors are also needed, and can be based on pdf-obj.h
too.

There were two additional tokeniser functions in my patch, intended
for dealing with streams (and I think they'll still be needed):

  /* Advance to the first byte of a stream; see PDF32000 7.3.8.1
   * note 2 (call this after reading the "stream" keyword) */
  pdf_status_t pdf_tokeniser_end_at_stream (pdf_tokeniser_t tokr);

  /* Reset the state (e.g., after seeking past a stream) */
  pdf_status_t pdf_tokeniser_reset_state (pdf_tokeniser_t tokr);

(I can document all these extra functions if no changes are needed.)

> The idea of this module is to make it independent from the parser
> that will be implemented in the object layer.
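As an illustration of why constructors make _set_type unnecessary,
here is a toy sketch of the constructor-per-type pattern (all names --
toy_token_integer_new, TOY_OK, etc. -- are illustrative stand-ins, not
the proposed gnupdf signatures): the type is fixed at creation time
and only type-appropriate accessors exist.

```c
#include <assert.h>
#include <stdlib.h>

typedef enum { TOY_OK = 0, TOY_ENOMEM = 1 } toy_status_t;
typedef enum { TOY_TOKEN_INTEGER, TOY_TOKEN_REAL } toy_token_type_t;

typedef struct toy_token_s
{
  toy_token_type_t type;                /* fixed at creation time */
  union { int integer; double real; } value;
} *toy_token_t;

/* Status-returning constructor with an out-parameter, mirroring the
 * pdf_status_t / pdf_token_t * shape proposed above. */
static toy_status_t
toy_token_integer_new (int value, toy_token_t *obj)
{
  toy_token_t tok = malloc (sizeof *tok);
  if (tok == NULL)
    return TOY_ENOMEM;
  tok->type = TOY_TOKEN_INTEGER;
  tok->value.integer = value;
  *obj = tok;
  return TOY_OK;
}

static toy_token_type_t
toy_token_get_type (toy_token_t tok)
{
  return tok->type;
}

static int
toy_token_get_integer (toy_token_t tok)
{
  assert (tok->type == TOY_TOKEN_INTEGER);  /* type can never change */
  return tok->value.integer;
}
```

Since no code path can alter `type` after the constructor runs, every
token is well-formed by construction -- which is exactly the guarantee
a mutable _set_type would throw away.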
> Also, it will be used by the type 4 functions implementation in
> pdf-fp-func.[ch]. It would be quite useful for the user, also.
>
> At this point it is critical to identify the needed token types.

They're listed in the original patch (enum pdf_token_type_e):

  WSPACE       (not needed, but may be useful to someone)
  COMMENT      (for handling "%PDF-" headers, "%%EOF" footers, etc.)
  KEYWORD      (any alphanumeric string not matching another token
                type; includes "null", content stream ops, etc.)
  INTEGER
  REAL
  NAME         (starts with "/")
  STRING
  DICT_START   ("<<")
  DICT_END     (">>")
  ARRAY_START  ("[")
  ARRAY_END    ("]")
  PROC_START   ("{", for type 4 functions)
  PROC_END     ("}")

Your list included BOOLEAN and NULL types, but these should be
tokenised as type KEYWORD. The parser will convert them to the proper
object types.

INDIRECT was also listed. Should this be REF? That would be tokenised
as INTEGER, INTEGER, KEYWORD ("R").

--
Michael
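[The KEYWORD fallback discussed above can be made concrete with a toy
lexeme classifier. This is an illustrative sketch only, not tokeniser
code from the patch: note how "true", "false", "null" and "R" all come
out as KEYWORD, so the indirect reference "1 2 R" tokenises as
INTEGER, INTEGER, KEYWORD, leaving the upgrade to the parser.]

```c
#include <assert.h>
#include <ctype.h>
#include <string.h>

typedef enum
{
  TOK_KEYWORD, TOK_INTEGER, TOK_REAL, TOK_NAME,
  TOK_DICT_START, TOK_DICT_END, TOK_ARRAY_START, TOK_ARRAY_END,
  TOK_PROC_START, TOK_PROC_END
} toy_token_type_t;

/* Classify one already-delimited lexeme (comments, strings and
 * whitespace are assumed to be handled before this point). */
static toy_token_type_t
classify (const char *lex)
{
  if (strcmp (lex, "<<") == 0) return TOK_DICT_START;
  if (strcmp (lex, ">>") == 0) return TOK_DICT_END;
  if (strcmp (lex, "[") == 0)  return TOK_ARRAY_START;
  if (strcmp (lex, "]") == 0)  return TOK_ARRAY_END;
  if (strcmp (lex, "{") == 0)  return TOK_PROC_START;
  if (strcmp (lex, "}") == 0)  return TOK_PROC_END;

  if (lex[0] == '/')
    return TOK_NAME;

  /* Numeric: optional sign, digits, at most one '.'  */
  {
    const char *p = lex;
    int digits = 0, dots = 0;
    if (*p == '+' || *p == '-')
      p++;
    for (; *p != '\0'; p++)
      {
        if (isdigit ((unsigned char) *p)) digits++;
        else if (*p == '.') dots++;
        else { digits = 0; break; }
      }
    if (digits > 0 && dots == 0) return TOK_INTEGER;
    if (digits > 0 && dots == 1) return TOK_REAL;
  }

  /* Everything else -- including "true", "false", "null" and "R" --
   * is a KEYWORD; the parser converts these to proper objects. */
  return TOK_KEYWORD;
}
```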