Re: [pdf-devel] Updated tokeniser/parser patch

jemarch Thu, 22 Jan 2009 11:25:12 -0800


Hi Michael.


   Here's an updated version of my previous patch, which adds a parser in
   addition to the tokeniser, and also makes major changes to pdf-obj.[ch].
   There's still some incomplete stuff (like the pdf_obj_dict_ functions)
   but it would probably be OK to merge once I complete the copyright
   assignment. The next component to add would be a reader that can read
   the xref table and resolve indirect object references.

Before proceeding with the implementation of the object layer code,
we need to complete some other activities.

I estimate that by the end of this weekend I will finish the drafts
for:

- The public API to be offered by the object layer to the client
  applications and the upper layers.

- The overall design of the object layer:

  + What modules are we going to implement, the specific role of each
    module, and how modules collaborate to implement the public
    interface. This is not trivial. Among other things, we will have
    to decide how to implement the garbage collection of indirect
    objects when saving the document, how to manage the creation of
    new stream objects (the Acrobat SDK uses temporary files, for
    example), and many other things. Some decisions in this phase will
    have a direct impact on the implementation of the parser. For
    example: do we want a separate parser for xref tables? Can we use
    the same data type (pdf_obj_t) both for the public interface and
    inside the parser?

    Unlike the base layer (where the modules are fairly independent
    of one another), the object layer requires careful design up
    front to avoid serious design problems and later rewrites.

  + The internal interfaces between the modules, which should be
    carefully documented in the architecture guide and validated
    against the collaboration schema to detect flaws.
 
- Some guidelines about how to document the internal interfaces (the
  modules in the base layer implement public interfaces only) in the
  architecture guide, and other housekeeping.

Then we will have to work intensively (together) on the drafts to get
a coherent and good enough architecture for the layer. After that I
will create the tasks for the implementation of the layer. Following
our general development procedures (see
http://www.gnupdf.org/manuals/gnupdf-hg.html/Development-procedures.html)
the tasks will cover:

- The design of the module-level tests.
- The implementation of the modules.
- The implementation of the tests.

Note also that the existing code in src/object/ is thoroughly
obsolete and useless. I wrote it too quickly, before we started to
follow a more "structured" development method.

So next week we will have a lot of work designing and
brainstorming. I am trying to set up a reasonable base with the
drafts, but a lot of things will be incomplete or simply wrong.

Of course your code for the tokeniser will be quite useful, since I
think that we will be able to use it as-is.

   One annoying thing I noticed was that pdf_list_t needs a heap allocation
   to use an iterator, which means that pdf_obj_equal_p could fail with an
   ENOMEM error (but currently has no way to return that error). It would
   be nice if the iterator could be kept on the stack -- struct
   pdf_list_iterator_s only contains one member anyway.

Gerel, what do you think about this? We would of course lose the
benefits of the opaque pointers in this case, but 'pdf_obj_equal_p'
throwing PDF_ENOMEM sounds quite weird, and I think that we could make an
exception and publish the iterator structure.
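
To make the trade-off concrete, here is a minimal sketch of a
published (non-opaque) iterator that lives on the caller's stack. All
names are hypothetical stand-ins, not the real pdf_list_t API; the
point is only that iteration needs no allocation, so the equality
predicate has no ENOMEM path.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical singly linked list node (not the real pdf_list_t). */
struct demo_node_s
{
  int value;
  struct demo_node_s *next;
};

/* Published iterator structure: a single member, so the caller can
   declare it on the stack instead of asking the library to allocate it. */
struct demo_iter_s
{
  const struct demo_node_s *cur;
};

static void
demo_iter_init (struct demo_iter_s *it, const struct demo_node_s *head)
{
  assert (it != NULL);
  it->cur = head;
}

/* Stores the next value in *VALUE and returns 1, or returns 0 at the
   end of the list. No allocation happens here. */
static int
demo_iter_next (struct demo_iter_s *it, int *value)
{
  if (it->cur == NULL)
    return 0;
  *value = it->cur->value;
  it->cur = it->cur->next;
  return 1;
}

/* Equality predicate built on stack iterators: returns 1 or 0 and
   has no out-of-memory failure mode. */
static int
demo_list_equal_p (const struct demo_node_s *a, const struct demo_node_s *b)
{
  struct demo_iter_s ia, ib;
  int va = 0, vb = 0;

  demo_iter_init (&ia, a);
  demo_iter_init (&ib, b);
  for (;;)
    {
      int has_a = demo_iter_next (&ia, &va);
      int has_b = demo_iter_next (&ib, &vb);

      if (!has_a || !has_b)
        return has_a == has_b;   /* equal only if both ended together */
      if (va != vb)
        return 0;
    }
}
```

The cost is the one already mentioned: once the structure is public,
changing its layout becomes an ABI change.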

   The gnulib list code will also need to be changed to return ENOMEM when
   necessary -- currently it just calls abort() when malloc fails.

Yes, we noticed that. It would be really nice to modify the list
module so that it does not crash if 'xalloc_die' returns NULL (does
nothing). Really, I think that it is the only way to use the list
module in a library.
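
As an illustration of the behaviour a library needs, here is a small
sketch of a list insertion that reports allocation failure to the
caller instead of aborting. The names and the allocator parameter are
assumptions made up for this example; this is not the gnulib list
interface.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical list types, for illustration only. */
struct demo_item_s
{
  int value;
  struct demo_item_s *next;
};

struct demo_list_s
{
  struct demo_item_s *head;
  size_t count;
};

/* Allocator hook, so that an allocation failure can be simulated. */
typedef void *(*demo_alloc_fn) (size_t size);

/* Allocator that always fails, standing in for memory exhaustion. */
static void *
demo_failing_alloc (size_t size)
{
  (void) size;
  return NULL;
}

/* Prepends VALUE to LIST. Returns 0 on success, or ENOMEM if the
   allocation fails; on failure the list is left unchanged and the
   process keeps running. */
static int
demo_list_add_first (struct demo_list_s *list, int value, demo_alloc_fn alloc)
{
  struct demo_item_s *item;

  assert (list != NULL && alloc != NULL);
  item = alloc (sizeof *item);
  if (item == NULL)
    return ENOMEM;

  item->value = value;
  item->next = list->head;
  list->head = item;
  list->count++;
  return 0;
}
```

With this shape the caller decides what to do on ENOMEM, which is
what a library (as opposed to a program) has to allow.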

I noticed that you are active on the gnulib development mailing
list. Would you like to raise the issue there?

   A message was posted a few days ago about error management in type 4
   functions. My parser already handles syntax errors (for now, it just
   returns PDF_EBADFILE), and it would be fairly easy to make the parser
   read type 4 functions if it would help.

The (quite simple) parser for type 4 functions is implemented in
src/base/pdf-fp-func.[ch] and it already provides detection of syntax
errors and some run time errors. It is also trivial to adapt it to
cover more error situations. As the task description states, the
bigger problem there is to come up with a suitable and general enough
interface for 'pdf_fp_func_eval'.

For 'pdf_fp_func_4_new', since it is only used for type 4 functions,
the addition of a new parameter of type 'struct
pdf_fp_func_4_errors_s *' should do it.
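
A rough sketch of that pattern, an optional pointer to an error
structure that the constructor fills in, could look like the code
below. Everything here is hypothetical: the 'demo_' names, the fields
of the structure, and the check itself (only brace balancing) are
assumptions for illustration, not the contents of pdf-fp-func.[ch].

```c
#include <assert.h>
#include <stddef.h>

enum { DEMO_OK = 0, DEMO_EBADSYNTAX = 1 };

/* Hypothetical error report; the fields are assumptions. */
struct demo_func_4_errors_s
{
  int status;      /* DEMO_OK or DEMO_EBADSYNTAX */
  size_t offset;   /* byte offset where the problem was detected */
};

/* Checks that the braces in SRC (LEN bytes) are balanced. Returns 1
   on success and 0 on a syntax error. If ERRORS is not NULL it is
   filled in with the details either way. */
static int
demo_func_4_check (const char *src, size_t len,
                   struct demo_func_4_errors_s *errors)
{
  size_t i;
  size_t where = 0;
  int depth = 0;
  int status = DEMO_OK;

  assert (src != NULL);
  for (i = 0; i < len; i++)
    {
      if (src[i] == '{')
        depth++;
      else if (src[i] == '}')
        {
          if (depth == 0)
            {
              /* Closing brace with nothing open. */
              status = DEMO_EBADSYNTAX;
              where = i;
              break;
            }
          depth--;
        }
    }

  if (status == DEMO_OK && depth != 0)
    {
      /* Input ended with an unclosed brace. */
      status = DEMO_EBADSYNTAX;
      where = len;
    }

  if (errors != NULL)
    {
      errors->status = status;
      errors->offset = where;
    }
  return status == DEMO_OK;
}
```

Callers that do not care about the details can pass NULL and look
only at the return value.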

--
Jose E. Marchesi <jema...@gnu.org>
GNU Project
