Re: [pdf-devel] Object layer API

Michael Gold Fri, 13 Feb 2009 00:09:52 -0800

On Fri, Feb 06, 2009 at 00:49:36 +0100, jema...@gnu.org wrote:
> 
> Hi.
> 
> I just committed in the trunk the first draft of the object layer
> public API. It is documented in the "Object Layer" chapter of the
> reference manual.


Can you post a version online that doesn't contain so many pages?
The object layer chapter alone contains over 20 pages, which makes it
very annoying to read -- a single downloadable file would be nice.

Comments on specific items follow.

[3.2.2]

pdf_obj_doc_new
 - Why not always create an info dictionary by default?
 - The "&doc" parameter should be "*doc".

pdf_obj_doc_open
 - Would it be better to take an open stream, rather than a filesystem
   and path?
 - What's the point of header_string?
    - The client probably doesn't care about the header -- they just
      want the library to open a valid PDF.
    - There may be more than one acceptable string, e.g.
      "%PDF-" and "%!PS-Adobe-N.n PDF-"
 - There's no mention of non-blocking I/O (this applies to other
   functions too).  If we'll allow people to open PDF files over HTTP
   (or using other slow methods), we'll need some way to handle the case
   where we need to wait for data -- we may need to return an incomplete
   document object, since we might read the header and then get
   EWOULDBLOCK when trying to read the trailer or xref table(s).
    - Maybe we should have one function to create the pdf_obj_doc_t, and
      another to initialize it from a stream.  The initialization
      function could return EWOULDBLOCK when necessary, and be called
      again when more data is available.

pdf_obj_doc_close
 - Wouldn't the handle become invalid on PDF_OK also?

pdf_obj_doc_save
 - Why wouldn't you want an xref table?  It should probably be written
   by default.
 - Why should _SAVE_FULL imply _SAVE_DO_GC?  Both flags could be given
   if desired.
 - If file==NULL, do we try to keep the same inode number?  Is the file
   corrupted after an error?
 - Should there be a separate GC function instead of a flag?
 - Again, there should be some consideration for non-blocking I/O.
 - Why does the client need to care about the header string?
    - If they specify an old version, do we ensure the output is
      compatible with that version?  What do we do with unrecognized
      versions?
    - I think we should let the user specify some kind of compatibility
      level, and let the library handle details like this.
 - How do we track modifications when _SAVE_FULL is not used?

[3.2.3]

pdf_obj_doc_get_id
 - Rather than pdf_char_t*/pdf_size_t pairs, we could use a pdf_obj_t
   with string type.

pdf_obj_doc_set_dirty
    "If the dirty flag of a document is set then its contents are saved
     before the document is closed."
 - Does this mean doc_close will write a file?  That doesn't seem right.

[3.3.1]

pdf_obj_gen_t
 - Should be described as "non-negative", not "positive".

[3.3.2]

pdf_obj_copy
 - If copy_indirect is false, what happens when an indirect object is
   seen?
 - Where does the object go?  I.e., if you throw away whatever's
   returned in *dest, does dest_doc still have some reference to it?
 - If an object didn't have a back-reference to a document, something
   like pdf_obj_dup would be sufficient.
 - If a stream is copied, does closing the original file invalidate it?
   (if not, where's the data stored?)

pdf_obj_destroy
    "This function is a nop if obj is a direct scalar type or the Null
     object."
 - Why would it be a nop?  Scalar objects need to be freed too.

pdf_obj_enum
 - Non-iterable types should return an error.
 - Iterating over a stream doesn't make any sense.  If the client wants
   to iterate over its dictionary, they should provide the dictionary
   object.
 - Will iteration stop as soon as PDF_FALSE is returned?
 - Why is a callback used?  This will often be annoying to work with,
   especially given C's lack of closures (users will need to define and
   populate a struct, cast client_data to it, etc.).  It will also make
   interactions with other languages more difficult, e.g. if a client
   wants to throw an exception.

pdf_obj_equal
 - Should be "pdf_obj_equal_p".

    "If they are indirect, they share the same generation number."
 - We should also verify that they point to the same object.

    "If they are non-scalar, they reference the same value. "
 - What does "the same value" mean?  Does obj1 have to be the same
   pointer as obj2, or just point to a similar array/dict?
 - Will we compare stream contents?  This would be expensive.

pdf_obj_get_doc
 - PDF_EBADDATA could be used instead of PDF_EINVOBJ (elsewhere too).

pdf_obj_get_type
    "The returned value should be interpreted as a enum pdf_obj_type_e."
 - Why not use that type as the return value?
 - enum pdf_obj_type_e doesn't include an INDIRECT type.  Do we return
   the type of whatever it references?  If so,
    - What if we need to seek to find it, and get an error?
    - What if the indirect object points to another indirect object?
      How many levels do we go before giving up?

pdf_obj_get_id
 - When is an object ID assigned?  If the user creates an indirect
   object, but hasn't saved the file, do we already know the ID?

pdf_obj_compressed_p
 - Why would the client care about this?
 - Why does the object need to know where it came from?  I'd expect this
   (i.e. the location of an object identified by an ID and generation)
   to be a property of the document.

[3.3.3]
 - Why would we need to create strong references?  Normally the document
   will hold a reference (directly or indirectly) to each object, and if
   the client wants the library to write an unused object, it can avoid
   garbage collection.

[3.3.4]
 - Why can the client force incompressible objects?  Can't the library
   determine the appropriate way to write out each object?
 - Why can't we compress an object with a generation number of zero?  We
   could change it to zero when writing the file if we assign a new
   object ID.

[3.3.5]
 - The document doesn't explain what object collections are, but you
   stated in your email:
>   The object documents support the notion of "object collections",
>   that are eventually translated into one or more linked object
>   streams.

 - Why would the client need to manage them?

[3.3.6 -- 3.3.14]

pdf_obj_*_new
 - I agree with gerel that it doesn't seem right to require every object
   be associated with a document.  I'm not sure there should be a
   pointer from the object to the document at all -- what would it be
   used for?  The document can find its objects by following references
   from its root.
 - Why does each _new method have an indirect flag?  Another option
   would be to let the user create a direct object, and then create a
   reference to it when necessary (and we could also turn direct objects
   into indirect objects when writing, where the spec requires it).
 - If an indirect object is created, will the pdf_obj_*_value methods
   resolve the reference to return the value?
   (if so, my comments on pdf_obj_get_type apply here too)

pdf_obj_null_new
 - Why is the null constructor different from the rest?

pdf_obj_boolean_new
 - It seems silly to associate a boolean with a document, or allow
   indirect boolean objects if indirect nulls aren't possible.

pdf_obj_name_new
 - Presumably the name is null-terminated, so this should be stated.

pdf_obj_name_value, _string_value
 - It should be possible to get at the value without copying it.

pdf_obj_string_hex_p, _hex_set
 - I don't see any reason for this flag -- it's trivial to decide
   between escaped or hex strings when writing, and there should be no
   reason to care when reading.

arrays/dicts
 - What's a weak reference?  Is it related to strong references (3.3.3)?
   If so, it's redundant, and otherwise it's confusing.
 - How is object ownership handled?  Do the _remove methods delete the
   removed objects? (does it depend on whether it's weak?)  Do _get and
   _put copy the object?
    - When writing the parser, I found it was reasonable for containers
      to own the objects inside them (i.e. delete them when an object is
      removed, or when the container is destroyed).  The only difficulty
      was when transforming a collection, e.g. creating a dict based on
      an array: each item is temporarily owned by both containers, but
      they shouldn't be deleted when destroying the original container.

pdf_obj_array_insert
 - Clarify that index==pdf_obj_array_size() inserts at the end.
 - I'm not sure why an object would know whether it's in a container,
   but PDF_EINVOBJ implies it does.

pdf_obj_array_size
 - How are errors handled?  If we're resolving indirect refs, they'd
   need to be returned; otherwise errors are only possible when the
   object isn't an array, and we can leave that case undefined or return
   zero.

pdf_obj_array_remove
 - Is this useful? (maybe adding _find and using _remove_at would be
   better, but I'm not sure why you'd want to search within an array.)

pdf_obj_stream_new
 - There's no reason for indirect_p if it has to be PDF_TRUE.
 - source_length is redundant since attrs_dict must contain the length
   (if we have both, we'll have to explain which takes precedence and
   whether attrs_dict will be modified by the library to add a /Length
   element).
 - Maybe it should be possible to pass a callback function that will
   open and return a stream when necessary (since it may be impractical
   to open all streams ahead of time).
    - Or this type of callback could be handled at the stream layer, by
      creating a dummy stream type that executes callbacks to do the
      real work.

pdf_obj_stream_length
 - "length" should be a pointer.

[general]

The usage of "_get_" seems inconsistent, e.g. pdf_obj_get_type vs.
pdf_obj_integer_value.

The PDF spec says a reference to an object that doesn't exist is to be
treated as a reference to null.  Can we avoid assigning an ID that is
unused, but is referenced in this way? (since the reference would then
be pointing at some random object)

-- Michael

signature.asc
Description: Digital signature

Re: [pdf-devel] Object layer API

Reply via email to