On Fri, Feb 06, 2009 at 00:49:36 +0100, jema...@gnu.org wrote: > > Hi. > > I just committed in the trunk the first draft of the object layer > public API. It is documented in the "Object Layer" chapter of the > reference manual.
Can you post a version online that doesn't contain so many pages? The object layer chapter alone contains over 20 pages, which makes it very annoying to read -- a single downloadable file would be nice. Comments on specific items follow. [3.2.2] pdf_obj_doc_new - Why not always create an info dictionary by default? - The "&doc" parameter should be "*doc". pdf_obj_doc_open - Would it be better to take an open stream, rather than a filesystem and path? - What's the point of header_string? - The client probably doesn't care about the header -- they just want the library to open a valid PDF. - There may be more than one acceptable string, e.g. "%PDF-" and "%!PS-Adobe-N.n PDF-" - There's no mention of non-blocking I/O (this applies to other functions too). If we'll allow people to open PDF files over HTTP (or using other slow methods), we'll need some way to handle the case where we need to wait for data -- we may need to return an incomplete document object, since we might read the header and then get EWOULDBLOCK when trying to read the trailer or xref table(s). - Maybe we should have one function to create the pdf_obj_doc_t, and another to initialize it from a stream. The initialization function could return EWOULDBLOCK when necessary, and be called again when more data is available. pdf_obj_doc_close - Wouldn't the handle become invalid on PDF_OK also? pdf_obj_doc_save - Why wouldn't you want an xref table? It should probably be written by default. - Why should _SAVE_FULL imply _SAVE_DO_GC? Both flags could be given if desired. - If file==NULL, do we try to keep the same inode number? Is the file corrupted after an error? - Should there be a separate GC function instead of a flag? - Again, there should be some consideration for non-blocking I/O. - Why does the client need to care about the header string? - If they specify an old version, do we ensure the output is compatible with that version? What do we do with unrecognized versions? - I think we should let the user specify some kind of compatibility level, and let the library handle details like this. - How do we track modifications when _SAVE_FULL is not used? [3.2.3] pdf_obj_doc_get_id - Rather than pdf_char_t*/pdf_size_t pairs, we could use a pdf_obj_t with string type. pdf_obj_doc_set_dirty "If the dirty flag of a document is set then its contents are saved before the document is closed." - Does this mean doc_close will write a file? That doesn't seem right. [3.3.1] pdf_obj_gen_t - Should be described as "non-negative", not "positive". [3.3.2] pdf_obj_copy - If copy_indirect is false, what happens when an indirect object is seen? - Where does the object go? I.e., if you throw away whatever's returned in *dest, does dest_doc still have some reference to it? - If an object didn't have a back-reference to a document, something like pdf_obj_dup would be sufficient. - If a stream is copied, does closing the original file invalidate it? (if not, where's the data stored?) pdf_obj_destroy "This function is a nop if obj is a direct scalar type or the Null object." - Why would it be a nop? Scalar objects need to be freed too. pdf_obj_enum - Non-iterable types should return an error. - Iterating over a stream doesn't make any sense. If the client wants to iterate over its dictionary, they should provide the dictionary object. - Will iteration stop as soon as PDF_FALSE is returned? - Why is a callback used? This will often be annoying to work with, especially given C's lack of closures (users will need to define and populate a struct, cast client_data to it, etc.). It will also make interactions with other languages more difficult, e.g. if a client wants to throw an exception. pdf_obj_equal - Should be "pdf_obj_equal_p". "If they are indirect, they share the same generation number." - We should also verify that they point to the same object. "If they are non-scalar, they reference the same value. " - What does "the same value" mean? Does obj1 have to be the same pointer as obj2, or just point to a similar array/dict? - Will we compare stream contents? This would be expensive. pdf_obj_get_doc - PDF_EBADDATA could be used instead of PDF_EINVOBJ (elsewhere too). pdf_obj_get_type "The returned value should be interpreted as a enum pdf_obj_type_e." - Why not use that type as the return value? - enum pdf_obj_type_e doesn't include an INDIRECT type. Do we return the type of whatever it references? If so, - What if we need to seek to find it, and get an error? - What if the indirect object points to another indirect object? How many levels do we go before giving up? pdf_obj_get_id - When is an object ID assigned? If the user creates an indirect object, but hasn't saved the file, do we already know the ID? pdf_obj_compressed_p - Why would the client care about this? - Why does the object need to know where it came from? I'd expect this (i.e. the location of an object identified by an ID and generation) to be a property of the document. [3.3.3] - Why would we need to create strong references? Normally the document will hold a reference (directly or indirectly) to each object, and if the client wants the library to write an unused object, it can avoid garbage collection. [3.3.4] - Why can the client force incompressible objects? Can't the library determine the appropriate way to write out each object? - Why can't we compress an object with a generation number of zero? We could change it to zero when writing the file if we assign a new object ID. [3.3.5] - The document doesn't explain what object collections are, but you stated in your email: > The object documents support the notion of "object collections", > that are eventually translated into one or more linked object > streams. - Why would the client need to manage them? [3.3.6 -- 3.3.14] pdf_obj_*_new - I agree with gerel that it doesn't seem right to require every object be associated with a document. I'm not sure there should be a pointer from the object to the document at all -- what would it be used for? The document can find its objects by following references from its root. - Why does each _new method have an indirect flag? Another option would be to let the user create a direct object, and then create a reference to it when necessary (and we could also turn direct objects into indirect objects when writing, where the spec requires it). - If an indirect object is created, will the pdf_obj_*_value methods resolve the reference to return the value? (if so, my comments on pdf_obj_get_type apply here too) pdf_obj_null_new - Why is the null constructor different from the rest? pdf_obj_boolean_new - It seems silly to associate a boolean with a document, or allow indirect boolean objects if indirect nulls aren't possible. pdf_obj_name_new - Presumably the name is null-terminated, so this should be stated. pdf_obj_name_value, _string_value - It should be possible to get at the value without copying it. pdf_obj_string_hex_p, _hex_set - I don't see any reason for this flag -- it's trivial to decide between escaped or hex strings when writing, and there should be no reason to care when reading. arrays/dicts - What's a weak reference? Is it related to strong references (3.3.3)? If so, it's redundant, and otherwise it's confusing. - How is object ownership handled? Do the _remove methods delete the removed objects? (does it depend on whether it's weak?) Do _get and _put copy the object? - When writing the parser, I found it was reasonable for containers to own the objects inside them (i.e. delete them when an object is removed, or when the container is destroyed). The only difficulty was when transforming a collection, e.g. creating a dict based on an array: each item is temporarily owned by both containers, but they shouldn't be deleted when destroying the original container. pdf_obj_array_insert - Clarify that index==pdf_obj_array_size() inserts at the end. - I'm not sure why an object would know whether it's in a container, but PDF_EINVOBJ implies it does. pdf_obj_array_size - How are errors handled? If we're resolving indirect refs, they'd need to be returned; otherwise errors are only possible when the object isn't an array, and we can leave that case undefined or return zero. pdf_obj_array_remove - Is this useful? (maybe adding _find and using _remove_at would be better, but I'm not sure why you'd want to search within an array.) pdf_obj_stream_new - There's no reason for indirect_p if it has to be PDF_TRUE. - source_length is redundant since attrs_dict must contain the length (if we have both, we'll have to explain which takes precedence and whether attrs_dict will be modified by the library to add a /Length element). - Maybe it should be possible to pass a callback function that will open and return a stream when necessary (since it may be impractical to open all streams ahead of time). - Or this type of callback could be handled at the stream layer, by creating a dummy stream type that executes callbacks to do the real work. pdf_obj_stream_length - "length" should be a pointer. [general] The usage of "_get_" seems inconsistent, e.g. pdf_obj_get_type vs. pdf_obj_integer_value. The PDF spec says a reference to an object that doesn't exist is to be treated as a reference to null. Can we avoid assigning an ID that is unused, but is referenced in this way? (since the reference would then be pointing at some random object) -- Michael
signature.asc
Description: Digital signature