> The next component to add would be a reader that can read the xref
> table and resolve indirect object references.
>
> Before proceeding with the implementation of code pertaining to the
> object layer, we need to complete other activities.
Some parts of the code won't depend too much on the public API. For
example, when opening a PDF file it will be necessary to read the
version comment, check whether it's linearized, read the trailer, etc.

For example, take a look at

@deftypefun pdf_status_t pdf_obj_doc_save (pdf_obj_doc_t @var{doc}, pdf_fsys_file_t @var{file}, pdf_u32_t @var{flags}, struct pdf_obj_doc_save_params_s @var{params})

It is part of the public interface of the object layer that I am
drafting these days. The pdf_obj_doc_save_params_s structure has the
following fields:

@deftp {Data Type} {struct pdf_obj_doc_save_params_s}
Parameters used when saving an object document.

@table @code
@item pdf_char_t *header
A complete header string, like

@example
%FOO-1.0
@end example

@item pdf_char_t *crypt_key
An encryption key, used if the security of the document is activated.

@item pdf_size_t crypt_key_size
Size of @code{crypt_key}.

@item struct pdf_pm_s *progress_monitor
A pointer to a progress monitor, or @code{NULL}.

@item void *monitor_client_data
Client-specific data for the progress monitor callbacks.
@end table
@end deftp

Look at the 'header' field. It turns out that it is quite useful for
the client of the base layer to be able to specify arbitrary headers,
such as %FDF-. It is a small example without many consequences, but it
shows the idea: we need to have a clear vision of the whole layer
before starting the implementation. That is the reason we are going to
work on the design for some days before launching the development
tasks.

> I estimate that by the end of this weekend I will finish the drafts
> for: ...
>
> - The overall design of the object layer:
>
>   + What modules are we going to implement, the specific role of each
>     module, and how modules collaborate to implement the public
>     interface. This is not trivial. Among other things, we will have
>     to decide how to implement the garbage collection of indirect
>     objects when saving the document, how to manage the creation of
>     new stream objects (the Acrobat SDK uses temporary files, for
>     example),

Can you clarify what you mean by the creation of new stream objects? I
don't see what temporary files would be used for. I'd expect the writer
module to read from a pdf_stm_t to get at the filtered data (and use an
indirect integer as the stream dictionary's /Length field). Maybe there
would be a callback function the writer could use to set up that
pdf_stm_t. Of course, it may pull data from a temp file, but I'd think
that would be up to the user of the library.

It was just an example. The client of the base layer may want to create
a stream with quite large contents, such as a huge image. Then it calls

@example
pdf_obj_stream_new (pdf_obj_doc_t @var{doc},
                    pdf_bool_t @var{indirect_p},
                    pdf_stm_t @var{stm},
                    pdf_i32_t @var{source_start},
                    pdf_bool_t @var{source_encode_p},
                    pdf_obj_t @var{attrs_dict},
                    pdf_obj_t @var{encode_parms_dict},
                    pdf_size_t @var{source_length},
                    pdf_obj_t *@var{obj})
@end example

In that call, STM (opened to read from some backend, for example the
huge file) is the data provider. We also want to filter that data
through some filter chain.
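Just to make that flow concrete, a client call could look more or less
like the sketch below. The names 'pdf_stm_file_open', 'PDF_STM_READ'
and 'pdf_obj_dict_new' are placeholders invented for the example (the
real base and object layer APIs may differ); the rest follows the
preliminary declaration above.

@example
/* Sketch only: register a big indirect stream whose data comes from a
   file-backed read stream.  pdf_stm_file_open, PDF_STM_READ and
   pdf_obj_dict_new are hypothetical names.  */
static pdf_status_t
add_huge_image_stream (pdf_obj_doc_t doc,
                       const char *path,
                       pdf_obj_t *stream_obj)
{
  pdf_status_t ret;
  pdf_stm_t stm;
  pdf_obj_t attrs;

  /* Open a read stream backed by the huge file: the data provider.  */
  ret = pdf_stm_file_open (path, PDF_STM_READ, &stm);
  if (ret != PDF_OK)
    return ret;

  /* Attributes dictionary for the new stream (entries omitted).  */
  ret = pdf_obj_dict_new (doc, &attrs);
  if (ret != PDF_OK)
    return ret;

  /* Create an indirect stream object in DOC, taking its data from STM
     and encoding it with the requested filter chain.  */
  return pdf_obj_stream_new (doc,
                             PDF_TRUE,    /* indirect_p */
                             stm,         /* stm */
                             0,           /* source_start */
                             PDF_TRUE,    /* source_encode_p */
                             attrs,       /* attrs_dict */
                             NULL,        /* encode_parms_dict */
                             0,           /* source_length */
                             stream_obj); /* obj */
}
@end example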
It is clear that we cannot create the stream structure (the stream
dictionary and the stream data) in memory: we could run out of it. The
Adobe SDK seems to fix this problem by creating a written PDF
representation of the stream object in a temporary file, something
like:

@float Example,ex:temporary-stream-file
@example
10 0 obj
<< /Length 20 0 R
   other stream attributes...
>>
stream
endstream
endobj

20 0 obj
1024
endobj
@end example
@caption{Stream object temporary file, like /tmp/fooXYZ}
@end float

Then, when the user requests to save the file, the stream object is
read from the temporary file and written into the document, using some
base stm and also a parser. Since our implementation of the base stm
uses a limited amount of memory (regardless of the length of the
filtered data that goes through the filter), the implementation won't
eat a lot of memory even when working with huge files.

If the stream gets garbage-collected after the document is saved, and
at the same time there is no strong reference pointing to it, then the
temporary file is deleted.

But I don't want to raise that debate right now; rather in three days,
when we have a base to work on: the draft of the design.

> and a long etcetera. Some decisions in this phase will have a direct
> impact on the implementation of the parser, for example: do we want
> to have a separate parser for xref tables?

That would probably be best, since it really isn't in an
operand/operator format like the other parts of the file. The
tokeniser could still be used.

Yes, and it also has a fixed format.

> can we use the same data type (pdf_obj_t) both for the public
> interface and for use by the parser?, etc.

It's obvious to me that we should, since many of the types (like
strings) would end up being identical. One thing I wanted to discuss,
though, was whether it's also OK to store tokens in pdf_obj_t.

It is not that obvious. The public interface should include the
document in which to create the object (see the pdf_obj_stream_new
preliminary declaration above) in order for it to register the object.
But do we want to let the parser know about a pdf_obj_doc_t? I don't
think so (but I am not sure about the contrary either). Likely we will
need a different interface. But again, I don't want to raise the
debate right now.

> Note also that the existing code in src/object/ is by all means
> obsolete and useless. I wrote it too quickly, back before we started
> to follow a more "structured" development method.

Some of the changes I made were to make it match the current coding
style, such as the _new functions returning a pdf_status_t and filling
in a pointer. Also, I removed the custom array and dict implementations
and changed them to use pdf_list_t and pdf_hash_t. For now, I removed
some of the object copying, and have the container structures owning
their contents and returning references in the _get methods. I'm not
sure whether we'll want to keep this, or to manage the memory in some
other way. One thing that would reduce memory usage would be to
allocate global objects for null, true, false, and maybe certain
keywords and numbers. pdf_obj_null_new could then just return a pointer
to the constant structure.

It is a pity that you dedicated all that time to working on that
obsolete pdf_obj* code: quite probably we won't use it at all. That is
the reason we have a task manager with some tasks marked as NEXT,
meaning that they are ready to be worked on, for the people interested
in helping to take them. But it is also my fault for keeping that old
obsolete code in the repository. Sorry about that.

> One annoying thing I noticed was that pdf_list_t needs a heap
> allocation to use an iterator, which means that pdf_obj_equal_p could
> fail with an ENOMEM error (but currently has no way to return that
> error). It would be nice if the iterator could be kept on the stack
> -- struct pdf_list_iterator_s only contains one member anyway.
>
> Gerel, what do you think about this? We would of course lose the
> benefits of the opaque pointers in this case, but 'pdf_obj_equal_p'
> throwing PDF_ENOMEM sounds quite weird, and I think that we could
> make an exception and publish the iterator structure.

We wouldn't necessarily lose all the benefits of opaque pointers. Only
the size of the structure needs to be public in order to allocate it on
the stack. The contents don't need to be documented, and we could leave
some padding for future use as recommended in
http://people.redhat.com/drepper/goodpractice.pdf

Good point! How would that be done for our public data types
implemented as structures? Can you provide an example?
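A minimal sketch of what that could look like for struct
pdf_list_iterator_s, assuming it keeps its current single member; the
member name and the amount of padding below are illustrative, not a
real definition:

@example
/* Published structure: only its size becomes part of the public
   contract.  The members stay undocumented and may change, as long as
   the structure never outgrows the reserved space.  */
struct pdf_list_iterator_s
{
  void *current;       /* stand-in for the real single member */
  void *reserved[4];   /* padding reserved for future use */
};

typedef struct pdf_list_iterator_s pdf_list_iterator_t;
@end example

A caller could then keep a pdf_list_iterator_t on the stack and pass
its address to the iteration functions, so pdf_obj_equal_p would never
have to report PDF_ENOMEM just to walk a list.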
> The gnulib list code will also need to be changed to return ENOMEM
> when necessary -- currently it just calls abort() when malloc fails.
>
> Yes, we noticed that. It would be really nice to modify the list
> module so that it does not crash if 'xalloc_die' returns NULL (does
> nothing). Really, I think that it is the only way to use the list
> module in a library.
>
> I noticed that you are active in the gnulib development mailing
> list. Would you want to raise the issue there?

Yes, I was planning to raise it once I update c-strtod.

Superb! I don't know how many gnulib modules are relying on a "killing"
xalloc_die (I suspect that some of them are), but it would be really
good to have that fixed.

> The (quite simple) parser for type 4 functions is implemented in
> src/base/pdf-fp-func.[ch] and it already provides detection of syntax
> errors and some run-time errors.

It's somewhat buggy, though. I see numerous places where it will crash
on ENOMEM, and since it uses strtod it will also fail in locales where
the decimal point isn't "." (and I think it will also accept strings
like "NAN" and numbers that use exponent notation, which shouldn't be
accepted). Using common code where possible could make this more
maintainable.

These are mainly lexical issues... maybe we could think about moving
the lexer module to the base layer. That way the fp module could use it
in the little type 4 function parser. What do you (and others) think
about that? If you agree I will open some NEXT tasks to create the new
module (and its tests, etc.) and will mark it as a dependency for the
error reporting in type 4 functions.

> For 'pdf_fp_func_4_new', since it is only used for type 4 functions,
> the addition of a new parameter of type
> 'struct pdf_fp_func_4_errors_s *' should do it.

Why not just return an error status (pdf_status_t)? In general, how
much detail should be provided when errors are detected?

In the case of the type 4 functions parser we need to be able to
generate detailed error messages: the kind of error and the line number
containing the error. Think of a PDF producer client application
letting the user enter a new PDF type 4 function definition through
some kind of user interface.

In my code, I just return PDF_EBADFILE if I detect that the file isn't
valid, since I don't expect users of the library to be interested in
much more -- but I see a number of very specific errors in
pdf-errors.h.

Well, I am still not really sure about that. I think that we should not
propagate the details about an invalid file to the document layer, and
PDF files are not text-based, but detailed information may be useful
for other purposes in the object layer. We will know while working on
the design.

-- 
Jose E. Marchesi  <jema...@gnu.org>  http://www.jemarch.net
GNU Project       http://www.gnu.org
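As a rough illustration of the detailed error reporting discussed above
for the type 4 functions parser, the extra parameter could point to
something like the following structure (the name follows the
'pdf_fp_func_4_errors_s' suggestion quoted above; the fields are
hypothetical):

@example
/* Hypothetical error report filled in by the type 4 function parser
   when it rejects a program, so a client user interface can show what
   went wrong and where.  */
struct pdf_fp_func_4_errors_s
{
  pdf_status_t kind;   /* kind of error detected */
  pdf_size_t line;     /* line number containing the error */
};
@end example

pdf_fp_func_4_new could still return a plain pdf_status_t, with the
structure only carrying the extra detail for clients that want it.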