> The next component to add would be a reader that can read the xref
> table and resolve indirect object references.
>
> Before proceeding with the implementation of code pertaining to the
> object layer, we need to complete other activities.
Some parts of the code won't depend too much on the public API. For
example, when opening a PDF file it will be necessary to read the
version comment, check whether it's linearized, read the trailer, etc.

For example, take a look at

@deftypefun pdf_status_t pdf_obj_doc_save (pdf_obj_doc_t @var{doc}, pdf_fsys_file_t @var{file}, pdf_u32_t @var{flags}, struct pdf_obj_doc_save_params_s @var{params})

It is part of the public interface of the object layer that I am
drafting these days. The pdf_obj_doc_save_params_s structure has the
following fields:

@deftp {Data Type} {struct pdf_obj_doc_save_params_s}
Parameters used when saving an object document.

@table @code
@item pdf_char_t *header
A complete header string, like

@example
%FOO-1.0
@end example

@item pdf_char_t *crypt_key
An encryption key, used if the security of the document is activated.

@item pdf_size_t crypt_key_size
Size of @code{crypt_key}.

@item struct pdf_pm_s *progress_monitor
A pointer to a progress monitor, or @code{NULL}.

@item void *monitor_client_data
Client-specific data for the progress monitor callbacks.
@end table
@end deftp

Look at the 'header' field. It turns out that it is quite useful for
the client of the base layer to be able to specify arbitrary headers,
such as %FDF-. It is a small example without many consequences, but it
shows the idea: we need to have a clear vision of the whole layer
before starting the implementation. That is the reason we are going to
work on the design for some days before launching the development
tasks.

> I estimate that by the end of this weekend I will finish the drafts
> for: ...
>
> - The overall design of the object layer:
>
>   + What modules are we going to implement, the specific role of each
>     module, and how modules collaborate to implement the public
>     interface. This is not trivial. Among other things, we will have
>     to decide how to implement the garbage collection of indirect
>     objects when saving the document, how to manage the creation of
>     new stream objects (the Acrobat SDK uses temporary files, for
>     example),

Can you clarify what you mean by the creation of new stream objects? I
don't see what temporary files would be used for. I'd expect the writer
module to read from a pdf_stm_t to get at the filtered data (and use an
indirect integer as the stream dictionary's /Length field). Maybe there
would be a callback function the writer could use to set up that
pdf_stm_t. Of course, it may pull data from a temp file, but I'd think
that would be up to the user of the library.

It was just an example. The client of the base layer may want to create
a stream with quite large contents, such as a huge image. Then it calls

@example
pdf_obj_stream_new (pdf_obj_doc_t @var{doc},
                    pdf_bool_t @var{indirect_p},
                    pdf_stm_t @var{stm},
                    pdf_i32_t @var{source_start},
                    pdf_bool_t @var{source_encode_p},
                    pdf_obj_t @var{attrs_dict},
                    pdf_obj_t @var{encode_parms_dict},
                    pdf_size_t @var{source_length},
                    pdf_obj_t *@var{obj})
@end example

In that call, STM (opened to read from some backend, for example the
huge file) is the data provider. We also want to filter that data
through some filter chain.
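Just to make that flow concrete, a client call could look more or less
like the sketch below. The names 'pdf_stm_file_open', 'PDF_STM_READ'
and 'pdf_obj_dict_new' are placeholders invented for the example (the
real base and object layer APIs may differ); the rest follows the
preliminary declaration above.

@example
/* Sketch only: register a big indirect stream whose data comes from a
   file-backed read stream.  pdf_stm_file_open, PDF_STM_READ and
   pdf_obj_dict_new are hypothetical names.  */
static pdf_status_t
add_huge_image_stream (pdf_obj_doc_t doc,
                       const char *path,
                       pdf_obj_t *stream_obj)
{
  pdf_status_t ret;
  pdf_stm_t stm;
  pdf_obj_t attrs;

  /* Open a read stream backed by the huge file: the data provider.  */
  ret = pdf_stm_file_open (path, PDF_STM_READ, &stm);
  if (ret != PDF_OK)
    return ret;

  /* Attributes dictionary for the new stream (entries omitted).  */
  ret = pdf_obj_dict_new (doc, &attrs);
  if (ret != PDF_OK)
    return ret;

  /* Create an indirect stream object in DOC, taking its data from STM
     and encoding it with the requested filter chain.  */
  return pdf_obj_stream_new (doc,
                             PDF_TRUE,    /* indirect_p */
                             stm,         /* stm */
                             0,           /* source_start */
                             PDF_TRUE,    /* source_encode_p */
                             attrs,       /* attrs_dict */
                             NULL,        /* encode_parms_dict */
                             0,           /* source_length */
                             stream_obj); /* obj */
}
@end example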
It is clear that we cannot create the stream structure (the stream
dictionary and the stream data) in memory: we could run out of it. The
Adobe SDK seems to fix this problem by creating a written PDF
representation of the stream object in a temporary file, something
like:

@float Example,ex:temporary-stream-file
@example
10 0 obj
<< /Length 20 0 R
   other stream attributes...
>>
stream
endstream
endobj

20 0 obj
1024
endobj
@end example
@caption{Stream object temporary file, like /tmp/fooXYZ}
@end float

Then, when the user requests to save the file, the stream object is
read from the temporary file and written into the document, using some
base stm and also a parser. Since our implementation of the base stm
uses a limited amount of memory (regardless of the length of the
filtered data that goes through the filter), the implementation won't
eat a lot of memory even when working with huge files.

If the stream gets garbage-collected after the document is saved, and
at the same time there is no strong reference pointing to it, then the
temporary file is deleted.

But I don't want to raise that debate right now; rather in three days,
when we have a base to work on: the draft of the design.

> and a long etcetera. Some decisions in this phase will have a direct
> impact on the implementation of the parser, for example: do we want
> to have a separate parser for xref tables?

That would probably be best, since it really isn't in an
operand/operator format like the other parts of the file. The
tokeniser could still be used.

Yes, and it also has a fixed format.

> can we use the same data type (pdf_obj_t) both for the public
> interface and for use by the parser?, etc.

It's obvious to me that we should, since many of the types (like
strings) would end up being identical. One thing I wanted to discuss,
though, was whether it's also OK to store tokens in pdf_obj_t.

It is not that obvious. The public interface should include the
document in which to create the object (see the pdf_obj_stream_new
preliminary declaration above) in order for it to register the object.
But do we want to let the parser know about a pdf_obj_doc_t? I don't
think so (but I am not sure about the contrary either). Likely we will
need a different interface. But again, I don't want to raise the
debate right now.

> Note also that the existing code in src/object/ is by all means
> obsolete and useless. I wrote it too quickly, back before we started
> to follow a more "structured" development method.

Some of the changes I made were to make it match the current coding
style, such as the _new functions returning a pdf_status_t and filling
in a pointer. Also, I removed the custom array and dict implementations
and changed them to use pdf_list_t and pdf_hash_t. For now, I removed
some of the object copying, and have the container structures owning
their contents and returning references in the _get methods. I'm not
sure whether we'll want to keep this, or to manage the memory in some
other way. One thing that would reduce memory usage would be to
allocate global objects for null, true, false, and maybe certain
keywords and numbers. pdf_obj_null_new could then just return a pointer
to the constant structure.

It is a pity that you dedicated all that time to working on that
obsolete pdf_obj* code: quite probably we won't use it at all. That is
the reason we have a task manager with some tasks marked as NEXT,
meaning that they are ready to be worked on, for the people interested
in helping to take them. But it is also my fault for keeping that old
obsolete code in the repository. Sorry about that.

> One annoying thing I noticed was that pdf_list_t needs a heap
> allocation to use an iterator, which means that pdf_obj_equal_p could
> fail with an ENOMEM error (but currently has no way to return that
> error). It would be nice if the iterator could be kept on the stack
> -- struct pdf_list_iterator_s only contains one member anyway.
>
> Gerel, what do you think about this? We would of course lose the
> benefits of the opaque pointers in this case, but 'pdf_obj_equal_p'
> throwing PDF_ENOMEM sounds quite weird, and I think that we could
> make an exception and publish the iterator structure.

We wouldn't necessarily lose all the benefits of opaque pointers. Only
the size of the structure needs to be public in order to allocate it on
the stack. The contents don't need to be documented, and we could leave
some padding for future use as recommended in
http://people.redhat.com/drepper/goodpractice.pdf

Good point! How would that be done for our public data types
implemented as structures? Can you provide an example?
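A minimal sketch of what that could look like for struct
pdf_list_iterator_s, assuming it keeps its current single member; the
member name and the amount of padding below are illustrative, not a
real definition:

@example
/* Published structure: only its size becomes part of the public
   contract.  The members stay undocumented and may change, as long as
   the structure never outgrows the reserved space.  */
struct pdf_list_iterator_s
{
  void *current;       /* stand-in for the real single member */
  void *reserved[4];   /* padding reserved for future use */
};

typedef struct pdf_list_iterator_s pdf_list_iterator_t;
@end example

A caller could then keep a pdf_list_iterator_t on the stack and pass
its address to the iteration functions, so pdf_obj_equal_p would never
have to report PDF_ENOMEM just to walk a list.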
> The gnulib list code will also need to be changed to return ENOMEM
> when necessary -- currently it just calls abort() when malloc fails.
>
> Yes, we noticed that. It would be really nice to modify the list
> module so that it does not crash if 'xalloc_die' returns NULL (does
> nothing). Really, I think that it is the only way to use the list
> module in a library.
>
> I noticed that you are active in the gnulib development mailing
> list. Would you want to raise the issue there?

Yes, I was planning to raise it once I update c-strtod.

Superb! I don't know how many gnulib modules are relying on a "killing"
xalloc_die (I suspect that some of them are), but it would be really
good to have that fixed.

> The (quite simple) parser for type 4 functions is implemented in
> src/base/pdf-fp-func.[ch] and it already provides detection of syntax
> errors and some run-time errors.

It's somewhat buggy, though. I see numerous places where it will crash
on ENOMEM, and since it uses strtod it will also fail in locales where
the decimal point isn't "." (and I think it will also accept strings
like "NAN" and numbers that use exponent notation, which shouldn't be
accepted). Using common code where possible could make this more
maintainable.

These are mainly lexical issues... maybe we could think about moving
the lexer module to the base layer. That way the fp module could use it
in the little type 4 function parser. What do you (and others) think
about that? If you agree I will open some NEXT tasks to create the new
module (and its tests, etc.) and will mark it as a dependency for the
error reporting in type 4 functions.

> For 'pdf_fp_func_4_new', since it is only used for type 4 functions,
> the addition of a new parameter of type
> 'struct pdf_fp_func_4_errors_s *' should do it.

Why not just return an error status (pdf_status_t)? In general, how
much detail should be provided when errors are detected?

In the case of the type 4 functions parser we need to be able to
generate detailed error messages: the kind of error and the line number
containing the error. Think of a PDF producer client application
letting the user enter a new PDF type 4 function definition through
some kind of user interface.

In my code, I just return PDF_EBADFILE if I detect that the file isn't
valid, since I don't expect users of the library to be interested in
much more -- but I see a number of very specific errors in
pdf-errors.h.

Well, I am still not really sure about that. I think that we should not
propagate the details about an invalid file to the document layer, and
PDF files are not text-based, but detailed information may be useful
for other purposes in the object layer. We will know while working on
the design.

-- 
Jose E. Marchesi  <jema...@gnu.org>  http://www.jemarch.net
GNU Project       http://www.gnu.org
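As a rough illustration of the detailed error reporting discussed above
for the type 4 functions parser, the extra parameter could point to
something like the following structure (the name follows the
'pdf_fp_func_4_errors_s' suggestion quoted above; the fields are
hypothetical):

@example
/* Hypothetical error report filled in by the type 4 function parser
   when it rejects a program, so a client user interface can show what
   went wrong and where.  */
struct pdf_fp_func_4_errors_s
{
  pdf_status_t kind;   /* kind of error detected */
  pdf_size_t line;     /* line number containing the error */
};
@end example

pdf_fp_func_4_new could still return a plain pdf_status_t, with the
structure only carrying the extra detail for clients that want it.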