[
https://issues.apache.org/jira/browse/COR-31?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
jan iversen updated COR-31:
---------------------------
Component/s: Wiki
> Identification of Document Format Tool Progressions: Access, Creation,
> Testing, Assessment, Validation, Forensics
> -----------------------------------------------------------------------------------------------------------------
>
> Key: COR-31
> URL: https://issues.apache.org/jira/browse/COR-31
> Project: Corinthia
> Issue Type: Task
> Components: Wiki
> Reporter: Dennis E. Hamilton
> Labels: document-forensics, document-format, document-standards,
> document-validation, documents, file-structure, test-suite
>
> There are many needs, and opportunities, for command-line and library-level
> tools that support the development of processors for different document
> formats.
> Many small tools can be developed as part of the application and verification
> of what will be larger solutions with regard to particular formats.
> This task is for identification of which such tools will be defined as
> work-product and deliverables for Corinthia, even in an initial provisional
> list. Having identified structure points for defined deliverables should
> aid in having different aspects of Corinthia available for development and
> testing by many hands and eyes.
> SKETCH
> There are different levels of tools, and the layers provide fixtures for
> exercising lower layers of code and also composing them into layers above.
> To be concrete, here is a sketch of the levels of tooling that can be
> byproducts and aids in the confirmation of correct handling of a document
> format.
> There are two "raw" formats that are handled in building document files of
> interest to us: text files and Zip packages (or other carriers of composite
> structures, such as MIME multi-part, tar files, Microsoft DocFiles, etc.).
> There are flat file formats atop text-file formats. Examples are Microsoft
> RTF, XML, and HTML. These are accompanied by character-set encoding
> variations that must be dealt with. There are also cases of linking that
> arise in these formats.
> RTF is a document format. XML carries document formats such as the
> single-file ODF format, the single-file XML formats defined for Microsoft
> Office, etc. There are already HTML-format usages that provide for fidelity
> preservation in round trip between HTML and Microsoft Office formats. There
> may be something similar that has lived in OpenOffice.org. These are very
> handy formats for creation of simple test documents that exercise the
> respective document models. They also provide experience with the document
> formats and efforts to abstract the document that is represented in those
> formats.
> Zip usage as a carrier raises its own needs for well-defined tools, both for
> the inspection of document files and for the validation and forensic
> analysis of the Zip usage for ODF, OOXML, and other formats, such as ePub.
> Now we're dealing with composite document files with multiple parts using
> flat formats, such as HTML and XML, and other formats, including binary
> formats not mentioned as part of this progressive layering. There are now
> more elaborate structures to abstract from the parts of the Zip package and
> the cross-references among them.
> These are all tooling opportunities and they support the testing and
> confirmation of the development of the document-processing functions that
> Corinthia makes available.
> The richness of this can be illustrated by the need for forensic and
> validation tools and how they may become interdependent.
> Consider the simple verification of a Zip file. There are two levels of
> verification that matter.
> First there is the fundamental invariant structure that a Zip archive must
> possess. In practical use, it is desirable to rapidly abstract the presence
> of a correct Zip and its components. It is desirable to be able to produce
> or update one efficiently. One wants a fail-safe and resilient response
> when an unacceptable Zip is encountered.
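A minimal sketch of that first, fast check, assuming Python's standard zipfile module stands in for whatever library Corinthia would actually use. The function name quick_zip_check is hypothetical. It reads only the central directory, decompresses nothing, and never raises on bad input.

```python
import io
import zipfile


def quick_zip_check(data: bytes):
    """Return (ok, detail) for a fast, fail-safe structural check.

    Hypothetical helper: on success, detail is the list of member names
    taken from the central directory; on failure, detail is the error text.
    No member data is decompressed here.
    """
    try:
        with zipfile.ZipFile(io.BytesIO(data)) as zf:
            return True, zf.namelist()
    except (zipfile.BadZipFile, OSError, ValueError) as exc:
        # Fail-safe: report the problem instead of propagating it.
        return False, str(exc)
```

A caller can proceed with the member list when ok is true and hand the bytes to a slower inspection tool otherwise.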
> At the same time, one wants a way to assess and inspect a Zip that is
> well-formed or is considered defective. A separate tool would be handier for
> that, but it is still needed to support document processing by inspecting and
> reporting how the Zip is unacceptable. That's more involved and not
> something one wants to endure just to get started with a document. At
> the same time, there is a good case for some reused common code as well, and
> these kinds of tools aid in the confirmation of that code too.
> Suppose a Zip is concluded to be damaged. Another level goes beyond
> detection of damage to determination of how much of the Zip can be recovered
> and what to do with the areas of damage. This is about rescuing documents.
> Yet another opportunity. Yet another elaborate use that can involve some
> shared underlying code.
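One way such a rescue could start, sketched under assumptions: when the central directory is lost, scan for local file header signatures and try to recover each member directly. The name salvage_by_local_headers is hypothetical, only the stored and deflate methods are handled, and a signature that happens to occur inside member data would produce a false hit, so this is an illustration rather than a robust recovery tool.

```python
import struct
import zlib

LOCAL_SIG = b"PK\x03\x04"
# Fixed 30-byte part of a Zip local file header (little-endian).
LOCAL_HEADER = struct.Struct("<4s5H3L2H")


def salvage_by_local_headers(data: bytes):
    """Return [(name, content_or_None)] for each local header found.

    Hypothetical helper. Content is None when the member could not be
    recovered or failed its CRC-32 check.
    """
    results = []
    pos = data.find(LOCAL_SIG)
    while pos != -1 and pos + LOCAL_HEADER.size <= len(data):
        (_sig, _ver, flags, method, _time, _date,
         crc, csize, _usize, nlen, xlen) = LOCAL_HEADER.unpack_from(data, pos)
        name_start = pos + LOCAL_HEADER.size
        body_start = name_start + nlen + xlen
        name = data[name_start:name_start + nlen].decode("utf-8", "replace")
        sizes_known = not flags & 0x08  # bit 3: sizes/CRC live in a data descriptor
        content = None
        try:
            if method == 0 and sizes_known:
                raw = data[body_start:body_start + csize]
                content = raw if len(raw) == csize else None
            elif method == 8:
                # A raw deflate stream marks its own end; trailing bytes are ignored.
                content = zlib.decompressobj(-15).decompress(data[body_start:])
        except zlib.error:
            content = None
        if content is not None and sizes_known and zlib.crc32(content) != crc:
            content = None  # recovered bytes do not match the recorded CRC-32
        results.append((name, content))
        pos = data.find(LOCAL_SIG, pos + 4)
    return results
```

Members written with a data descriptor carry no CRC in the local header, so their recovered bytes are returned unverified in this sketch.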
> We're now at the second level and that intersects with the use of a Zip as a
> particular kind of document container. A Zip may be well-formed, but there
> are additional limitations and functions that go into recognizing the Zip
> usage as a carrier of a particular document format. It can even be a generic
> carrier format, such as the Open Packaging Conventions (OPC) used for
> carrying OOXML, XPS, and other artifacts, and the OpenDocument 1.2 Package
> used for carrying ODF.
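To illustrate recognition at this second level, here is a rough sketch that guesses which packaging convention a well-formed Zip follows. It assumes only the best-known markers: a root [Content_Types].xml part for OPC, a mimetype entry plus META-INF/manifest.xml for ODF, and a mimetype entry plus META-INF/container.xml for ePub. The function name classify_package is hypothetical, and a real validator would check far more (entry order, compression of the mimetype entry, manifest contents).

```python
import io
import zipfile


def classify_package(data: bytes) -> str:
    """Guess the packaging convention: 'opc', 'odf', 'epub', or 'unknown'.

    Hypothetical helper that looks only at marker entries; it does not
    validate the package against any specification.
    """
    with zipfile.ZipFile(io.BytesIO(data)) as zf:
        names = set(zf.namelist())
        mimetype = ""
        if "mimetype" in names:
            mimetype = zf.read("mimetype").decode("ascii", "replace").strip()
    if "[Content_Types].xml" in names:
        return "opc"
    if mimetype == "application/epub+zip" and "META-INF/container.xml" in names:
        return "epub"
    if (mimetype.startswith("application/vnd.oasis.opendocument.")
            and "META-INF/manifest.xml" in names):
        return "odf"
    return "unknown"
```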
> There need to be analysis and inspection tools at this second level of
> generic Zip usage. This also has a cross-over value in the forensic problem
> of recovering what is recoverable in a damaged Zip archive. When it is known
> what additional structure is expected to be present, this can inform the
> identification of breakage and determination of loss.
> It's not all one-sided. What appears to be a well-formed Zip package for a
> given document format can still expose damage in the recording or compression
> (oh yes, compression and decompression) of any of its parts.
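A small sketch of how damage in individual parts might be detected, again assuming Python's zipfile module as a stand-in. The hypothetical damaged_parts helper reads every member to the end so that decompression errors and CRC-32 mismatches surface, and it reports all damaged parts rather than stopping at the first.

```python
import io
import zipfile
import zlib


def damaged_parts(data: bytes):
    """Return [(member_name, error_type_name)] for parts that fail to read.

    Hypothetical helper: the package structure must already be readable;
    this checks only the recorded data of each part.
    """
    damaged = []
    with zipfile.ZipFile(io.BytesIO(data)) as zf:
        for info in zf.infolist():
            try:
                with zf.open(info) as part:
                    # Read to the end; the CRC-32 is verified at end of data.
                    while part.read(65536):
                        pass
            except (zipfile.BadZipFile, zlib.error, EOFError) as exc:
                damaged.append((info.filename, type(exc).__name__))
    return damaged
```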
> This sketch is still at the plumbing level. The abstraction of document
> features is yet to happen. That means rising up another level.
> This is all just to point out how many opportunities for tools and supporting
> libraries there are. The tools are important for bootstrapping up the levels
> of Corinthia and for being able to check our own work, to devise tests and
> demonstrations, and to provide forensic support in the face of problems that
> may arise in the software or simply in circumstances that arise for users.
> The idea behind this task and its subtasks is to see what could be identified
> as point deliverables, even if fundamentally for our own work process, so
> that they become definable and something to work on, to be available in
> higher levels of operation, etc.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)