[ 
https://issues.apache.org/jira/browse/COR-31?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

jan iversen updated COR-31:
---------------------------
    Component/s: Wiki

> Identification of Document Format Tool Progressions: Access, Creation, 
> Testing, Assessment, Validation, Forensics
> -----------------------------------------------------------------------------------------------------------------
>
>                 Key: COR-31
>                 URL: https://issues.apache.org/jira/browse/COR-31
>             Project: Corinthia
>          Issue Type: Task
>          Components: Wiki
>            Reporter: Dennis E. Hamilton
>              Labels: document-forensics, document-format, document-standards, 
> document-validation, documents, file-structure, test-suite
>
> There are many needs, and opportunities, for command-line and library-level 
> tools that support the development of processors for different document 
> formats.  
> Many small tools can be developed as part of the application and verification 
> of what will be larger solutions with regard to particular formats.  
> This task is for identification of which such tools will be defined as 
> work-product and deliverables for Corinthia, even in an initial provisional 
> list.  Having identified structure points for defined deliverables should 
> aid in making different aspects of Corinthia available for development and 
> testing by many hands and eyes.
> SKETCH
> There are different levels of tools, and the layers provide fixtures for 
> exercising lower layers of code and also composing them into layers above.
> To be concrete, here is a sketch of the levels of tooling that can be 
> byproducts and aids in the confirmation of correct handling of a document 
> format.
> There are two "raw" formats that are handled in building document files of 
> interest to us: text files and Zip packages (or other carriers of composite 
> structures, such as MIME multi-part, tar files, Microsoft DocFiles, etc.).  
> There are flat file formats atop text-file formats.  Examples are Microsoft 
> RTF, XML, and HTML.  These are accompanied by character-set encoding 
> variations that must be dealt with.  There are also cases of linking that 
> arise in these formats.
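
One small part of dealing with encoding variations can be sketched as a byte-order-mark check. This is only an illustration in Python using the standard `codecs` module; the name `sniff_bom` is hypothetical, and a real tool would also have to handle files without a mark (for example by reading an XML declaration or an HTML meta element).

```python
import codecs
from typing import Optional

# Longer marks first: the UTF-32 LE mark begins with the UTF-16 LE mark.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def sniff_bom(data: bytes) -> Optional[str]:
    """Return the encoding implied by a leading byte-order mark, if any."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    return None
```
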
> RTF is a document format.  XML carries document formats such as the 
> single-file ODF format, the single-file XML formats defined for Microsoft 
> Office, etc.  There are already HTML-format usages that provide for fidelity 
> preservation in round trip between HTML and Microsoft Office formats.  There 
> may be something similar that has lived in OpenOffice.org.  These are very 
> handy formats for creation of simple test documents that exercise the 
> respective document models.  They also provide experience with the document 
> formats and efforts to abstract the document that is represented in those 
> formats.
> Zip usage as carriers raises its own needs for well-defined tools, both for 
> the inspection of document files and for the validation and forensic 
> analysis of the Zip usage for ODF, OOXML, and other formats, such as ePub.  
> Now we're dealing with composite document files with multiple parts using 
> flat formats, such as HTML and XML, and other formats, including binary 
> formats not mentioned as part of this progressive layering.  There are now 
> more elaborate structures to abstract from the parts of the Zip package and 
> the cross-references among them.
> These are all tooling opportunities and they support the testing and 
> confirmation of the development of the document-processing functions that 
> Corinthia makes available.
> The richness of this can be illustrated by the need for forensic and 
> validation tools and how they may become interdependent.
> Consider the simple verification of a Zip file.  There are two levels of 
> verification that matter.  
> First there is verification of the fundamental invariant structure that a 
> Zip archive must possess.  In practical use, it is desirable to rapidly 
> abstract the presence of a correct Zip and its components.  It is desirable 
> to be able to produce or update one efficiently.  One wants a fail-safe and 
> resilient response when an unacceptable Zip is encountered.
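
A minimal sketch of such a quick, fail-safe check, written in Python with only the standard `zipfile` module. The function name `quick_zip_check` is hypothetical and not part of Corinthia; it returns `None` instead of raising when the input is not an acceptable Zip.

```python
import io
import zipfile
from typing import List, Optional


def quick_zip_check(data: bytes) -> Optional[List[str]]:
    """Return the member names if data opens as a Zip archive, else None."""
    buf = io.BytesIO(data)
    # is_zipfile is a cheap signature-level check, not a full validation.
    if not zipfile.is_zipfile(buf):
        return None
    try:
        buf.seek(0)
        with zipfile.ZipFile(buf) as zf:
            # Opening parses the central directory; member data is not read.
            return zf.namelist()
    except zipfile.BadZipFile:
        return None
```

The check deliberately stops at the directory level, so it stays fast; deeper inspection and reporting belong in a separate tool.
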
> At the same time, one wants a way to assess and inspect a Zip, whether it is 
> well-formed or considered defective.  A separate tool would be handier for 
> that, but it is still needed to support document processing by inspecting 
> and reporting how the Zip is unacceptable.  That is more involved and not 
> something one wants to endure just to get started working with a document.  
> Even so, there is a good case for some reused common code as well, and 
> these kinds of tools aid in the confirmation of that code too.
> Suppose a Zip is concluded to be damaged.  Another level goes beyond 
> detection of damage to determination of how much of the Zip can be recovered 
> and what to do with the areas of damage.  This is about rescuing documents.  
> Yet another opportunity.  Yet another elaborate use that can involve some 
> shared underlying code.
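
A rough sketch of the rescue idea in Python, using the standard `zipfile` module. The name `salvage_members` is hypothetical, and the sketch assumes the central directory is still readable; recovering from a damaged directory would need lower-level scanning that is not shown here.

```python
import io
import zipfile
import zlib
from typing import Dict, List, Tuple


def salvage_members(data: bytes) -> Tuple[Dict[str, bytes], List[str]]:
    """Read each member independently; return (recovered, damaged names)."""
    recovered: Dict[str, bytes] = {}
    damaged: List[str] = []
    with zipfile.ZipFile(io.BytesIO(data)) as zf:
        for info in zf.infolist():
            try:
                # read() decompresses the member and verifies its CRC-32.
                recovered[info.filename] = zf.read(info)
            except (zipfile.BadZipFile, zlib.error, EOFError):
                damaged.append(info.filename)
    return recovered, damaged
```

Because each member is read on its own, damage in one part does not prevent the others from being recovered, and the list of damaged names can feed a report.
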
> We're now at the second level and that intersects with the use of a Zip as a 
> particular kind of document container.  A Zip may be well-formed, but there 
> are additional limitations and functions that go into recognizing the Zip 
> usage as a carrier of a particular document format.  It can even be a generic 
> carrier format, such as the Open Packaging Conventions (OPC) used for 
> carrying OOXML, XPS, and other artifacts, and the OpenDocument 1.2 Package 
> used for carrying ODF.
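
As a sketch of this second level, the following Python heuristic guesses which packaging convention a well-formed Zip follows. The name `classify_package` and its return labels are assumptions made for illustration. It relies on the `[Content_Types].xml` part used by OPC and on the leading `mimetype` entry used by ODF and ePub packages, and it is not a validator for any of them.

```python
import io
import zipfile

ODF_PREFIX = "application/vnd.oasis.opendocument."


def classify_package(data: bytes) -> str:
    """Guess the container convention: 'odf', 'epub', 'opc', or 'unknown'."""
    with zipfile.ZipFile(io.BytesIO(data)) as zf:
        names = zf.namelist()
        # OPC packages (OOXML, XPS) carry a [Content_Types].xml part.
        if "[Content_Types].xml" in names:
            return "opc"
        # ODF and ePub both put a 'mimetype' entry first in the archive.
        if names and names[0] == "mimetype":
            mimetype = zf.read("mimetype").decode("ascii", "replace").strip()
            if mimetype.startswith(ODF_PREFIX):
                return "odf"
            if mimetype == "application/epub+zip":
                return "epub"
    return "unknown"
```
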
> There need to be analysis and inspection tools at this second level of 
> generic Zip usage.  This also has a cross-over value in the forensic problem 
> of recovering what is recoverable in a damaged Zip archive.  When it is known 
> what additional structure is expected to be present, this can inform the 
> identification of breakage and determination of loss.
> It's not all one-sided.  What appears to be a well-formed Zip package for a 
> given document format can still expose damage in the recording or compression 
> (oh yes, compression and decompression) of any of its parts.
> This sketch is still at the plumbing level.  The abstraction of document 
> features is yet to happen.  That is another level up again.
> This is all just to point out how many opportunities for tools and supporting 
> libraries there are. The tools are important for bootstrapping up the levels 
> of Corinthia and for being able to check our own work, to devise tests and 
> demonstrations, and to provide forensic support in the face of problems that 
> may arise in the software or simply in circumstances that arise for users.
> The idea behind this task and its subtasks is to see what could be identified 
> as point deliverables, even if fundamentally for our own work process, so 
> that they become definable and something to work on, to be available in 
> higher levels of operation, etc.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
