Nick Burch created TIKA-2641:
--------------------------------

             Summary: Unit test for consistency between tabular/columnar formats
                 Key: TIKA-2641
                 URL: https://issues.apache.org/jira/browse/TIKA-2641
             Project: Tika
          Issue Type: Improvement
          Components: parser
    Affects Versions: 1.18, 2.0
            Reporter: Nick Burch


We now have a number of parsers that handle file formats which are either 
wholly or optionally "table-based", with a consistent data type held in each 
column. This includes multi-table formats like sqlite, single-table formats 
like sas7bdat, and anything-goes-table formats like csv or xlsx.

We should first try to create a simple-ish, small but rich file for each of 
these formats, similar to what we do for archive formats with the 
{{test-documents}} archives. Then, we should add unit tests that verify that, 
as far as the formats permit, you get basically the same XHTML out for the 
"same" input. Oh, and fix up any obvious inconsistencies...
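
As a rough illustration of the comparison such a test might make, here is a 
minimal sketch. It assumes each parser's XHTML output has already been 
captured as a well-formed XML string with no DOCTYPE. The class and method 
names ({{TableXhtmlComparer}}, {{extractRows}}, {{sameTables}}) are 
hypothetical and not part of Tika; it uses only the JDK's DOM parser.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of Tika: reduces XHTML to its table cells
// so that output from different parsers can be compared structurally.
final class TableXhtmlComparer {

    // Returns every <tr> in document order as a list of trimmed cell texts.
    // Only direct <td>/<th> children of each row are collected.
    static List<List<String>> extractRows(String xhtml) {
        Document doc;
        try {
            // The factory is not namespace-aware by default, so "tr" matches
            // whether or not the XHTML default namespace is declared.
            doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("Input is not well-formed XML", e);
        }
        List<List<String>> rows = new ArrayList<>();
        NodeList trs = doc.getElementsByTagName("tr");
        for (int i = 0; i < trs.getLength(); i++) {
            List<String> cells = new ArrayList<>();
            NodeList children = trs.item(i).getChildNodes();
            for (int j = 0; j < children.getLength(); j++) {
                Node child = children.item(j);
                if (child instanceof Element) {
                    String tag = ((Element) child).getTagName();
                    if (tag.equals("td") || tag.equals("th")) {
                        cells.add(child.getTextContent().trim());
                    }
                }
            }
            rows.add(cells);
        }
        return rows;
    }

    // True when both documents contain the same rows and cell texts,
    // ignoring wrapper elements and surrounding whitespace.
    static boolean sameTables(String xhtmlA, String xhtmlB) {
        return extractRows(xhtmlA).equals(extractRows(xhtmlB));
    }
}
```

A unit test could then parse the csv, xlsx, sqlite and sas7bdat versions of 
the "same" data and assert {{sameTables}} on each pair. Comparing extracted 
cell text rather than raw XHTML is a deliberate choice in this sketch, so 
that harmless differences in wrapper elements do not fail the test.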



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
