Nick Burch created TIKA-2641:
--------------------------------
Summary: Unit test for consistency between tabular/columnar formats
Key: TIKA-2641
URL: https://issues.apache.org/jira/browse/TIKA-2641
Project: Tika
Issue Type: Improvement
Components: parser
Affects Versions: 1.18, 2.0
Reporter: Nick Burch
We now have a number of parsers that deal with file formats which are either
wholly or optionally "table-based" formats with consistency in the data types
held in a given column. These include multi-table formats like sqlite,
single-table formats like sas7bdat, and anything-goes-table formats like csv or
xlsx.
We should first try to create a simple-ish, small but rich file for each of
these formats, similar to what we do for archive formats with the
{{test-documents}} archives. Then, we should add unit tests that verify that,
as much as the formats permit, you get basically the same XHTML out for the
"same" input. Oh, and fix up any obvious inconsistencies...
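As a rough illustration of what "basically the same XHTML" could mean in such a test, here is a minimal sketch using only the JDK. It assumes the XHTML for each test file has already been produced as a well-formed string without a DOCTYPE, and that tables are expressed as tr/td/th elements. The class and method names (TableConsistencySketch, extractRows, sameTable) are hypothetical and not part of Tika. It reduces each document to rows of trimmed cell text and compares those, so differences in namespaces or whitespace are ignored.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper, not a Tika class: compares the tabular content of two XHTML strings.
final class TableConsistencySketch {

    /** Returns every tr element, in document order, as a list of trimmed td/th texts. */
    static List<List<String>> extractRows(String xhtml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true); // so getLocalName() works with or without the XHTML namespace
        DocumentBuilder builder = factory.newDocumentBuilder();
        Document doc = builder.parse(new InputSource(new StringReader(xhtml)));

        List<List<String>> rows = new ArrayList<>();
        NodeList trs = doc.getElementsByTagNameNS("*", "tr");
        for (int i = 0; i < trs.getLength(); i++) {
            List<String> cells = new ArrayList<>();
            NodeList children = trs.item(i).getChildNodes();
            for (int j = 0; j < children.getLength(); j++) {
                Node child = children.item(j);
                if (child instanceof Element) {
                    String name = child.getLocalName();
                    if ("td".equals(name) || "th".equals(name)) {
                        cells.add(child.getTextContent().trim());
                    }
                }
            }
            rows.add(cells);
        }
        return rows;
    }

    /** True when both documents contain the same rows of cell text, in the same order. */
    static boolean sameTable(String xhtmlA, String xhtmlB) throws Exception {
        return extractRows(xhtmlA).equals(extractRows(xhtmlB));
    }
}
```

In a real test, the two strings would come from parsing the "same" data stored as, say, csv and xlsx. Where a format cannot represent something (column headers, typed dates), the comparison would need to be relaxed for that format.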