[ 
https://issues.apache.org/jira/browse/TIKA-2641?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16470671#comment-16470671
 ] 

Nick Burch commented on TIKA-2641:
----------------------------------

I've generated test files for CSV, SAS7BDAT, XLS and XLSX using a small SAS 
program, and created an ODS file manually. No DB format files have been 
generated yet.

Currently, the XLS and XLSX file tests are part-disabled, because blank cells 
are being skipped in the SAX output. Not sure whether we want to enable 
missing-cells support in POI and output empty td's for these?
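
To make the "empty td's" option concrete, here is a minimal sketch of padding 
skipped cells when writing out a row. It does not use POI or Tika at all: the 
class BlankCellPadding, the method renderRow and the sparse column-index map 
are hypothetical names made up for illustration, and the sketch assumes the 
column count is already known and that values need no XML escaping.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper, not part of Tika or POI.
final class BlankCellPadding {

    // cells maps a zero-based column index to its text; absent keys are
    // the blank cells that would otherwise be skipped in the output.
    static String renderRow(Map<Integer, String> cells, int columnCount) {
        StringBuilder sb = new StringBuilder("<tr>");
        for (int col = 0; col < columnCount; col++) {
            String value = cells.get(col);
            if (value == null) {
                // Emit an empty td so later cells stay in their column.
                sb.append("<td/>");
            } else {
                // Assumption: value contains no characters needing escaping.
                sb.append("<td>").append(value).append("</td>");
            }
        }
        return sb.append("</tr>").toString();
    }

    public static void main(String[] args) {
        Map<Integer, String> cells = new TreeMap<>();
        cells.put(0, "a");
        cells.put(2, "c"); // column 1 is blank
        System.out.println(renderRow(cells, 3));
    }
}
```

With padding like this, a row with a blank middle cell keeps three td's, so 
column positions would line up with formats that always emit every cell.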

The ODS test is disabled because we're getting an "org.xml.sax.SAXException: 
Namespace http://www.w3.org/1999/xhtml not declared" error when trying to 
generate the XML version of the file with the TikaTest helper. Not sure if this 
is highlighting a parser bug on simple files, or a unit test helper mistake?

CSV is only part-enabled: because we don't have a dedicated CSV parser, we 
just get plain text output.
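
For comparing that plain text against the table output of the other formats, 
a rough sketch of splitting it back into rows and cells follows. The class 
PlainTextCsvRows and the method toRows are hypothetical, not existing Tika 
code, and the split is deliberately naive: it assumes comma-separated values 
with no quoted fields or embedded line breaks, so it is not a real CSV parser.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of Tika.
final class PlainTextCsvRows {

    // Splits plain text into rows of cells. Assumption: no quoted fields,
    // so every comma is a separator. Completely empty lines are skipped.
    static List<List<String>> toRows(String text) {
        List<List<String>> rows = new ArrayList<>();
        for (String line : text.split("\\R")) {
            if (line.isEmpty()) {
                continue;
            }
            // Limit -1 keeps trailing empty cells such as "1,2,".
            rows.add(Arrays.asList(line.split(",", -1)));
        }
        return rows;
    }

    public static void main(String[] args) {
        System.out.println(toRows("a,,c\n1,2,\n"));
    }
}
```

Blank cells survive as empty strings here, which is the same question as for 
XLS and XLSX: whether the other parsers should emit something for them too.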

SAS7BDAT testing is enabled.

> Unit test for consistency between tabular/columnar formats
> ----------------------------------------------------------
>
>                 Key: TIKA-2641
>                 URL: https://issues.apache.org/jira/browse/TIKA-2641
>             Project: Tika
>          Issue Type: Improvement
>          Components: parser
>    Affects Versions: 2.0, 1.18
>            Reporter: Nick Burch
>            Priority: Minor
>
> We now have a number of parsers which deal with file formats which are either 
> wholly or optionally "table-based" formats with consistency in the data types 
> held in a given column. This includes multi-table formats like sqlite, 
> single-table formats like sas7bdat, and anything-goes-table formats like csv 
> or xlsx.
> We should firstly try to create a simple-ish, small but rich file for each of 
> these formats, similar to what we do for archive formats with the 
> {{test-documents}} archives. Then, we should add unit tests that verify 
> that, as much as formats permit, you get basically the same XHTML out for the 
> "same" input. Oh, and fix up any obvious inconsistencies...



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
