[ https://issues.apache.org/jira/browse/TIKA-2462?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16167063#comment-16167063 ]
Nick Burch commented on TIKA-2462:
----------------------------------
I've just had a quick try with the library, against a test SAS file with 5
columns each of different types. Looking at the properties on the file, and on
the columns, Parso is able to return:
{{{
u64 - false
compressionMethod - null
endianness - 1
encoding - windows-1252
sessionEncoding - null
name - SHEET1
fileType - DATA
dateCreated - Fri Mar 06 19:10:19 GMT 2015
dateModified - Fri Mar 06 19:10:19 GMT 2015
sasRelease - 9.0101M3
serverType - XP_PRO
osName -
osType -
headerLength - 1024
pageLength - 8192
pageCount - 1
rowLength - 96
rowCount - 31
mixPageRowCount - 69
columnsCount - 5
5 Columns defined:
1 - A
Label: A
Format: $
Size 58 of java.lang.String
2 - B
Label: B
Format:
Size 8 of java.lang.Number
3 - C
Label: C
Format: DATE
Size 8 of java.lang.Number
4 - D
Label: D
Format: DATETIME
Size 8 of java.lang.Number
5 - E
Label: E
Format:
Size 8 of java.lang.Number
}}}
I guess we'd want to map some of the file properties onto standard keys, and
the rest onto custom ones? For the data, I guess we output SAX events for an
HTML-like table. I'm not sure about the column metadata: are there any patterns
we can copy from the database formats or other scientific dataset formats?
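
As a rough illustration of the "SAX events for an HTML-like table" idea, here is a minimal sketch that uses only the JDK's org.xml.sax classes. The class and method names (TableSaxSketch, emitTable, render) are hypothetical and are not part of Tika or Parso. It assumes the column names and row values have already been read from the file, and it emits a header row of th cells followed by one tr per data row.

```java
import java.util.List;
import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper for illustration only; not Tika or Parso code.
class TableSaxSketch {
    private static final String XHTML = "http://www.w3.org/1999/xhtml";

    // Emits <table>, one <tr> of <th> cells for the column names,
    // then one <tr> of <td> cells per data row.
    static void emitTable(ContentHandler handler, List<String> columnNames,
                          List<? extends List<?>> rows) throws SAXException {
        start(handler, "table");
        start(handler, "tr");
        for (String name : columnNames) {
            cell(handler, "th", name);
        }
        end(handler, "tr");
        for (List<?> row : rows) {
            start(handler, "tr");
            for (Object value : row) {
                // Null cells become empty <td> elements.
                cell(handler, "td", value == null ? "" : value.toString());
            }
            end(handler, "tr");
        }
        end(handler, "table");
    }

    private static void start(ContentHandler h, String name) throws SAXException {
        h.startElement(XHTML, name, name, new AttributesImpl());
    }

    private static void end(ContentHandler h, String name) throws SAXException {
        h.endElement(XHTML, name, name);
    }

    private static void cell(ContentHandler h, String name, String text) throws SAXException {
        start(h, name);
        char[] chars = text.toCharArray();
        h.characters(chars, 0, chars.length);
        end(h, name);
    }

    // Turns the events back into unescaped markup so the result can be checked by eye.
    static String render(List<String> columnNames, List<? extends List<?>> rows)
            throws SAXException {
        StringBuilder out = new StringBuilder();
        emitTable(new DefaultHandler() {
            @Override
            public void startElement(String uri, String local, String qName, Attributes atts) {
                out.append('<').append(local).append('>');
            }

            @Override
            public void endElement(String uri, String local, String qName) {
                out.append("</").append(local).append('>');
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                out.append(ch, start, length);
            }
        }, columnNames, rows);
        return out.toString();
    }

    public static void main(String[] args) throws SAXException {
        System.out.println(render(List.of("A", "B"), List.of(List.of("x", 1.5))));
    }
}
```

In a real parser the handler would be whatever ContentHandler the caller passes in, and the rows would be streamed rather than held in a list. Column labels and formats could go on the th cells as attributes, but that choice is left open here.
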
Also, we only seem to have one fairly simple test sas7bdat file in the Tika
Parsers test documents area. Do we have a standard "moderately complicated"
tabular test file (e.g. XLS, CSV) which I could get a SAS version made of, so
we can have largely the same test data between formats?
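
On the earlier question of standard versus custom keys, one possible shape for the split is sketched below with plain Java maps. Everything in it is an assumption for illustration: the class name, the "sas:" prefix for custom keys, and the choice of which three properties count as standard. The target names dc:title, dcterms:created and dcterms:modified are Dublin Core style keys of the kind Tika uses for its core properties. The values are kept as strings, so date formatting is out of scope.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper for illustration only; not Tika or Parso code.
class SasPropertyMappingSketch {
    // Assumed mapping from property names (as printed above) to standard keys.
    private static final Map<String, String> STANDARD = Map.of(
            "name", "dc:title",
            "dateCreated", "dcterms:created",
            "dateModified", "dcterms:modified");

    // Hypothetical prefix for everything that has no standard key.
    static final String CUSTOM_PREFIX = "sas:";

    static Map<String, String> toMetadata(Map<String, String> properties) {
        Map<String, String> metadata = new LinkedHashMap<>();
        for (Map.Entry<String, String> entry : properties.entrySet()) {
            String value = entry.getValue();
            // Skip properties with no value, such as osName or sessionEncoding above.
            if (value == null || value.isEmpty()) {
                continue;
            }
            String key = STANDARD.getOrDefault(entry.getKey(), CUSTOM_PREFIX + entry.getKey());
            metadata.put(key, value);
        }
        return metadata;
    }

    public static void main(String[] args) {
        Map<String, String> props = new LinkedHashMap<>();
        props.put("name", "SHEET1");
        props.put("osName", "");
        props.put("rowCount", "31");
        System.out.println(toMetadata(props));
    }
}
```
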
> Add a parser for sas7bdat
> -------------------------
>
> Key: TIKA-2462
> URL: https://issues.apache.org/jira/browse/TIKA-2462
> Project: Tika
> Issue Type: Improvement
> Reporter: Tim Allison
>
> EPAM recently agreed to migrate to Apache 2.0 so that we can incorporate
> parso into Tika for sas7bdat files: https://github.com/epam/parso/issues/19
--
This message was sent by Atlassian JIRA
(v6.4.14#64029)