[
https://issues.apache.org/jira/browse/TIKA-1513?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14506214#comment-14506214
]
Tim Allison edited comment on TIKA-1513 at 4/22/15 11:06 AM:
-------------------------------------------------------------
Looking at [this|http://www.dbf2002.com/dbf-file-format.html], I wonder if
we could also require 0x00 at byte offsets 30 and 31 in the magic?
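A minimal Python sketch of what that tightened check might look like. It assumes 30 and 31 are zero-based byte offsets into the file header (the bytes the linked format description lists as reserved), and `looks_like_dbase3` is a hypothetical helper name, not anything in Tika:

```python
def looks_like_dbase3(header: bytes) -> bool:
    """Return True if `header` passes the proposed dBASE III check.

    Assumption: offsets are zero-based, so the check needs at least
    32 bytes of header to inspect offsets 30 and 31.
    """
    if len(header) < 32:
        return False
    # Existing signal: the first byte is 0x03.
    # Proposed extra signal: bytes 30 and 31 are both 0x00.
    return header[0] == 0x03 and header[30] == 0x00 and header[31] == 0x00
```

Reading only the first 32 bytes of each file is enough for this check, so it stays cheap on a large corpus.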
In govdocs1, files that start with 0x03:
||file suffix||count||
|dbase3|2601|
|gls|60|
|bin|1|
In commoncrawl:
||file suffix||count||
|dbf| 532|
|ndx| 40|
|dct| 33|
|tfm| 12|
|ctg| 11|
|_bf| 2|
|cti| 2|
|stp| 2|
|NO_SUFFIX| 2|
|a04| 1|
|a05| 1|
|fw| 1|
|mxp| 1|
|pyc| 1|
|txt| 1|
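For anyone wanting to reproduce counts like these on another corpus, here is a rough Python sketch. It assumes the files are already extracted into a local directory tree; `tally_suffixes` and the `NO_SUFFIX` fallback label are illustrative choices, not the script actually used for the tables above:

```python
from collections import Counter
from pathlib import Path


def tally_suffixes(root, magic=b"\x03"):
    """Count file suffixes among files under `root` that start with `magic`."""
    counts = Counter()
    for path in Path(root).rglob("*"):
        if not path.is_file():
            continue
        # Only the leading bytes are needed to test the magic.
        with path.open("rb") as handle:
            if handle.read(len(magic)) != magic:
                continue
        # Files without an extension are grouped under NO_SUFFIX.
        suffix = path.suffix.lstrip(".").lower() or "NO_SUFFIX"
        counts[suffix] += 1
    return counts
```

Calling `tally_suffixes("corpus_dir").most_common()` (with `corpus_dir` standing in for wherever the files live) gives the suffixes ordered by count, which can then be printed as table rows.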
was (Author: [email protected]):
Looking at [this|http://www.dbf2002.com/dbf-file-format.html], I wonder if
we could also require 0x00 at byte offsets 30 and 31 in the magic?
I'm currently grepping the Common Crawl slice from Julien Nioche for files
starting with 0x03. The vast majority end in .dbf, but some end in .dct,
.ndx (dbf index?), .tfm, .ctg... Will report findings tomorrow.
> Add mime detection and parsing for dbf files
> --------------------------------------------
>
> Key: TIKA-1513
> URL: https://issues.apache.org/jira/browse/TIKA-1513
> Project: Tika
> Issue Type: Improvement
> Reporter: Tim Allison
> Priority: Minor
> Fix For: 1.9
>
>
> I just came across an Apache licensed dbf parser that is available on
> [maven|https://repo1.maven.org/maven2/org/jamel/dbf/dbf-reader/0.1.0/dbf-reader-0.1.0.pom].
> Let's add dbf parsing to Tika.
> Any other recommendations for alternate parsers?