[ https://issues.apache.org/jira/browse/TIKA-1958?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Tim Allison updated TIKA-1958:
------------------------------
Attachment: excel_msword_2003.tar.bz2
Better grep, cleaner results. There's even one InfoPath 2003 doc in our corpus.
I have an initial patch for Word and Excel, though some work remains. I'll
commit it once the vote for 1.13 succeeds.
> Add mime detection and lightweight parsers for Office 2003 Word and Excel formats
> ----------------------------------------------------------------------------------
>
> Key: TIKA-1958
> URL: https://issues.apache.org/jira/browse/TIKA-1958
> Project: Tika
> Issue Type: Improvement
> Reporter: Tim Allison
> Assignee: Tim Allison
> Priority: Minor
> Attachments: 2010-cal-eu.xls, excel_msword_2003.tar.bz2
>
>
> Over on POI, a user asked whether we support Excel 2003 XML (xls) files. It
> would be neat if we could add mime detection and a "good enough" parser to
> handle Office 2003 XML xls and doc files.
> This could be a great task for someone who wants to get started contributing
> to Tika.
> References:
> https://mail-archives.apache.org/mod_mbox/poi-user/201604.mbox/%3Calpine.BSO.2.20.1604210825140.22929%40ref.nmedia.net%3E
> https://en.wikipedia.org/wiki/Microsoft_Office_XML_formats
> https://msdn.microsoft.com/en-us/library/bb226687(v=office.11).aspx
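
The detection half of this request can be sketched briefly. The Java example
below is an illustration only: it is not the patch mentioned above and it does
not use Tika's detector API. The class name Office2003XmlSniffer, the method
sniff, and the returned labels are all hypothetical. It assumes the two formats
can be told apart by the root element: Workbook in the namespace
urn:schemas-microsoft-com:office:spreadsheet for Excel 2003 XML, and
wordDocument in http://schemas.microsoft.com/office/word/2003/wordml for
Word 2003 XML.

```java
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper, not part of Tika or POI.
class Office2003XmlSniffer {
    // Root-element namespaces of the Office 2003 XML formats.
    static final String EXCEL_NS = "urn:schemas-microsoft-com:office:spreadsheet";
    static final String WORD_NS = "http://schemas.microsoft.com/office/word/2003/wordml";

    // Returns "excel-2003-xml", "word-2003-xml", or "unknown".
    // These labels are placeholders, not registered MIME types.
    static String sniff(String xml) {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        // Do not process DTDs or external entities in untrusted input.
        factory.setProperty(XMLInputFactory.SUPPORT_DTD, Boolean.FALSE);
        factory.setProperty(XMLInputFactory.IS_SUPPORTING_EXTERNAL_ENTITIES, Boolean.FALSE);
        try {
            XMLStreamReader reader = factory.createXMLStreamReader(new StringReader(xml));
            while (reader.hasNext()) {
                // Skip comments and processing instructions (such as
                // mso-application) until the root element is reached.
                if (reader.next() == XMLStreamConstants.START_ELEMENT) {
                    String ns = reader.getNamespaceURI();
                    String name = reader.getLocalName();
                    if (EXCEL_NS.equals(ns) && "Workbook".equals(name)) {
                        return "excel-2003-xml";
                    }
                    if (WORD_NS.equals(ns) && "wordDocument".equals(name)) {
                        return "word-2003-xml";
                    }
                    return "unknown"; // only the root element is inspected
                }
            }
        } catch (XMLStreamException e) {
            return "unknown"; // not well-formed XML
        }
        return "unknown";
    }
}
```

A real detector would read only a bounded prefix of the input stream rather
than a whole String, and could also check the mso-application processing
instruction (progid "Excel.Sheet" or "Word.Document") as a second signal.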