All,
Over on TIKA-1730 [0], we have a request to hide formatting info from
header/footer records for both xls and xlsx during text extraction.
When I look at the text from FooterCell's getText(), it looks like we may
want to parse the string into subcomponents for a HeaderCell/FooterCell.
Some useful information from Microsoft is here [1].
For example, from Tika's testExcel_headers_footers.xls file:
&LFooter - Corporate Spreadsheet&CFooter - For Internal Use Only&RFooter - Author: John Smith
Note, though, that for the xlsx file the left/center/right components are
already parsed and stored separately:
Footer - Corporate Spreadsheet Footer - For Internal Use Only Footer - Author: John Smith
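To make the .xls case concrete, here is a rough sketch of how the raw string
could be split into the same left/center/right pieces. HeaderFooterSections
and split() are hypothetical names of mine, not existing POI or Tika API, and
I am assuming that text appearing before any &L/&C/&R marker belongs to the
center section and that the markers are always upper case.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper for illustration only; not part of POI or Tika. */
final class HeaderFooterSections {

    private HeaderFooterSections() {}

    /**
     * Splits a raw header/footer string into "left", "center" and "right".
     * Assumption: text before any &L/&C/&R marker goes to the center section.
     * All other codes (fonts, sizes, fields, the "&&" escape) are copied
     * through untouched so they can be handled in a later step.
     */
    static Map<String, String> split(String raw) {
        StringBuilder left = new StringBuilder();
        StringBuilder center = new StringBuilder();
        StringBuilder right = new StringBuilder();
        StringBuilder current = center;

        int i = 0;
        while (i < raw.length()) {
            char c = raw.charAt(i);
            if (c == '&' && i + 1 < raw.length()) {
                char code = raw.charAt(i + 1);
                if (code == 'L') {
                    current = left;
                } else if (code == 'C') {
                    current = center;
                } else if (code == 'R') {
                    current = right;
                } else {
                    // Not a section marker: keep the two-character code as-is.
                    current.append(c).append(code);
                }
                i += 2;
            } else {
                current.append(c);
                i++;
            }
        }

        Map<String, String> sections = new LinkedHashMap<>();
        sections.put("left", left.toString());
        sections.put("center", center.toString());
        sections.put("right", right.toString());
        return sections;
    }
}
```

Calling HeaderFooterSections.split() on the footer string from
testExcel_headers_footers.xls should give the same three pieces that the xlsx
side already exposes. A font name that itself contains "&L", "&C" or "&R"
would confuse this sketch; I have not checked whether that can occur.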
From the TIKA-1730 .xls file:
&C&"Arial,Bold"&11&F
From the TIKA-1730.xlsx file:
&"Arial,Bold"&11&F
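For strings like these two, a minimal sketch of the stripping step might look
like the following. Again, HeaderFooterCodes and strip() are hypothetical
names, not existing API. The regex only covers quoted font specs, numeric
font sizes, single-letter codes and the "&&" escape; the full list of codes
in [1] would need to be checked before relying on it. Note that field codes
such as &F are simply dropped here, not resolved.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper for illustration only; not part of POI or Tika. */
final class HeaderFooterCodes {

    // One '&' followed by: another '&' (escaped ampersand), a quoted font
    // spec such as "Arial,Bold", a run of digits (font size), or one letter.
    private static final Pattern CODE =
            Pattern.compile("&(&|\"[^\"]*\"|\\d+|[A-Za-z])");

    private HeaderFooterCodes() {}

    /** Removes formatting/field codes and unescapes "&&" to "&". */
    static String strip(String raw) {
        Matcher m = CODE.matcher(raw);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            // Keep a literal ampersand for "&&"; drop every other code.
            String replacement = "&".equals(m.group(1)) ? "&" : "";
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }
}
```

With this, both TIKA-1730 strings above reduce to the empty string, since
they contain nothing but codes. One known weakness: a font size immediately
followed by literal digits is ambiguous, and the greedy \d+ will swallow
those digits as well.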
Has anyone worked with this area of our code base recently? Is this something
we should add/fix at the POI level or at the Tika level?
Thank you.
Cheers,
Tim
[0] https://issues.apache.org/jira/browse/TIKA-1730
[1] https://support.microsoft.com/en-us/kb/142136