All,
  Over on TIKA-1730 [0], we have a request to hide formatting info from 
header/footer records for both xls and xlsx during text extraction.  
  When I look at the text from FooterCell's getText(), it looks like we may 
want to parse the string into its subcomponents for a 
HeaderCell/FooterCell.  Some useful information from Microsoft is here [1].
  
For example, from Tika's testExcel_headers_footers.xls file:

&LFooter - Corporate Spreadsheet&CFooter - For Internal Use Only&RFooter - 
Author: John Smith

Note, though, that for the xlsx file the left/center/right components are 
already parsed and stored separately:
Footer - Corporate Spreadsheet Footer - For Internal Use Only Footer - Author: 
John Smith

From the TIKA-1730 .xls file:

&C&"Arial,Bold"&11&F

From the TIKA-1730 .xlsx file:
&"Arial,Bold"&11&F
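
To make the idea concrete, here is a rough sketch of the kind of parsing I have 
in mind. It is not existing POI or Tika code: the class and method names 
(HeaderFooterSketch, splitSections, stripCodes) are made up for illustration. 
It assumes text with no &L/&C/&R marker belongs to the center section, and it 
simply drops field codes such as &F and &P rather than expanding them. It does 
not try to cover every code (for example, the color code used in xlsx).

```java
class HeaderFooterSketch {

    /** Splits a raw header/footer string into {left, center, right}. */
    static String[] splitSections(String raw) {
        StringBuilder[] parts = {
            new StringBuilder(), new StringBuilder(), new StringBuilder() };
        int current = 1; // assumption: unmarked text is treated as centered
        int i = 0;
        while (i < raw.length()) {
            char c = raw.charAt(i);
            char next = (i + 1 < raw.length()) ? raw.charAt(i + 1) : '\0';
            if (c == '&' && (next == 'L' || next == 'C' || next == 'R')) {
                current = (next == 'L') ? 0 : (next == 'C') ? 1 : 2;
                i += 2;
            } else if (c == '&' && next == '&') {
                // keep an escaped ampersand intact so "&&L" is not a marker
                parts[current].append("&&");
                i += 2;
            } else {
                parts[current].append(c);
                i++;
            }
        }
        return new String[] {
            parts[0].toString(), parts[1].toString(), parts[2].toString() };
    }

    /** Removes font, size, style and field codes; turns "&&" into "&". */
    static String stripCodes(String s) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < s.length()) {
            char c = s.charAt(i);
            if (c != '&' || i + 1 >= s.length()) {
                out.append(c);
                i++;
                continue;
            }
            char next = s.charAt(i + 1);
            if (next == '&') {                      // escaped ampersand
                out.append('&');
                i += 2;
            } else if (next == '"') {               // &"Font,Style"
                int close = s.indexOf('"', i + 2);
                i = (close < 0) ? s.length() : close + 1;
            } else if (Character.isDigit(next)) {   // &11 (font size)
                i++;
                while (i < s.length() && Character.isDigit(s.charAt(i))) {
                    i++;
                }
            } else if (Character.isLetter(next)) {  // &B, &I, &F, &P, ...
                i += 2;
            } else {                                // lone '&', keep it
                out.append(c);
                i++;
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String raw = "&LFooter - Corporate Spreadsheet"
            + "&CFooter - For Internal Use Only"
            + "&RFooter - Author: John Smith";
        for (String section : splitSections(raw)) {
            System.out.println(stripCodes(section));
        }
    }
}
```

With something like this, the TIKA-1730 strings above would reduce to empty 
text, since they contain only font, size and file-name codes.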

Has anyone worked with this area of our code base recently?  Is this something 
we should add/fix at the POI level or at the Tika level?

Thank you.

Cheers,

            Tim

[0] https://issues.apache.org/jira/browse/TIKA-1730
[1] https://support.microsoft.com/en-us/kb/142136 


