[ https://issues.apache.org/jira/browse/TIKA-1238?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14633511#comment-14633511 ]
Tim Allison commented on TIKA-1238:
-----------------------------------

Got it. For now, let's see if I can find some triggering files in a fresh pull of .msg files from CommonCrawl via [~centic]'s very handy CommonCrawl [downloader|https://github.com/centic9/CommonCrawlDocumentDownload].

> Update OutlookExtractor to handle codepage identification more rigorously
> -------------------------------------------------------------------------
>
>                 Key: TIKA-1238
>                 URL: https://issues.apache.org/jira/browse/TIKA-1238
>             Project: Tika
>          Issue Type: Improvement
>          Components: parser
>            Reporter: Tim Allison
>            Assignee: Tim Allison
>            Priority: Minor
>             Fix For: 1.10
>
>
> Since OutlookExtractor's codepage detection chunk was written, POI's HSMF has
> added more robust capabilities for identifying codepages in Outlook .msg
> files. As a first step to integrating those improvements, I'll copy and
> paste some of POI's code into OutlookExtractor. As a second step, I'll
> expose more of HSMF's capabilities within POI and then factor out the
> duplicate code in Tika.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)