[
https://issues.apache.org/jira/browse/TIKA-3706?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Tim Allison updated TIKA-3706:
------------------------------
Summary: Add a parser for HTTPResponse? (was: Handful of docs incorrectly
identified as rfc822)
> Add a parser for HTTPResponse?
> ------------------------------
>
> Key: TIKA-3706
> URL: https://issues.apache.org/jira/browse/TIKA-3706
> Project: Tika
> Issue Type: Task
> Reporter: Tim Allison
> Priority: Trivial
>
> In the recent regression tests, we found a handful of docs that are now
> incorrectly identified as rfc822.
>
> One example comes from PDFBox's jira
> (https://issues.apache.org/jira/browse/PDFBOX-2976 ):
> https://issues.apache.org/jira/secure/attachment/12757260/sc-356376.pdf
>
> As Tilman notes on the issue, the file actually includes HTTP headers
> before the PDF content:
> {noformat}
> HTTP/1.1 200 OK
> Cache-Control: private
> Pragma: Public
> Content-Type: application/pdf; charset=UTF-8
> Server: Microsoft-IIS/7.5
> Set-Cookie: ASP.NET_SessionId=ibc3nfydvyfh1z55zqis2q3y; path=/; HttpOnly
> Content-Disposition: inline; filename=_MTR_AGHS_EN.pdf
> X-AspNet-Version: 2.0.50727
> X-Powered-By: ASP.NET
> Date: Fri, 18 Sep 2015 17:30:08 GMT
> Content-Length: 56779
> %PDF-1.4
> {noformat}
> I'm not sure how, or if, we want to fix these. I'm going to look at the
> others. Yes, others also come with HTTP headers:
> https://corpora.tika.apache.org/base/docs/commoncrawl3/QL/QLPQA77R36REFEF3ICLL2NPTXWJXKV54
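One way to picture what an HTTPResponse parser would need to do first: recognize the status line, skip the headers, and hand the remaining bytes to normal detection. The following is only a minimal sketch under stated assumptions, not Tika code. The class name HttpResponsePreamble, the method bodyOffset, and the 8192-byte scan limit are all hypothetical. It also assumes the headers end with a blank line, as HTTP specifies, which the {noformat} excerpt above does not show.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of Tika: finds where the payload starts in a
// file that begins with a captured HTTP response (status line plus headers).
final class HttpResponsePreamble {

    private static final byte[] PREFIX = "HTTP/1.".getBytes(StandardCharsets.US_ASCII);
    // Assumed cap on how far to scan for the end of the headers.
    private static final int MAX_HEADER_BYTES = 8192;

    private HttpResponsePreamble() {
    }

    /**
     * Returns the offset of the first body byte, or -1 if the data does not
     * start with an HTTP/1.x status line or no blank line ends the headers
     * within the scan limit.
     */
    static int bodyOffset(byte[] data) {
        if (data.length < PREFIX.length) {
            return -1;
        }
        for (int i = 0; i < PREFIX.length; i++) {
            if (data[i] != PREFIX[i]) {
                return -1;
            }
        }
        int limit = Math.min(data.length, MAX_HEADER_BYTES);
        for (int i = PREFIX.length; i < limit; i++) {
            if (data[i] != '\n') {
                continue;
            }
            // Blank line written as LF LF.
            if (i + 1 < limit && data[i + 1] == '\n') {
                return i + 2;
            }
            // Blank line written as (CR) LF CR LF.
            if (i + 2 < limit && data[i + 1] == '\r' && data[i + 2] == '\n') {
                return i + 3;
            }
        }
        return -1;
    }
}
```

A caller could wrap the bytes from that offset onward in a new stream and run detection again, so the embedded PDF would be identified as application/pdf. Whether that belongs in a dedicated parser or in the detector is the open question of this issue.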