[jira] [Updated] (TIKA-3706) Add a parser for HTTPResponse?

Tim Allison (Jira) Fri, 25 Mar 2022 03:19:04 -0700


     [ 
https://issues.apache.org/jira/browse/TIKA-3706?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Tim Allison updated TIKA-3706:
------------------------------
    Summary: Add a parser for HTTPResponse?  (was: Handful of docs incorrectly 
identified as rfc822)

> Add a parser for HTTPResponse?
> ------------------------------
>
>                 Key: TIKA-3706
>                 URL: https://issues.apache.org/jira/browse/TIKA-3706
>             Project: Tika
>          Issue Type: Task
>            Reporter: Tim Allison
>            Priority: Trivial
>
> In the recent regression tests, we found a small handful of docs that are 
> now incorrectly identified as rfc822.
>  
> One example comes from PDFBox's jira 
> (https://issues.apache.org/jira/browse/PDFBOX-2976 ):
> https://issues.apache.org/jira/secure/attachment/12757260/sc-356376.pdf
>  
> As Tilman notes on the issue, the file actually includes HTTP headers 
> before the PDF content:
> {noformat}
> HTTP/1.1 200 OK
> Cache-Control: private
> Pragma: Public
> Content-Type: application/pdf; charset=UTF-8
> Server: Microsoft-IIS/7.5
> Set-Cookie: ASP.NET_SessionId=ibc3nfydvyfh1z55zqis2q3y; path=/; HttpOnly
> Content-Disposition: inline; filename=_MTR_AGHS_EN.pdf
> X-AspNet-Version: 2.0.50727
> X-Powered-By: ASP.NET
> Date: Fri, 18 Sep 2015 17:30:08 GMT
> Content-Length: 56779
> %PDF-1.4 
> {noformat}
> I'm not sure how or if we want to fix these.  I'm going to look at the 
> others.  Yes, the others also come with HTTP headers: 
> https://corpora.tika.apache.org/base/docs/commoncrawl3/QL/QLPQA77R36REFEF3ICLL2NPTXWJXKV54
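For illustration only, here is a minimal sketch of how a file that starts with an HTTP response could be split into its headers and the embedded payload. It is not Tika code and not a proposed implementation: the function name split_http_response and both regular expressions are hypothetical, and the sketch assumes the raw bytes contain the standard blank line between the header block and the body (the excerpt above does not show one, so real files may need a fallback such as scanning for the %PDF- marker).

```python
import re

# Status line of an HTTP/1.x response, e.g. b"HTTP/1.1 200 OK".
STATUS_LINE = re.compile(rb"HTTP/\d\.\d \d{3}[^\r\n]*\r?\n")
# Blank line that ends the header block (CRLF CRLF, or bare LF LF).
# Assumption: the saved file still contains this separator.
HEADER_END = re.compile(rb"\r?\n\r?\n")


def split_http_response(data: bytes):
    """Return (headers, body) if data starts with an HTTP response, else None.

    Hypothetical helper for illustration. Header names are lower-cased,
    and a repeated header keeps only its last value.
    """
    if not STATUS_LINE.match(data):
        return None
    end = HEADER_END.search(data)
    if end is None:
        return None
    header_block = data[:end.start()].decode("iso-8859-1")
    headers = {}
    for line in header_block.splitlines()[1:]:  # skip the status line
        name, sep, value = line.partition(":")
        if sep:
            headers[name.strip().lower()] = value.strip()
    return headers, data[end.end():]


# Usage with a shortened version of the headers quoted in the issue.
sample = (
    b"HTTP/1.1 200 OK\r\n"
    b"Content-Type: application/pdf; charset=UTF-8\r\n"
    b"Content-Length: 56779\r\n"
    b"\r\n"
    b"%PDF-1.4 \n"
)
result = split_http_response(sample)
if result is not None:
    headers, body = result
    print(headers.get("content-type"), body[:8])
```

With the payload separated out, a detector could run on the body alone, and the Content-Type header could serve as an additional hint rather than as the final answer.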



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
