[ 
https://issues.apache.org/jira/browse/TIKA-2858?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16823390#comment-16823390
 ] 

Ross Johnson commented on TIKA-2858:
------------------------------------

OK, I confirmed that the issue with the Unicode password is SASLPrep.

Acrobat normalizes the Unicode vulgar fractions, so the "real" password is 
slightly different from what was typed. PDFBox can open the file with the 
following password string:

{code:java}
// Note that the vulgar fractions have been normalized to plain digits
// separated by U+2044 (FRACTION SLASH)
  ! < > " \ € œ ¤ 1⁄4 1⁄2 𠜎 𩶘 😀  
{code}

I think SASLPrep is idempotent (applying it again to already-normalized 
output changes nothing), so I can do this normalization on the client side 
for now, and then it doesn't matter if or when PDFBox adds it. 
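
A minimal sketch of that client-side normalization, assuming NFKC alone is a 
close enough stand-in for SASLPrep for this particular password. Full SASLPrep 
(RFC 4013) also maps non-ASCII spaces to U+0020, drops some characters, and 
rejects prohibited ones; none of that is done here. The class and method names 
are made up for illustration.

```java
import java.text.Normalizer;

// Hypothetical helper, not part of Tika or PDFBox.
final class PasswordNormalizer {
    private PasswordNormalizer() {}

    // Applies only the NFKC step of SASLPrep (an assumption; see note above).
    static String normalize(String password) {
        return Normalizer.normalize(password, Normalizer.Form.NFKC);
    }

    public static void main(String[] args) {
        // U+00BC and U+00BD are the vulgar fractions one quarter and one half.
        String raw = "  \u00BC \u00BD  ";
        String once = normalize(raw);
        String twice = normalize(once);

        // Print the code points so the U+2044 FRACTION SLASH is visible.
        once.codePoints().forEach(cp -> System.out.printf("U+%04X ", cp));
        System.out.println();

        // NFKC is idempotent, so a second pass should change nothing.
        System.out.println("idempotent: " + once.equals(twice));
    }
}
```

The leading and trailing U+0020 spaces pass through unchanged, which matters 
for the sample password here.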

> JAXRS server: allow passwords with special chars (MIME encoded words)
> ---------------------------------------------------------------------
>
>                 Key: TIKA-2858
>                 URL: https://issues.apache.org/jira/browse/TIKA-2858
>             Project: Tika
>          Issue Type: Improvement
>          Components: server
>    Affects Versions: 1.20
>            Reporter: Ross Johnson
>            Priority: Minor
>         Attachments: protected - 4 space password.pdf, protected - Unicode 
> password.pdf
>
>
> Tika Server allows passing a document password in a special {{Password}} 
> request header; however, I don't believe this header allows for passwords 
> with non-US-ASCII characters, or for passwords with leading or trailing 
> spaces.
> One potential solution would be to allow MIME encoded-word values (RFC 2047) 
> in the password header so that one could specify any password with only 
> US-ASCII. This extra decoding could be enabled / disabled with some other 
> flag or header value, in order to avoid any breaking changes for clients that 
> are not encoding this header (e.g. if the password happens to literally be 
> "{{=?UTF-8?B??=}}").
> Attached are 2 sample PDF files that I'm unable to use with Tika Server due 
> to their passwords. These passwords are a bit contrived, but I have come 
> across this issue with real passwords. I've included the passwords in code 
> blocks to prevent the issue editor / viewer from collapsing multiple spaces 
> into one.
> The file named "{{protected - 4 space password.pdf}}" has a password of 4 
> literal spaces:
> {code:java}
> // Password is on line below (4 literal spaces)
>     
> {code}
> The file named "{{protected - Unicode password.pdf}}" has a password of 
> mostly special characters, with 2 leading spaces and 2 trailing spaces thrown 
> in for good measure:
> {code:java}
> // Password is on following line (with 2 leading spaces, 2 trailing spaces)
>   ! < > " \ € œ ¤ ¼ ½ 𠜎 𩶘 😀  
> {code}
>      
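
To illustrate the encoded-word idea from the issue description above: a sketch 
of how a client might wrap a password in an RFC 2047 "B" (Base64) encoded-word 
so that only US-ASCII goes into the header. This only shows the client-side 
encoding and a matching decode; whether Tika Server decodes such a value is 
exactly what the issue proposes, so do not assume it does. The class and method 
names are hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

// Hypothetical helper, not an existing Tika API.
final class EncodedWordPassword {
    private static final String PREFIX = "=?UTF-8?B?";
    private static final String SUFFIX = "?=";

    private EncodedWordPassword() {}

    // Wraps the UTF-8 bytes of the password as =?UTF-8?B?<base64>?=
    static String encode(String password) {
        byte[] utf8 = password.getBytes(StandardCharsets.UTF_8);
        return PREFIX + Base64.getEncoder().encodeToString(utf8) + SUFFIX;
    }

    // Reverses encode(); rejects anything that is not a UTF-8 "B" encoded-word.
    static String decode(String headerValue) {
        if (headerValue.length() < PREFIX.length() + SUFFIX.length()
                || !headerValue.startsWith(PREFIX)
                || !headerValue.endsWith(SUFFIX)) {
            throw new IllegalArgumentException("not a UTF-8 B encoded-word");
        }
        String base64 = headerValue.substring(
                PREFIX.length(), headerValue.length() - SUFFIX.length());
        return new String(Base64.getDecoder().decode(base64), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // The 4-space password from the first attachment.
        System.out.println(encode("    ")); // prints =?UTF-8?B?ICAgIA==?=
    }
}
```

Note that RFC 2047 limits a single encoded-word to 75 characters, so a long 
password would need to be split across several encoded-words; this sketch does 
not handle that.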



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
