Nick Burch commented on TIKA-2597:

Trying to fully re-implement the Windows case-insensitivity rules doesn't sound 
that much fun... Unless someone can find a small library / JRE system function 
that does it for us?
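
As a rough illustration of a cheaper middle ground, the sketch below tracks the names already handed out in lower-cased form and adds a numeric suffix on a clash. It is only an approximation: `CaseInsensitiveNamer` and `uniqueName` are hypothetical names, not existing Tika code; folding with `toLowerCase(Locale.ROOT)` is not the same as the real Windows/NTFS comparison rules; and the `(n)` suffix placement before the extension is an assumption about how the extractor names duplicates.

```java
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

/** Hypothetical helper (not part of Tika): hands out names that are unique ignoring case. */
final class CaseInsensitiveNamer {
    // Lower-cased forms of every name handed out so far.
    private final Set<String> used = new HashSet<>();

    /** Returns the name unchanged if it is free, otherwise the name with a (n) suffix. */
    String uniqueName(String name) {
        int dot = name.lastIndexOf('.');
        String base = dot > 0 ? name.substring(0, dot) : name;
        String ext = dot > 0 ? name.substring(dot) : "";
        String candidate = name;
        int n = 0;
        // Set.add returns false when the case-folded name is already taken.
        while (!used.add(candidate.toLowerCase(Locale.ROOT))) {
            n++;
            candidate = base + "(" + n + ")" + ext;
        }
        return candidate;
    }
}
```

With this, asking for "FW Letter.msg" and then "FW letter.msg" would give two distinct names on disk, the second one carrying the suffix.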

Otherwise, Microsoft have been doing some work recently to fix various Windows 
bugs and limitations around their case-sensitivity. You might find it easier to 
just turn that on for your extraction directories! Details from a few days ago 

> Attachment Extraction Case Sensitivity
> --------------------------------------
>                 Key: TIKA-2597
>                 URL: https://issues.apache.org/jira/browse/TIKA-2597
>             Project: Tika
>          Issue Type: Bug
>          Components: app
>    Affects Versions: 1.17
>         Environment: windows
>            Reporter: Todd Dixon
>            Priority: Major
> Using the --extract option on a PDF with embedded files, I am seeing that not 
> all of the attachments are extracted.  Several of the embedded files have 
> the same name.  Names that are exactly the same are handled by adding 
> a suffix of (1) etc.  However, when two names differ only in case, 
> the parser does not add the suffix to the second name, 
> and thus overwrites the first file on disk.  Example
> FW Letter.msg
> FW letter.msg
> Will result in only one attachment extracted.  Would it be possible to update 
> the filename comparison to account for Windows file systems, which see those 
> two files as the same name?
> Thanks!
