[
https://issues.apache.org/jira/browse/TIKA-3864?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Tim Allison resolved TIKA-3864.
-------------------------------
Fix Version/s: 2.5.1
Resolution: Fixed
> Non-ascii UTF-8 characters in fetchKey not working with FileSystemFetcher
> -------------------------------------------------------------------------
>
> Key: TIKA-3864
> URL: https://issues.apache.org/jira/browse/TIKA-3864
> Project: Tika
> Issue Type: Bug
> Components: tika-pipes, tika-server
> Affects Versions: 2.4.1
> Environment: debian:bullseye docker container running
> tika-server-standard-2.4.1.jar
> Reporter: Tong Wang
> Priority: Major
> Fix For: 2.5.1
>
>
> When using FileSystemFetcher with a fetchKey that contains non-ASCII
> characters, Tika Server throws an exception because the resolved file name
> is incorrect. Here is an example:
> {code:java}
> curl -v -X PUT http://tika:9998/rmeta/text --header "fetcherName: restricted" \
>   --header "fetchKey: 中文.txt"
> {code}
> I get a java.nio.file.NoSuchFileException:
> {code:java}
> Caused by: java.nio.file.NoSuchFileException: /restricted/䏿.txt
>     at java.base/sun.nio.fs.UnixException.translateToIOException(UnixException.java:92)
>     at java.base/sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:111)
>     at java.base/sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:116)
>     at java.base/sun.nio.fs.UnixPath.toRealPath(UnixPath.java:860)
>     at org.apache.tika.pipes.fetcher.fs.FileSystemFetcher.fetch(FileSystemFetcher.java:64)
>     at org.apache.tika.server.core.FetcherStreamFactory.getInputStream(FetcherStreamFactory.java:90)
>     at org.apache.tika.server.core.resource.TikaResource.getInputStream(TikaResource.java:159)
> {code}
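The garbled name in the exception above would be consistent with the UTF-8 bytes of the header value being decoded with a single-byte charset such as ISO-8859-1 somewhere between curl and FileSystemFetcher. That is an assumption; this message does not state the root cause. A minimal Java sketch of the effect follows. The class and method names are hypothetical and are not Tika APIs.

```java
import java.nio.charset.StandardCharsets;

public class HeaderMojibakeDemo {

    // Simulates a receiver that reads UTF-8 header bytes as ISO-8859-1
    // (assumed failure mode, not confirmed by the issue).
    static String misdecode(String original) {
        byte[] wire = original.getBytes(StandardCharsets.UTF_8);
        return new String(wire, StandardCharsets.ISO_8859_1);
    }

    // Reverses the mis-decoding. This round trip is lossless because
    // ISO-8859-1 maps every byte value to exactly one char.
    static String repair(String garbled) {
        byte[] raw = garbled.getBytes(StandardCharsets.ISO_8859_1);
        return new String(raw, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // "\u4e2d\u6587.txt" is the fetchKey from the report, written with
        // Unicode escapes so the source file encoding does not matter.
        String key = "\u4e2d\u6587.txt";
        String garbled = misdecode(key);

        // Each of the two CJK characters becomes three chars: 3 + 3 + 4 = 10.
        System.out.println("garbled length: " + garbled.length());
        System.out.println("repaired equals original: " + repair(garbled).equals(key));
    }
}
```

If the assumption holds, the path handed to the file system has ten characters instead of six, so no file with that name exists and `toRealPath` fails as shown in the trace.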
>
> When I percent-encode the characters instead:
> {code:java}
> curl -v -X PUT http://tika:9998/rmeta/text --header "fetcherName: restricted" \
>   --header "fetchKey: %E4%B8%AD%E6%96%87.txt"
> {code}
> I still get a java.nio.file.NoSuchFileException:
> {code:java}
> Caused by: java.nio.file.NoSuchFileException: /restricted/%E4%B8%AD%E6%96%87.txt
>     at java.base/sun.nio.fs.UnixException.translateToIOException(UnixException.java:92)
>     at java.base/sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:111)
>     at java.base/sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:116)
>     at java.base/sun.nio.fs.UnixPath.toRealPath(UnixPath.java:860)
>     at org.apache.tika.pipes.fetcher.fs.FileSystemFetcher.fetch(FileSystemFetcher.java:64)
>     at org.apache.tika.server.core.FetcherStreamFactory.getInputStream(FetcherStreamFactory.java:90)
>     at org.apache.tika.server.core.resource.TikaResource.getInputStream(TikaResource.java:159)
> {code}
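The second trace shows the percent-encoded text reaching the file system unchanged, so in 2.4.1 the fetchKey header was evidently used as-is rather than decoded. For reference, this is roughly what percent-decoding such a key as UTF-8 looks like in plain Java. It is only a sketch: `decodeFetchKey` is a hypothetical helper, and this message does not say whether the 2.5.1 fix works this way.

```java
import java.net.URLDecoder;
import java.nio.charset.StandardCharsets;

public class FetchKeyDecodeDemo {

    // Hypothetical helper: percent-decodes a fetchKey as UTF-8.
    // Requires Java 10+ for the Charset overload of URLDecoder.decode.
    // Caveat: URLDecoder follows form-encoding rules, so a literal '+'
    // in the key would be turned into a space.
    static String decodeFetchKey(String raw) {
        return URLDecoder.decode(raw, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String decoded = decodeFetchKey("%E4%B8%AD%E6%96%87.txt");

        // Compare against the intended name, written with Unicode escapes
        // so the check does not depend on the source file encoding.
        System.out.println("decoded matches: " + decoded.equals("\u4e2d\u6587.txt"));
    }
}
```

Sending the key percent-encoded keeps the header pure ASCII, which sidesteps any question of which charset the server applies to header bytes, provided the server decodes it before resolving the path.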
> BTW, the locale is set to C.UTF-8 on the Tika Server:
> {code:java}
> # locale
> LANG=C.UTF-8
> LANGUAGE=
> LC_CTYPE="C.UTF-8"
> LC_NUMERIC="C.UTF-8"
> LC_TIME="C.UTF-8"
> LC_COLLATE="C.UTF-8"
> LC_MONETARY="C.UTF-8"
> LC_MESSAGES="C.UTF-8"
> LC_PAPER="C.UTF-8"
> LC_NAME="C.UTF-8"
> LC_ADDRESS="C.UTF-8"
> LC_TELEPHONE="C.UTF-8"
> LC_MEASUREMENT="C.UTF-8"
> LC_IDENTIFICATION="C.UTF-8"
> LC_ALL=
> {code}
>