[ 
https://issues.apache.org/jira/browse/TIKA-2041?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15393909#comment-15393909
 ] 

Florian Leitner commented on TIKA-2041:
---------------------------------------

Nope, that is exactly the same problem I was seeing, and to my best knowledge 
there is no other case.

In single-threaded mode haveC1bytes is always false; Therefore, my suspicion is 
 that something (unshape?) triggers an update of the various mutable state 
values in CharsetDetector, and by the time the CharsetDetector instance checks 
the result of the CharsetRecognizer, instead of returning the correct Latin 
encoding name, instead it returns the Windows variant, because haveC1Bytes has 
(wrongly) been changed and set to true. (BTW, this is doubly annoying because 
even for ASCII-only documents, Latin-1 is the default result, but instead 
leading to the "windows-1252" variant being detected.)

> Charset detection doesn't appear to be thread-safe
> --------------------------------------------------
>
>                 Key: TIKA-2041
>                 URL: https://issues.apache.org/jira/browse/TIKA-2041
>             Project: Tika
>          Issue Type: Bug
>            Reporter: Tim Allison
>
> On the user list, Christian Leitinger noted that his team found a potential 
> issue with the thread safety of the encoding detector.  I was able to 
> reproduce this with on the corpus of html files in [~faghani]'s encoding 
> detector.
> {noformat}
>     @Test
>     public void testMultiThreadingEncodingDetection() throws Exception {
>         Path testDocs = Paths.get("C:/data/encodings/corpus");
>         List<Path> paths = new ArrayList<>();
>         Map<Path, String> encodings = new ConcurrentHashMap<>();
>         for (File encodingDirs : testDocs.toFile().listFiles()) {
>             for (File file : encodingDirs.listFiles()) {
>                     String encoding = getEncoding(file.toPath());
>                     paths.add(file.toPath());
>                     encodings.put(file.toPath(), encoding);
>             }
>         }
>         int numThreads = 1000;
>         ExecutorService ex = Executors.newFixedThreadPool(numThreads);
>         CompletionService<String> completionService =
>                 new ExecutorCompletionService<>(ex);
>         for (int i = 0; i < numThreads; i++) {
>             completionService.submit(new EncodingDetectorRunner(paths, 
> encodings), "done");
>         }
>         int completed = 0;
>         while (completed < numThreads) {
>             Future<String> future = completionService.take();
>             if (future.isDone() && "done".equals(future.get())) {
>                 completed++;
>             }
>         }
>         assertTrue("success!", true);
>     }
>     private class EncodingDetectorRunner implements Runnable {
>         private final List<Path> paths;
>         private final Map<Path, String> encodings;
>         private final Random r = new Random();
>         private EncodingDetectorRunner(List<Path> paths, Map<Path, String> 
> encodings) {
>             this.paths = paths;
>             this.encodings = encodings;
>         }
>         @Override
>         public void run() {
>             for (int i = 0; i < 100; i++) {
>                 int pInd = r.nextInt(paths.size());
>                 String detectedEncoding = null;
>                 try {
>                     detectedEncoding = getEncoding(paths.get(pInd));
>                 } catch (Exception e) {
>                     throw new RuntimeException(e);
>                 }
>                 String trueEncoding = encodings.get(paths.get(pInd));
>                 if (! detectedEncoding.equals(trueEncoding)) {
>                     throw new RuntimeException("detected: " + 
> detectedEncoding +
>                             " but should have been: "+trueEncoding + " for " 
> + paths.get(pInd));
>                 }
>             }
>         }
>     }
>     public String getEncoding(Path p) throws Exception {
>         try (InputStream is = TikaInputStream.get(p)) {
>             AutoDetectReader reader = new AutoDetectReader(is);
>             String val = reader.getCharset().toString();
>             if (val == null) {
>                 return "NULL";
>             } else {
>                 return val;
>             }
>         }
>     }
> {noformat}
> yields:
> {noformat}
> ava.util.concurrent.ExecutionException: java.lang.RuntimeException: detected: 
> ISO-8859-1 but should have been: windows-1252 for 
> C:\data\encodings\corpus\Shift_JIS\1
>       at java.util.concurrent.FutureTask.report(FutureTask.java:122)
>       at java.util.concurrent.FutureTask.get(FutureTask.java:192)
>       at 
> org.apache.tika.parser.html.HtmlParserTest.testMultiThreadingEncodingDetection(HtmlParserTest.java:1213)
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Reply via email to