[ 
https://issues.apache.org/jira/browse/TIKA-4195?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17816780#comment-17816780
 ] 

Hudson commented on TIKA-4195:
------------------------------

SUCCESS: Integrated in Jenkins build Tika » tika-main-jdk11 #1504 (See 
[https://ci-builds.apache.org/job/Tika/job/tika-main-jdk11/1504/])
TIKA-4195 -- jsoup parser shouldn't conceal backoff to default encoding (#1591) 
(github: 
[https://github.com/apache/tika/commit/455409bf80801152e7c855ddc994fedc32c4cfcf])
* (edit) 
tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-text-module/src/test/java/org/apache/tika/parser/txt/TXTParserTest.java
* (edit) 
tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-html-module/src/test/java/org/apache/tika/parser/html/HtmlParserTest.java
* (edit) tika-core/src/main/java/org/apache/tika/detect/AutoDetectReader.java
* (edit) 
tika-parsers/tika-parsers-standard/tika-parsers-standard-package/src/test/java/org/apache/tika/parser/RecursiveParserWrapperTest.java
* (edit) 
tika-core/src/main/java/org/apache/tika/metadata/TikaCoreProperties.java
* (edit) 
tika-core/src/main/java/org/apache/tika/detect/CompositeEncodingDetector.java


> JSoupParser conceals null from the EncodingDetector
> ---------------------------------------------------
>
>                 Key: TIKA-4195
>                 URL: https://issues.apache.org/jira/browse/TIKA-4195
>             Project: Tika
>          Issue Type: Improvement
>            Reporter: Tim Allison
>            Priority: Minor
>             Fix For: 3.0.0
>
>
> The JSoupParser runs encoding detection on the InputStream. If the result is 
> null, the parser applies the default charset -- US-ASCII. This behavior is 
> ok. 
> The problem is that there is no way to distinguish a faulty encoding 
> detector that reports 'US-ASCII' from the JSoupParser's default fallback. I 
> don't think the JSoupParser should report the fallback encoding as if it were 
> detected.
> I'm not sure how best to report this in the metadata, but we need to be able 
> to differentiate between a detected encoding and a fallback encoding.
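
To illustrate the distinction the issue asks for, here is a minimal Java sketch. It is not Tika's actual API or the change made in the commit above: the class EncodingFallbackSketch, the CharsetChoice type, and the resolve method are hypothetical names invented for this example. It assumes only that a detector returns null when it has no opinion, as the description states.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class EncodingFallbackSketch {

    /** Hypothetical result type: the charset plus whether a detector actually produced it. */
    static final class CharsetChoice {
        final Charset charset;
        final boolean detected;

        CharsetChoice(Charset charset, boolean detected) {
            this.charset = charset;
            this.detected = detected;
        }
    }

    /**
     * Wraps a detector result (null means "no opinion") so that callers can
     * tell a genuine detection from the parser's default.
     */
    static CharsetChoice resolve(Charset detectorResult, Charset fallback) {
        if (detectorResult == null) {
            // Nothing was detected: use the fallback, but record that fact.
            return new CharsetChoice(fallback, false);
        }
        return new CharsetChoice(detectorResult, true);
    }

    public static void main(String[] args) {
        // Same charset in both cases; only the flag tells them apart.
        CharsetChoice fromFallback = resolve(null, StandardCharsets.US_ASCII);
        CharsetChoice fromDetector = resolve(StandardCharsets.US_ASCII, StandardCharsets.US_ASCII);
        System.out.println(fromFallback.charset + " detected=" + fromFallback.detected);
        System.out.println(fromDetector.charset + " detected=" + fromDetector.detected);
    }
}
```

Both calls yield US-ASCII, so the charset alone cannot reveal its origin. A separate flag, or an equivalent metadata entry, is one way to keep that information.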



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
