[ 
https://issues.apache.org/jira/browse/TIKA-4195?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison updated TIKA-4195:
------------------------------
    Description: 
The JSoupParser runs encoding detection on the InputStream. If the result is 
null, the parser applies the default charset -- US-ASCII. This behavior is ok. 

The problem is that there is no way to tell whether a reported 'US-ASCII' came 
from a (possibly faulty) encoding detector or from the JSoupParser's default 
fallback. I don't think the JSoupParser should report the fallback encoding as 
if it were detected.

I'm not sure how best to report this in the metadata, but we need to be able to 
differentiate between a detected encoding and the fallback encoding.
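
To illustrate the distinction asked for above, here is a minimal sketch in plain 
Java. It is not Tika code: EncodingFallbackSketch, EncodingResult, resolve and 
the Supplier standing in for an encoding detector are all hypothetical names, 
and the only assumption taken from the issue is that the fallback is US-ASCII. 
The idea is simply to carry a flag alongside the charset so that a null 
detection result is not silently turned into a "detected" value.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.function.Supplier;

public class EncodingFallbackSketch {

    // Hypothetical result type: the charset to use, plus whether a detector
    // actually produced it (true) or the parser fell back to its default (false).
    public static final class EncodingResult {
        public final Charset charset;
        public final boolean detected;

        EncodingResult(Charset charset, boolean detected) {
            this.charset = charset;
            this.detected = detected;
        }
    }

    // Fallback charset as described in the issue.
    public static final Charset FALLBACK = StandardCharsets.US_ASCII;

    // 'detector' is a stand-in for any encoding detector that may return null.
    public static EncodingResult resolve(Supplier<Charset> detector) {
        Charset found = detector.get();
        if (found == null) {
            // Nothing detected: use the fallback, but record that it is a fallback.
            return new EncodingResult(FALLBACK, false);
        }
        return new EncodingResult(found, true);
    }

    public static void main(String[] args) {
        EncodingResult fallback = resolve(() -> null);
        EncodingResult alleged = resolve(() -> StandardCharsets.US_ASCII);
        // Same charset in both cases; only the flag tells them apart.
        System.out.println(fallback.charset + " detected=" + fallback.detected);
        System.out.println(alleged.charset + " detected=" + alleged.detected);
    }
}
```

How such a flag would be surfaced in the metadata (a separate key, a different 
key name for the fallback, or something else) is left open here, as it is in 
the issue itself.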

  was:
The JSoupParser is runs encoding detection on the inputstream. If the result is 
null, the parser applies the default charset -- US-ASCII. This behavior is ok. 

The problem is that there is no way to distinguish when a faulty encoding 
detector alleges 'US-ASCII' and the default behavior of the JSoupParser. I 
don't think the JSoupParser should report the fallback encoding as if it were 
detected.

I'm not sure how best to report this in the metadata, but we need to be able to 
differentiate detection and fallback encoding.


> JSoupParser conceals null from the EncodingDetector
> ---------------------------------------------------
>
>                 Key: TIKA-4195
>                 URL: https://issues.apache.org/jira/browse/TIKA-4195
>             Project: Tika
>          Issue Type: Improvement
>            Reporter: Tim Allison
>            Priority: Minor
>             Fix For: 3.0.0


