PeterAlfredLee commented on pull request #338:
URL: https://github.com/apache/tika/pull/338#issuecomment-673823280
As [TIKA-2421](https://issues.apache.org/jira/browse/TIKA-2421) says, and
according to the [W3C
description](https://www.w3.org/International/questions/qa-html-encoding-declarations#utf16),
we should read the HTML byte order mark (BOM) first.
If there is no BOM, that means the encoding is ASCII-compatible, so we can
read the HTML meta tag as ASCII and get the charset from it.
HtmlEncodingDetector does not read the BOM first; it assumes the HTML meta
tag is ASCII-compatible.
StandardHtmlEncodingDetector reads the BOM first, then reads the metadata if
there is no BOM, and then reads the meta tag if the metadata has no charset.
So I think using StandardHtmlEncodingDetector is more compliant with the W3C
standard.
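
To make the order concrete, here is a rough sketch of BOM-first detection using only the JDK. It is not the actual StandardHtmlEncodingDetector code: the class name `BomFirstSketch`, the `detect` signature, and the simplified meta-tag regex are all assumptions made for illustration, and only the UTF-8 and UTF-16 BOMs are checked.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration only; this is NOT Tika's StandardHtmlEncodingDetector.
class BomFirstSketch {

    // Simplified pattern (an assumption) covering <meta charset="...">
    // and <meta http-equiv="Content-Type" content="...; charset=...">.
    private static final Pattern META_CHARSET = Pattern.compile(
            "<meta[^>]*?charset\\s*=\\s*[\"']?([A-Za-z0-9._:-]+)",
            Pattern.CASE_INSENSITIVE);

    /** Returns the detected charset, or null if nothing was found. */
    static Charset detect(byte[] head, String metadataCharset) {
        // 1. A BOM takes precedence over everything else.
        if (startsWith(head, 0xEF, 0xBB, 0xBF)) return StandardCharsets.UTF_8;
        if (startsWith(head, 0xFE, 0xFF)) return StandardCharsets.UTF_16BE;
        if (startsWith(head, 0xFF, 0xFE)) return StandardCharsets.UTF_16LE;

        // 2. No BOM: use the charset from the metadata (e.g. an HTTP header), if any.
        Charset fromMetadata = forNameOrNull(metadataCharset);
        if (fromMetadata != null) return fromMetadata;

        // 3. Still nothing: the markup is treated as ASCII-compatible,
        //    so the meta tag is read as ASCII.
        String text = new String(head, StandardCharsets.US_ASCII);
        Matcher m = META_CHARSET.matcher(text);
        return m.find() ? forNameOrNull(m.group(1)) : null;
    }

    // True if data begins with the given unsigned byte values.
    private static boolean startsWith(byte[] data, int... prefix) {
        if (data.length < prefix.length) return false;
        for (int i = 0; i < prefix.length; i++) {
            if ((data[i] & 0xFF) != prefix[i]) return false;
        }
        return true;
    }

    // Resolves a charset name, returning null for missing, illegal, or unsupported names.
    private static Charset forNameOrNull(String name) {
        if (name == null || name.isEmpty()) return null;
        try {
            return Charset.forName(name.trim());
        } catch (IllegalArgumentException e) {
            return null;
        }
    }
}
```

With this ordering, a UTF-16 page that starts with a BOM is never scanned as ASCII, which is the case where a meta-tag-first approach can go wrong.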
The only problem I can see is that StandardHtmlEncodingDetector treats
ISO-8859-1 as Windows-1252; I have modified that in this PR.
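
For context on why that mapping is visible to users, the two charsets differ only for bytes 0x80 to 0x9F. A small JDK-only sketch follows. The class and method names are made up for illustration, and it assumes the JVM ships the `windows-1252` charset, which standard JDK builds do.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration; not part of Tika or of this PR.
class Latin1VersusCp1252 {

    // Assumption: the running JVM supports windows-1252.
    static final Charset WINDOWS_1252 = Charset.forName("windows-1252");

    // Decodes a single byte value (0..255) with the given charset.
    static String decodeByte(int value, Charset charset) {
        return new String(new byte[] {(byte) value}, charset);
    }

    // True if the two charsets decode this byte value to different text.
    static boolean differs(int value) {
        return !decodeByte(value, StandardCharsets.ISO_8859_1)
                .equals(decodeByte(value, WINDOWS_1252));
    }
}
```

For example, byte 0x80 decodes to the euro sign in Windows-1252 but to a C1 control character in ISO-8859-1, while 0xE9 is "é" in both.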
So I think we can make StandardHtmlEncodingDetector the default detector.
Or we can modify HtmlEncodingDetector to comply with the W3C standard. WDYT?