[
https://issues.apache.org/jira/browse/TIKA-2485?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16236050#comment-16236050
]
Hudson commented on TIKA-2485:
------------------------------
SUCCESS: Integrated in Jenkins build Tika-trunk #1382 (See
[https://builds.apache.org/job/Tika-trunk/1382/])
TIKA-2485 -- Allow configuration of markLimit in EncodingDetectors via
(tallison:
[https://github.com/apache/tika/commit/c009dc71cc8428e0a752100af7e9d18d7e5e3096])
* (edit) tika-parsers/src/main/java/org/apache/tika/parser/txt/UniversalEncodingDetector.java
* (edit) tika-parsers/src/main/java/org/apache/tika/parser/txt/Icu4jEncodingDetector.java
* (edit) tika-parsers/src/test/java/org/apache/tika/config/TikaEncodingDetectorTest.java
* (edit) CHANGES.txt
* (add) tika-parsers/src/test/resources/org/apache/tika/config/TIKA-2485-encoding-detector-mark-limits.xml
* (edit) tika-parsers/src/main/java/org/apache/tika/parser/txt/CharsetDetector.java
* (edit) tika-parsers/src/main/java/org/apache/tika/parser/html/HtmlEncodingDetector.java
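For readers unfamiliar with why a mark limit matters here, the following is a rough Python sketch, not Tika code. It assumes a detector that only inspects the first mark_limit bytes of a stream, and it builds a made-up page resembling the one described in the issue below: about 32,000 ASCII characters of script followed by accented text. The names visible_prefix, has_non_ascii, and page are hypothetical and exist only for this illustration.

```python
def visible_prefix(data: bytes, mark_limit: int) -> bytes:
    """Return the bytes a detector would see if it reads at most mark_limit bytes."""
    return data[:mark_limit]


def has_non_ascii(data: bytes) -> bool:
    """True if any byte falls outside the 7-bit ASCII range."""
    return any(b >= 0x80 for b in data)


# Hypothetical page: a long ASCII-only <script/> block, then UTF-8 accented text.
page = (
    "<html><head><script>" + "x" * 32000 + "</script></head>"
    "<body>få</body></html>"
).encode("utf-8")

small_window = visible_prefix(page, 8192)    # limit mentioned in the issue text
large_window = visible_prefix(page, 65536)   # an arbitrary larger limit

print(has_non_ascii(small_window))  # False: only ASCII script is visible
print(has_non_ascii(large_window))  # True: the accented bytes are now in range
```

With the small window the detector sees nothing but ASCII, so it has no evidence to tell UTF-8 apart from a single-byte encoding such as windows-1252. Raising the limit is one way to bring the distinguishing bytes into view.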
> EncodingDetectors markLimits to be configurable
> -----------------------------------------------
>
> Key: TIKA-2485
> URL: https://issues.apache.org/jira/browse/TIKA-2485
> Project: Tika
> Issue Type: Improvement
> Components: detector
> Affects Versions: 1.16
> Reporter: Markus Jelsma
> Assignee: Tim Allison
> Priority: Minor
> Fix For: 1.17
>
>
> Tim's response to my question:
> -----Original message-----
> > From:Allison, Timothy B. <[email protected]>
> > Sent: Friday 27th October 2017 14:53
> > To: [email protected]
> > Subject: RE: Incorrect encoding detected
> >
> > Hi Markus,
> >
> > My guess is that the ~32,000 characters of mostly ascii-ish <script/> are
> > what is actually being used for encoding detection. The
> > HtmlEncodingDetector only looks in the first 8,192 characters, and the
> > other encoding detectors have similar (but longer?) restrictions.
> >
> > At some point, I had a dev version of a stripper that removed contents of
> > <script/> and <style/> before trying to detect the encoding[0]...perhaps it
> > is time to resurrect that code and integrate it?
> >
> > Or, given that HTML has been, um, blossoming, perhaps, more simply, we
> > should expand how far we look into a stream for detection?
> >
> > Cheers,
> >
> > Tim
> >
> > [0] https://issues.apache.org/jira/browse/TIKA-2038
> >
> >
> > -----Original Message-----
> > From: Markus Jelsma [mailto:[email protected]]
> > Sent: Friday, October 27, 2017 8:39 AM
> > To: [email protected]
> > Subject: Incorrect encoding detected
> >
> > Hello,
> >
> > We have a problem with Tika, encoding and pages on this website:
> > https://www.aarstiderne.com/frugt-groent-og-mere/mixkasser
> >
> > Using Nutch and Tika 1.12, but also using Tika 1.16, we found out that the
> > regular HTML parser does a fine job, but our TikaParser has a tough job
> > dealing with this HTML. For some reason Tika reports
> > Content-Encoding=windows-1252 for this webpage, even though the
> > page identifies itself properly as UTF-8.
> >
> > Of all websites we index, this is so far the only one giving trouble
> > indexing accents, getting fÃ¥ instead of a regular få.
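The garbled fÃ¥ reported above is what UTF-8 bytes look like when they are decoded as windows-1252. A small Python sketch reproduces it; the variable names are illustrative only and nothing here is taken from Tika or Nutch.

```python
text = "få"

# "å" (U+00E5) is encoded in UTF-8 as the two bytes 0xC3 0xA5.
utf8_bytes = text.encode("utf-8")

# Decoding those bytes as windows-1252 maps each byte to its own character:
# 0xC3 becomes "Ã" and 0xA5 becomes "¥".
misread = utf8_bytes.decode("windows-1252")

print(utf8_bytes)  # b'f\xc3\xa5'
print(misread)     # fÃ¥

# Reversing the wrong decode recovers the original text.
print(misread.encode("windows-1252").decode("utf-8"))  # få
```

The round trip in the last line works only because every byte involved has a windows-1252 mapping; it is shown to confirm the diagnosis, not as a general repair method.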
--
This message was sent by Atlassian JIRA
(v6.4.14#64029)