[ 
https://issues.apache.org/jira/browse/RAT-81?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17900639#comment-17900639
 ] 

ASF subversion and git services commented on RAT-81:
----------------------------------------------------

Commit 0a9559e1d8726ba16c933ec560f7e5d42400360e in creadur-rat's branch 
refs/heads/feature/RAT-397 from Claude Warren
[ https://gitbox.apache.org/repos/asf?p=creadur-rat.git;h=0a9559e1 ]

RAT-81: Fixed encoding issue causing text files to not be read properly (#395)

* Fixed encoding issue where text files not in UTF-8 encoding would not be 
read properly.

Change adds charset to the metadata when it can be discovered.  If not, UTF8 
is returned.

Added integration test RAT-81 to show reading of UTF8 and IBM037 encoding works.

* Minor fixes

* RAT-81: Add changelog about encoding bugfix

* added logging and removed dead code

* fix for RAT-96

Added mediaType and encoding attributes to XML output.
Added updated DefaultAnalyserFactoryTests to account for change
Added integration tests for RAT-147 and RAT-211 based on code in 
DefaultAnalyserFactoryTests
Updated ReportTest to add dependencies and package jar to classpath for test.
Fixed testing issues in Ant unit caused by addition of mediaType and encoding attributes.
renamed reportTest directories to use a '_' rather than a '-' to account for 
java package names.

* RAT-81: groovify the test code, minor fixes

* RAT-81: Add mediaType and encoding to RAT report, minor fixes

---------

Co-authored-by: P. Ottlinger <pottlin...@apache.org>
Co-authored-by: P. Ottlinger <ottlin...@users.noreply.github.com>
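
The charset fallback described in the commit message (use the discovered charset, otherwise UTF-8) can be illustrated with a minimal sketch. It assumes detection by byte-order mark only; the class and method names (CharsetSketch, detectCharset) are hypothetical and are not taken from the RAT code base, and the actual commit may discover the charset by other means.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical sketch, not RAT's implementation: pick a charset from a
// byte-order mark (BOM) and fall back to UTF-8 when none is found.
final class CharsetSketch {

    static Charset detectCharset(byte[] head) {
        if (head.length >= 2) {
            int b0 = head[0] & 0xFF;
            int b1 = head[1] & 0xFF;
            // FE FF (big endian) or FF FE (little endian): the UTF-16
            // charset reads the BOM itself to choose the byte order.
            if ((b0 == 0xFE && b1 == 0xFF) || (b0 == 0xFF && b1 == 0xFE)) {
                return StandardCharsets.UTF_16;
            }
        }
        // Nothing discovered (or a UTF-8 BOM): return UTF-8.
        return StandardCharsets.UTF_8;
    }

    public static void main(String[] args) {
        byte[] utf16le = {(byte) 0xFF, (byte) 0xFE, 0x3C, 0x00};
        byte[] plain = {0x3C, 0x3F, 0x78, 0x6D};
        System.out.println(detectCharset(utf16le));
        System.out.println(detectCharset(plain));
    }
}
```

A BOM check only covers files that carry a signature, such as the UTF16_with_signature.xml file named in the issue; encodings like IBM037 would need a different detection strategy.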

> MalformedInputException thrown when RAT tries reading file
> ----------------------------------------------------------
>
>                 Key: RAT-81
>                 URL: https://issues.apache.org/jira/browse/RAT-81
>             Project: Apache Rat
>          Issue Type: Bug
>          Components: core engine
>    Affects Versions: 0.6, 0.7, 0.11
>         Environment: Linux (Ubuntu) on x86, running with "default" file 
> encoding set to UTF-8
>            Reporter: Marshall Schor
>            Assignee: Claude Warren
>            Priority: Minor
>             Fix For: 0.17
>
>
> To reproduce, set the platform default locale to something that indicates 
> UTF-8 file encoding.
> This causes code in (for example) org.apache.rat.document.impl.FileDocument, 
> which returns a FileReader, to set up RAT with a reader that uses the 
> platform default character encoding (in this case UTF-8).
> If the file being processed is not encoded in UTF-8, the reader may 
> encounter byte sequences that are invalid UTF-8, which causes it to throw 
> a MalformedInputException.
> One case we found:
> The file being examined had invalid UTF-8 encodings.  First, Rat ran the 
> BinaryGuesser - but that returned false because it attempted to read the 
> first 100 or so chars, and got a "MalformedInputException" instead, so the 
> try/catch block just ended up returning "false" (not binary).  Then the 
> HeaderChecker tried to read the file to check the header, and got this same 
> exception - but this time, it made RAT fail.
> Here's the last part of the stack trace:
> Caused by: org.apache.rat.report.RatReportFailedException: Analysis failed
>     at org.apache.rat.report.xml.XmlReport.report(XmlReport.java:66)
>     at org.apache.rat.mp.FilesReportable.run(FilesReportable.java:69)
>     at org.apache.rat.Report.report(Report.java:292)
>     at org.apache.rat.Report.report(Report.java:272)
>     at org.apache.rat.mp.AbstractRatMojo.createReport(AbstractRatMojo.java:341)
>     ... 23 more
> Caused by: org.apache.rat.document.RatDocumentAnalysisException: Cannot analyse header
>     at org.apache.rat.report.analyser.DocumentHeaderAnalyser.analyse(DocumentHeaderAnalyser.java:54)
>     at org.apache.rat.document.impl.util.DocumentAnalyserMultiplexer.analyse(DocumentAnalyserMultiplexer.java:37)
>     at org.apache.rat.document.impl.util.ConditionalAnalyser.matches(ConditionalAnalyser.java:44)
>     at org.apache.rat.document.impl.util.ConditionalAnalyser.analyse(ConditionalAnalyser.java:50)
>     at org.apache.rat.report.xml.XmlReport.report(XmlReport.java:64)
>     ... 27 more
> Caused by: org.apache.rat.analysis.RatHeaderAnalysisException: Cannot read header for /home/tgoetz/tmp/uimaj-2.3.1/uimaj-core/src/test/resources/pearTests/encodingTests/UTF16_with_signature.xml
>     at org.apache.rat.report.analyser.HeaderCheckWorker.read(HeaderCheckWorker.java:96)
>     at org.apache.rat.report.analyser.DocumentHeaderAnalyser.analyse(DocumentHeaderAnalyser.java:50)
>     ... 31 more
> Caused by: sun.io.MalformedInputException
>     at sun.io.ByteToCharUTF8.convert(ByteToCharUTF8.java:294)
>     at sun.nio.cs.StreamDecoder$ConverterSD.convertInto(StreamDecoder.java:316)
>     at sun.nio.cs.StreamDecoder$ConverterSD.implRead(StreamDecoder.java:366)
>     at sun.nio.cs.StreamDecoder.read(StreamDecoder.java:252)
>     at java.io.InputStreamReader.read(InputStreamReader.java:212)
>     at java.io.BufferedReader.fill(BufferedReader.java:157)
>     at java.io.BufferedReader.readLine(BufferedReader.java:320)
>     at java.io.BufferedReader.readLine(BufferedReader.java:383)
>     at org.apache.rat.report.analyser.HeaderCheckWorker.readLine(HeaderCheckWorker.java:111)
>     at org.apache.rat.report.analyser.HeaderCheckWorker.read(HeaderCheckWorker.java:89)
>     ... 32 more
> Work-around: mark these files for explicit exclusion.
> Fix: change the binaryguesser to read the files in binary (not assuming any 
> character coding) and operate with that data.
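
The fix proposed in the issue, reading the file as raw bytes so that no character decoding can fail, might look roughly like the sketch below. The class name, the 100-byte sample size, and the 25% control-byte threshold are illustrative assumptions, not RAT's actual BinaryGuesser logic. InputStream.readNBytes requires Java 9 or later.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical sketch, not RAT's BinaryGuesser: classify a file from raw
// bytes, so no decoder is involved and no MalformedInputException can occur.
final class RawBinaryGuess {

    // Assumed sample size, echoing the "first 100 or so chars" in the report.
    static final int SAMPLE_SIZE = 100;

    // True when more than a quarter of the sampled bytes are control bytes
    // other than tab, line feed, form feed, or carriage return.
    static boolean looksBinary(byte[] sample, int length) {
        int control = 0;
        for (int i = 0; i < length; i++) {
            int b = sample[i] & 0xFF;
            boolean whitespace = b == '\t' || b == '\n' || b == '\f' || b == '\r';
            if (b < 0x20 && !whitespace) {
                control++;
            }
        }
        return control * 4 > length;
    }

    static boolean isBinary(Path file) throws IOException {
        byte[] buffer = new byte[SAMPLE_SIZE];
        try (InputStream in = Files.newInputStream(file)) {
            int read = in.readNBytes(buffer, 0, buffer.length);
            return looksBinary(buffer, read);
        }
    }

    public static void main(String[] args) throws IOException {
        for (String name : args) {
            System.out.println(name + ": binary=" + isBinary(Path.of(name)));
        }
    }
}
```

One caveat with a byte-level heuristic like this: UTF-16 text consisting mostly of ASCII characters has a NUL in every other byte, so a file like the one in the stack trace would be flagged as binary unless its charset is detected first.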



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
