[ 
https://issues.apache.org/jira/browse/RAT-147?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17900641#comment-17900641
 ] 

ASF subversion and git services commented on RAT-147:
-----------------------------------------------------

Commit 0a9559e1d8726ba16c933ec560f7e5d42400360e in creadur-rat's branch 
refs/heads/feature/RAT-397 from Claude Warren
[ https://gitbox.apache.org/repos/asf?p=creadur-rat.git;h=0a9559e1 ]

RAT-81: Fixed encoding issue causing text files to not be read properly (#395)

* Fixed encoding issue where text files not in UTF-8 encoding would not be 
read properly.

Change adds charset to the metadata when it can be discovered.  If it cannot, 
UTF-8 is returned.

Added integration test RAT-81 to show that reading UTF-8 and IBM037 encodings works.

* Minor fixes

* RAT-81: Add changelog about encoding bugfix

* added logging and removed dead code

* fix for RAT-96

Added mediaType and encoding attributes to XML output.
Updated DefaultAnalyserFactoryTests to account for the change.
Added integration tests for RAT-147 and RAT-211 based on code in 
DefaultAnalyserFactoryTests.
Updated ReportTest to add dependencies and package jar to classpath for test.
Fixed testing issues in Ant unit caused by addition of mediaType and encoding 
attributes.
Renamed reportTest directories to use a '_' rather than a '-' to account for 
Java package names.

* RAT-81: groovify the test code, minor fixes

* RAT-81: Add mediaType and encoding to RAT report, minor fixes

---------

Co-authored-by: P. Ottlinger <pottlin...@apache.org>
Co-authored-by: P. Ottlinger <ottlin...@users.noreply.github.com>
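
The fallback rule in the commit message (record the charset when it can be 
discovered, otherwise return UTF-8) can be illustrated with a minimal sketch. 
This is not the detection logic from the commit: the class and method names 
are hypothetical, and byte-order-mark sniffing is used here only as a simple 
stand-in for whatever detector the project actually uses.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

/** Hypothetical sketch: discover a charset from a byte-order mark, else default to UTF-8. */
public class CharsetFallbackSketch {

    /** Returns the charset implied by a leading BOM, or UTF-8 when none is found. */
    static Charset detectOrDefault(byte[] head) {
        if (startsWith(head, 0xEF, 0xBB, 0xBF)) {
            return StandardCharsets.UTF_8;
        }
        if (startsWith(head, 0xFE, 0xFF)) {
            return StandardCharsets.UTF_16BE;
        }
        // Simplification: FF FE is also the start of a UTF-32LE BOM, which this sketch ignores.
        if (startsWith(head, 0xFF, 0xFE)) {
            return StandardCharsets.UTF_16LE;
        }
        // Nothing could be discovered, so fall back to UTF-8.
        return StandardCharsets.UTF_8;
    }

    /** True when data begins with the given unsigned byte values. */
    static boolean startsWith(byte[] data, int... prefix) {
        if (data.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if ((data[i] & 0xFF) != prefix[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        byte[] withBom = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 'a'};
        byte[] plain = "plain text".getBytes(StandardCharsets.US_ASCII);
        System.out.println(detectOrDefault(withBom));
        System.out.println(detectOrDefault(plain));
    }
}
```

The point of the sketch is only that the result no longer depends on the 
platform default encoding: a file without any recognisable marker always maps 
to the same charset.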

> binary guesser design improvement
> ---------------------------------
>
>                 Key: RAT-147
>                 URL: https://issues.apache.org/jira/browse/RAT-147
>             Project: Apache Rat
>          Issue Type: Improvement
>    Affects Versions: 0.8
>            Reporter: Marshall Schor
>            Assignee: Claude Warren
>            Priority: Minor
>             Fix For: 0.17
>
>         Attachments: unix-newlines.txt.bin, windows-newlines.txt.bin
>
>
> A release manager cut a release; RAT was run, all was OK.  Another user tried 
> building from source / tag, and RAT complained of 2 files missing headers.  
> This was traced to the "binary guesser" which read the 1st 200 bytes of a 
> file and "guessed" if it was binary.  The file in question had a UTF-8 
> byte-order mark at the beginning, and was, in fact after that, plain ASCII.  
> The reason for 2 different results: the release manager's OS had a default 
> file encoding set to US-ASCII (as determined by running a small Java program 
> that prints out the value of System.getProperty("file.encoding")).  This encoding 
> is for 7-bit ASCII, so the guesser when decoding this gets a malformed 
> exception on the 3 bytes at the beginning of the file.  This causes the 
> guesser to conclude this is a "binary" file which doesn't need to be 
> RAT-checked.  The other user was on a Windows 7 machine, which has the 
> file.encoding defaulting to Cp1252 - which does have code points defined for 
> the first 3 bytes, and therefore doesn't throw any exception.  This makes the 
> guesser guess that this isn't a binary file, and it checks the file and 
> reports a missing header (the file is test data...).
> Workaround - add the file to the explicit excludes.
> Potential problem - on a machine with default encoding US-ASCII, RAT will 
> improperly skip checking files which perhaps should have headers, if they 
> have a UTF-8 byte-order mark.
> Potential problem #2 - RAT is dependent on the default file encoding setting 
> for part of its behavior, causing differences in what it checks.
> I'm not sure what a good solution would be here.  It might range from 
> eliminating the binary "guesser" that looks at the first 200 bytes of a file, 
> to forcing UTF-8 as the charset to use.
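
The platform dependence described in the issue can be reproduced with a small 
sketch. It is not Rat's binary guesser; the class name, method name and sample 
bytes are hypothetical. It only assumes, as the report describes, that a 
decoding error on the leading bytes is what makes a file look "binary", and it 
decodes the same UTF-8 BOM plus ASCII text under the two default encodings 
mentioned.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

/** Hypothetical demo: the same bytes decode or fail depending on the charset used. */
public class BomGuessDemo {

    /** UTF-8 byte-order mark (EF BB BF) followed by plain ASCII text. */
    static final byte[] SAMPLE = {
        (byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 'h', 'e', 'a', 'd', 'e', 'r'
    };

    /** Returns true if the bytes decode without a malformed or unmappable error. */
    static boolean decodesCleanly(byte[] bytes, Charset charset) {
        CharsetDecoder decoder = charset.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            decoder.decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            // A strict decoder rejects the bytes; a guesser of the kind described
            // in the issue would treat this as a sign of binary content.
            return false;
        }
    }

    public static void main(String[] args) {
        // 7-bit US-ASCII cannot represent the BOM bytes.
        System.out.println(decodesCleanly(SAMPLE, StandardCharsets.US_ASCII));        // false
        // windows-1252 (Cp1252) maps all three BOM bytes to characters.
        System.out.println(decodesCleanly(SAMPLE, Charset.forName("windows-1252")));  // true
        // Decoding explicitly as UTF-8 succeeds regardless of the platform default.
        System.out.println(decodesCleanly(SAMPLE, StandardCharsets.UTF_8));           // true
    }
}
```

Passing an explicit charset, rather than relying on the value of 
file.encoding, is what makes the outcome identical on both machines.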



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
