Re: [PR] RAT-54: Tika based document analyzer [creadur-rat]

2024-04-27 Thread via GitHub


Claudenw commented on code in PR #240:
URL: https://github.com/apache/creadur-rat/pull/240#discussion_r1582025702


##
apache-rat-core/src/main/java/org/apache/rat/Defaults.java:
##
@@ -57,6 +62,10 @@ public class Defaults {
 public static final String UNAPPROVED_LICENSES_STYLESHEET = 
"org/apache/rat/unapproved-licenses.xsl";
 
 private final LicenseSetFactory setFactory;
+
+private final FilenameFilter filesToIgnore = 
WildcardFileFilter.builder().setWildcards("*.json").setIoCase(IOCase.INSENSITIVE).get();
+
+private final IOFileFilter directoriesToIgnore = 
NameBasedHiddenFileFilter.HIDDEN;

Review Comment:
   Defaults is intended to be the System defaults for the ReportConfiguration.  
There are some cases where the report option has to be set by the UI before the 
Defaults can be tested.  And there is a flag for no defaults so the Defaults 
need to be specified outside of the ReportConfiguration initialization.  
   
   I changed the description of Defaults and updated the values and methods to 
be static.
   
   I also added a checklist at the top of this ticket to track the things we 
need to update, as I suspect it is going to get longish.  Feel free to add to 
it.  I think items on the list can be closed if we account for them in this 
change or open a ticket to track them for a new change.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@creadur.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Created] (WHISKER-25) Unable to update to latest ASF parent v32 due to dependency change

2024-04-27 Thread Philipp Ottlinger (Jira)
Philipp Ottlinger created WHISKER-25:


 Summary: Unable to update to latest ASF parent v32 due to 
dependency change
 Key: WHISKER-25
 URL: https://issues.apache.org/jira/browse/WHISKER-25
 Project: Apache Whisker
  Issue Type: Bug
Reporter: Philipp Ottlinger


Similar to TENTACLES-19, the new ASF parent v32 seems to have brought in changes 
that broke the build:


Caused by: org.apache.maven.plugin.PluginContainerException: A required class 
was missing while executing 
org.apache.maven.plugins:maven-jar-plugin:3.4.0:jar: 
org/apache/commons/io/file/attribute/FileTimes

This is related to a mix of commons v2/v3 JARs on the classpath!




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


Re: [PR] WHISKER-25: Bump org.apache:apache from 31 to 32 [creadur-whisker]

2024-04-27 Thread via GitHub


ottlinger commented on PR #142:
URL: https://github.com/apache/creadur-whisker/pull/142#issuecomment-2081196280

   Seems to be a dependency problem with v2/v3 of commons-lang





Re: [PR] TENTACLES-19: Bump org.apache:apache from 31 to 32 [creadur-tentacles]

2024-04-27 Thread via GitHub


ottlinger commented on PR #119:
URL: https://github.com/apache/creadur-tentacles/pull/119#issuecomment-2081191802

   Problem seems to be related to velocity dependency that pulls in commons-v2, 
while new ASF parent sets v3 as default version.





Re: [PR] RAT-54: Tika based document analyzer [creadur-rat]

2024-04-27 Thread via GitHub


ottlinger commented on code in PR #240:
URL: https://github.com/apache/creadur-rat/pull/240#discussion_r1581906723


##
apache-rat-core/src/main/java/org/apache/rat/report/claim/ClaimStatistic.java:
##
@@ -57,45 +58,71 @@ public int getCounter(Counter counter) {
 return count == null ? 0 : count[0];
 }
 
-/**
- * @return Returns a map with the file types. The map keys
- * are file type names and the map values
- * are integers with the number of resources matching
- * the file type.
- */
-public Map getCounterMap() {
-return counterMap;
+public void incCounter(Counter key, int value) {
+final int[] num = counterMap.get(key);
+
+if (num == null) {
+counterMap.put(key, new int[] { value });
+} else {
+num[0] += value;
+}
 }
 
-
 /**
- * @return Returns a map with the file types. The map keys
- * are file type names and the map values
- * are integers with the number of resources matching
- * the file type.
+ * Returns the counts for the counter.
+ * @param documentType the document type to get the counter for.
+ * @return Returns the number of files with approved licenses.
  */
-public Map getDocumentCategoryMap() {
-return documentCategoryMap;
+public int getCounter(Document.Type documentType) {
+int[] count = documentCategoryMap.get(documentType);
+return count == null ? 0 : count[0];
 }
 
-/**
- * @return Returns a map with the license family codes. The map
- * keys are license family category names,
- * the map values are integers with the number of resources
- * matching the license family code.
- */
-public Map getLicenseFamilyCodeMap() {
-return licenseFamilyCodeMap;
+public void incCounter(Document.Type documentType, int value) {
+final int[] num = documentCategoryMap.get(documentType);
+
+if (num == null) {
+documentCategoryMap.put(documentType, new int[] { value });
+} else {
+num[0] += value;
+}
 }
 
-/**
- * @return Returns a map with the license family codes. The map
- * keys are the names of the license families and
- * the map values are integers with the number of resources
- * matching the license family name.
- */
-public Map getLicenseFileNameMap() {
-return licenseFamilyNameMap;
+public int getLicenseFamilyCount(String licenseFamilyName) {
+int[] count = licenseFamilyCodeMap.get(licenseFamilyName);
+return count == null ? 0 : count[0];
+}
+
+public void incLicenseFamilyCount(String licenseFamilyName, int value) {
+final int[] num = licenseFamilyCodeMap.get(licenseFamilyName);
+
+if (num == null) {
+licenseFamilyCodeMap.put(licenseFamilyName, new int[] { value });
+} else {
+num[0] += value;
+}
 }
 
+public Set getLicenseFamilyNames() {
+return Collections.unmodifiableSet(licenseFamilyCodeMap.keySet());
+}
+
+public Set getLicenseFileNames() {
+return Collections.unmodifiableSet(licenseFamilyNameMap.keySet());
+}
+
+public int getLicenseFileNameCount(String licenseFilename) {
+int[] count = licenseFamilyNameMap.get(licenseFilename);
+return count == null ? 0 : count[0];
+}
+
+public void incLicenseFileNameCount(String licenseFileNameName, int value) 
{
+final int[] num = licenseFamilyNameMap.get(licenseFileNameName);
+
+if (num == null) {
+licenseFamilyNameMap.put(licenseFileNameName, new int[] { value });
+} else {
+num[0] += value;

Review Comment:
   AtomicInteger?



##
apache-rat-core/src/main/java/org/apache/rat/walker/DirectoryWalker.java:
##
@@ -38,56 +38,42 @@ public class DirectoryWalker extends Walker implements 
IReportable {
 
 private static final FileNameComparator COMPARATOR = new 
FileNameComparator();
 
-private final IOFileFilter directoryFilter;
-
-/**
- * Constructs a walker.
- *
- * @param file the directory to walk.
- * @param directoryFilter directory filter to eventually exclude some 
directories/files from the scan.
- */
-public DirectoryWalker(File file, IOFileFilter directoryFilter) {
-this(file, (FilenameFilter) null, directoryFilter);
-}
+private final IOFileFilter directoriesToIgnore;
 
 /**
  * Constructs a walker.
  *
  * @param file the directory to walk (not null).
- * @param filter filters input files (optional),
+ * @param filesToIgnore filters input files (optional),
  *   or null when no filtering should be performed
- * @param directoryFilter filters directories (optional), or null when no 
filtering should be performed.
+ * @param directoriesToIgnore filters directories (optional), or null when 
no 

[jira] [Updated] (RAT-150) RAT should use Apache Tika to simply guess ignored [application/X] file types and focus on the [text/Y] family as a sensible default

2024-04-27 Thread Philipp Ottlinger (Jira)


 [ 
https://issues.apache.org/jira/browse/RAT-150?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Philipp Ottlinger updated RAT-150:
--
Fix Version/s: 0.17

> RAT should use Apache Tika to simply guess ignored [application/X] file types 
> and focus on the [text/Y] family as a sensible default
> 
>
> Key: RAT-150
> URL: https://issues.apache.org/jira/browse/RAT-150
> Project: Apache Rat
>  Issue Type: New Feature
>  Components: mime-meta-data, scan
>Affects Versions: 0.8
>Reporter: Chris A. Mattmann
>Assignee: Claude Warren
>Priority: Major
> Fix For: 0.17
>
>
> RAT could use Apache Tika to automatically guess file types, obviating the 
> need to specify an explicit white list or black list.





[jira] [Updated] (RAT-211) Generated rat-output.xml must be well-formed, even if BinaryGuesser fails

2024-04-27 Thread Philipp Ottlinger (Jira)


 [ 
https://issues.apache.org/jira/browse/RAT-211?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Philipp Ottlinger updated RAT-211:
--
Fix Version/s: 0.17

> Generated rat-output.xml must be well-formed, even if BinaryGuesser fails
> -
>
> Key: RAT-211
> URL: https://issues.apache.org/jira/browse/RAT-211
> Project: Apache Rat
>  Issue Type: Bug
>Reporter: Konstantin Kolinko
>Assignee: Claude Warren
>Priority: Major
> Fix For: 0.17
>
> Attachments: rat-output.xml
>
>
> This issue was originally reported by Infrastructure team while running RAT 
> over Apache Tomcat source code, see thread
> "Files to exclude from buildbot rat tests" (started 2016-02-15) at dev "at" 
> tomcat.apache.org mailing list. (1)
> The issue:
> ===
> 1. Buildbot at ASF is configured to run RAT tool over tomcat-trunk, tomcat-8, 
> tomcat-7 source code.
> 2. Tomcat has \*.bmp, \*.dia files in its source code (images used by Windows 
> installer, diagrams in documentation) that RAT failed to recognize as binary.
> 3. RAT generated rat-output.xml file that included header-sample fragments of 
> those *.bmp and *.dia files. Those fragments are actually binary garbage.  
> The result is that a broken XML file was generated.
> 4. XSLT transformation from rat-output.xml into rat-output.html failed.
> I have not seen the actual error printed by XSLT processor, but I confirmed 
> that the file is broken by downloading rat-output.xml and opening it in 
> Firefox. Firefox reported a syntax error.
> Workaround:
> ===
> rat-excludes.txt file in Tomcat source code was updated to exclude
> \*\*/\*.bmp
> \*\*/\*.dia
> References:
> ===
> 1. "Files to exclude from buildbot rat tests" (started 2016-02-15) at dev 
> "at" tomcat.apache.org mailing list.
> http://markmail.org/message/rhrm54ch5omjalt4
> 2. Apache Tomcat links to Buildbot results:
> http://tomcat.apache.org/ci.html#Buildbot
> 3. Apache Tomcat source code
> http://tomcat.apache.org/svn.html
> Notes:
> - RAT excludes files in Tomcat source code are at
> res/rat/rat-excludes.txt
> - I know that Buildbot uses Ant to run RAT. The Ant project file for that is 
> not in Tomcat sources, but in Infrastructure configuration (I do not have a 
> link). It can be seen in "shell_5 RAT Report Complete" step during build run. 
> E.g. here:
> https://ci.apache.org/builders/tomcat-trunk/builds/1061
> - I do not know what version of RAT is used by that build slave on Buildbot.
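
A minimal sketch of one way to keep the report well-formed when binary bytes leak into a header sample: replace every code point that XML 1.0 does not allow as a character before the sample is written. The class and method names are hypothetical and are not taken from the RAT code base; the allowed ranges are those of the XML 1.0 Char production. Markup characters such as < and & would still need the usual escaping afterwards.

```java
final class XmlSampleSanitizer {

    /** True if the code point is a legal XML 1.0 character (the Char production). */
    static boolean isXml10Char(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
                || (cp >= 0x20 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD)
                || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    /** Returns the input with every illegal XML 1.0 character replaced by '?'. */
    static String sanitize(String sample) {
        StringBuilder out = new StringBuilder(sample.length());
        sample.codePoints().forEach(cp -> {
            if (isXml10Char(cp)) {
                out.appendCodePoint(cp);
            } else {
                out.append('?');
            }
        });
        return out.toString();
    }

    public static void main(String[] args) {
        // "BM" is how a .bmp file starts; the NUL and 0x01 characters are illegal in XML 1.0.
        String headerSample = "BM\u0000\u0001ok";
        System.out.println(sanitize(headerSample));
    }
}
```

Replacing rather than dropping the offending characters keeps the sample length stable, which is a design choice of this sketch and not something the report above prescribes.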





[jira] [Updated] (RAT-147) binary guesser design improvement

2024-04-27 Thread Philipp Ottlinger (Jira)


 [ 
https://issues.apache.org/jira/browse/RAT-147?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Philipp Ottlinger updated RAT-147:
--
Fix Version/s: 0.17

> binary guesser design improvement
> -
>
> Key: RAT-147
> URL: https://issues.apache.org/jira/browse/RAT-147
> Project: Apache Rat
>  Issue Type: Improvement
>Affects Versions: 0.8
>Reporter: Marshall Schor
>Assignee: Claude Warren
>Priority: Minor
> Fix For: 0.17
>
> Attachments: unix-newlines.txt.bin, windows-newlines.txt.bin
>
>
> A release manager cut a release; RAT was run, all was OK.  Another user tried 
> building from source / tag, and RAT complained of 2 files missing headers.  
> This was traced to the "binary guesser" which read the 1st 200 bytes of a 
> file and "guessed" if it was binary.  The file in question had a UTF-8 
> byte-order mark at the beginning, and was, in fact after that, plain ASCII.  
> The reason for 2 different results: the release manager's OS had a default 
> file encoding set to US-ASCII (as determined by running a small Java program 
> that prints out the value of System.getProperty("file.encoding")).  This encoding 
> is for 7-bit ASCII, so the guesser when decoding this gets a malformed 
> exception on the 3 bytes at the beginning of the file.  This causes the 
> guesser to conclude this is a "binary" file which doesn't need to be 
> RAT-checked.  The other user was on a Windows 7 machine, which has the 
> file.encoding defaulting to Cp1252 - which does have code points defined for 
> the first 3 bytes, and therefore doesn't throw any exception.  This makes the 
> guesser guess that  this isn't a binary file, and it checks the file and 
> reports a missing header (the file is test data...).
> Workaround - add the file to the explicit excludes.
> Potential problem - on a machine with default encoding US-ASCII, RAT will 
> improperly skip checking files which perhaps should have headers, if they 
> have a UTF-8 byte-order mark.
> Potential problem #2 - RAT is dependent on the default file encoding setting 
> for part of its behavior, causing differences in what it checks.
> I'm not sure what a good solution would be here.  It might range from 
> eliminating the binary "guesser" that looks at the first 200 bytes of a file, 
> to forcing UTF-8 as the charset to use.
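
The encoding dependence described above can be reproduced with a small, self-contained sketch. It strictly decodes a UTF-8 byte-order mark followed by ASCII text with three charsets; the class and method names are hypothetical and this is not the RAT binary guesser itself. On a standard JDK, US-ASCII is expected to reject the three BOM bytes, while windows-1252 (Cp1252) and UTF-8 accept them.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

final class BomDecodeDemo {

    /** Strict decode: returns false if any byte sequence is malformed or unmappable. */
    static boolean decodesCleanly(byte[] bytes, Charset charset) {
        try {
            charset.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // UTF-8 byte-order mark (EF BB BF) followed by plain ASCII text.
        byte[] data = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 'h', 'i'};
        System.out.println("US-ASCII:     " + decodesCleanly(data, StandardCharsets.US_ASCII));
        System.out.println("windows-1252: " + decodesCleanly(data, Charset.forName("windows-1252")));
        System.out.println("UTF-8:        " + decodesCleanly(data, StandardCharsets.UTF_8));
    }
}
```

Passing an explicit charset, as every call here does, removes the dependence on the platform default that the issue identifies as the root cause.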





[jira] [Commented] (RAT-301) Rat check file identification error,java files with Chinese characters are recognized as binary files

2024-04-27 Thread Philipp Ottlinger (Jira)


[ 
https://issues.apache.org/jira/browse/RAT-301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17841539#comment-17841539
 ] 

Philipp Ottlinger commented on RAT-301:
---

[~claude] would you mind adding 
https://github.com/apache/linkis/blob/master/linkis-public-enhancements/linkis-pes-common/src/main/java/org/apache/linkis/udf/entity/UDFVersion.java
as a test?

I'm not sure whether the file looked like this when the bug was filed, but at 
least it contains some Chinese characters.

> Rat check file identification error,java files with Chinese characters are 
> recognized as binary files
> -
>
> Key: RAT-301
> URL: https://issues.apache.org/jira/browse/RAT-301
> Project: Apache Rat
>  Issue Type: Bug
>Affects Versions: 0.13
> Environment: Window  
> Apache Maven 3.3.9 (bb52d8502b132ec0a5a3f4c09453c07478323dc5; 
> 2015-11-11T00:41:47+08:00)
>Reporter: Chen Xia
>Assignee: Claude Warren
>Priority: Major
>
> {code:java}
> // code placeholder
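> <plugin>
>   <groupId>org.apache.rat</groupId>
>   <artifactId>apache-rat-plugin</artifactId>
>   <version>0.13</version>
>   <executions>
>     <execution>
>       <id>rat-validate</id>
>       <phase>validate</phase>
>       <goals>
>         <goal>check</goal>
>       </goals>
>     </execution>
>   </executions>
>   <configuration>
>     <excludes>
>       <exclude>**/*.versionsBackup</exclude>
>       <exclude>**/.idea/</exclude>
>       <exclude>**/*.iml</exclude>
>       <exclude>**/*.txt</exclude>
>       <exclude>**/*.json</exclude>
>       <exclude>web/.editorconfig</exclude>
>       <exclude>web/.env</exclude>
>       <exclude>web/.eslintignore</exclude>
>       <exclude>web/.jshintrc</exclude>
>       <exclude>web/public/favicon.ico</exclude>
>       <exclude>web/dist/**</exclude>
>       <exclude>web/node_modules/**</exclude>
>       <exclude>web/apache-linkis-*-web-bin.tar.gz</exclude>
>       <exclude>**/*.md</exclude>
>       <exclude>.git/</exclude>
>       <exclude>.gitignore</exclude>
>       <exclude>**/.settings/*</exclude>
>       <exclude>**/.classpath</exclude>
>       <exclude>**/.project</exclude>
>       <exclude>**/target/**</exclude>
>       <exclude>**/out/**</exclude>
>       <exclude>**/*.log</exclude>
>       <exclude>CONTRIBUTING.md</exclude>
>       <exclude>CONTRIBUTING_CN.md</exclude>
>       <exclude>DISCLAIMER</exclude>
>       <exclude>DISCLAIMER</exclude>
>       <exclude>README.md</exclude>
>       <exclude>**/META-INF/**</exclude>
>       <exclude>.github/**</exclude>
>       <exclude>compiler/**</exclude>
>       <exclude>**/generated/**</exclude>
>     </excludes>
>   </configuration>
> </plugin>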
>  {code}
> This is the result of {{mvn apache-rat:check}}
> {code:java}
> Summary
> ---
> Generated at: 2022-05-06T09:56:39+08:00
> Notes: 0
> Binaries: 1
> Archives: 0
> Standards: 13
> Apache Licensed: 13
> Generated Documents: 0
> JavaDocs are generated, thus a license header is optional.
> Generated files do not require license headers.
> 0 Unknown Licenses
> *
>   Files with Apache License headers will be marked AL
>   Binary files (which do not require any license headers) will be marked B
>   B 
> D:/DataSphere/linkis_svn/1.1.1-RC1/apache-linkis-1.1.1-incubating-src/apache-linkis-1.1.1-incubating-src/linkis-public-enhancements/linkis-publicservice/linkis-udf/linkis-udf-common/src/main/java/org/apache/linkis/udf/entity/UDFVersion.java
>   AL
> D:/DataSphere/linkis_svn/1.1.1-RC1/apache-linkis-1.1.1-incubating-src/apache-linkis-1.1.1-incubating-src/linkis-public-enhancements/linkis-publicservice/linkis-udf/linkis-udf-common/src/main/java/org/apache/linkis/udf/excepiton/UDFException.java
>   
> * {code}
> UDFVersion.java is recognized as a binary file
> source code: https://github.com/casionone/incubator-linkis/tree/dev-1.1.1-rat





[jira] [Commented] (RAT-246) .gitignore in parent dir not honored

2024-04-27 Thread Philipp Ottlinger (Jira)


[ 
https://issues.apache.org/jira/browse/RAT-246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17841538#comment-17841538
 ] 

Philipp Ottlinger commented on RAT-246:
---

[~basinilya] did you find the opportunity to verify if the problem remains with 
the most current 0.16.1 version as it introduced a new parser of .gitignore 
files? Thanks.

> .gitignore in parent dir not honored
> 
>
> Key: RAT-246
> URL: https://issues.apache.org/jira/browse/RAT-246
> Project: Apache Rat
>  Issue Type: Bug
>Affects Versions: 0.12, 0.13
>Reporter: Ilya Basin
>Priority: Minor
>
> Due to my Eclipse plugins set, when I import a maven project, a .checkstyle 
> file is generated there. As I learned later, RAT 0.13-SNAPSHOT ignores 
> .checkstyle files, so I repeated my tests with a different filename.
> If a pattern is explicitly mentioned in the .gitignore in the project folder, 
> RAT does not complain. However, if the pattern is only mentioned in a parent 
> .gitignore, the RAT check fails.
> {code:java}
> [il@reallin wagon-scm]$ touch .someignoredfile
> [il@reallin wagon-scm]$ echo .someignoredfile >>../../.gitignore
> [il@reallin wagon-scm]$ git add .someignoredfile
> The following paths are ignored by one of your .gitignore files:
> wagon-providers/wagon-scm/.someignoredfile
> Use -f if you really want to add them.
> fatal: no files added
> [il@reallin wagon-scm]$ mvn apache-rat:check
> [ERROR] Failed to execute goal org.apache.rat:apache-rat-plugin:0.12:check 
> (default-cli) on project wagon-scm: Too many files with unapproved license: 1 
> See RAT report in: target/rat.txt -> [Help 1]
> [il@reallin wagon-scm]$ cat target/rat.txt
> Files with unapproved licenses:
>   .someignoredfile
> [il@reallin wagon-scm]$ echo .someignoredfile >>.gitignore
> [il@reallin wagon-scm]$ mvn apache-rat:check
> [INFO] BUILD SUCCESS
> {code}
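
For illustration, a minimal sketch of how a scanner could locate every .gitignore that applies to a module, including those in parent directories. It climbs from the scanned directory until it reaches a directory containing a .git entry. The class and method names are hypothetical, this is not the parser introduced in 0.16.1, and interpreting the patterns inside the files is out of scope here.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

final class GitignoreLocator {

    /**
     * Collects .gitignore files from startDir up to the repository root
     * (the first ancestor containing a .git entry), nearest first.
     * If no .git entry is found, the walk continues to the filesystem root.
     */
    static List<Path> findApplicableGitignores(Path startDir) {
        List<Path> found = new ArrayList<>();
        for (Path dir = startDir.toAbsolutePath().normalize(); dir != null; dir = dir.getParent()) {
            Path candidate = dir.resolve(".gitignore");
            if (Files.isRegularFile(candidate)) {
                found.add(candidate);
            }
            if (Files.exists(dir.resolve(".git"))) {
                break; // reached the repository root, stop climbing
            }
        }
        return found;
    }

    public static void main(String[] args) throws IOException {
        // Recreate the layout from the report in a temporary directory.
        Path root = Files.createTempDirectory("wagon");
        Files.createDirectory(root.resolve(".git"));
        Files.write(root.resolve(".gitignore"), List.of(".someignoredfile"));
        Path module = Files.createDirectories(root.resolve("wagon-providers/wagon-scm"));

        findApplicableGitignores(module).forEach(System.out::println);
    }
}
```

Returning the nearest file first matches how Git lets a deeper .gitignore take precedence over one higher up the tree.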





Re: [PR] RAT-54: Tika based document analyzer [creadur-rat]

2024-04-27 Thread via GitHub


ottlinger commented on code in PR #240:
URL: https://github.com/apache/creadur-rat/pull/240#discussion_r1581906646


##
apache-rat-core/src/main/java/org/apache/rat/report/claim/ClaimStatistic.java:
##
@@ -57,45 +58,71 @@ public int getCounter(Counter counter) {
 return count == null ? 0 : count[0];
 }
 
-/**
- * @return Returns a map with the file types. The map keys
- * are file type names and the map values
- * are integers with the number of resources matching
- * the file type.
- */
-public Map getCounterMap() {
-return counterMap;
+public void incCounter(Counter key, int value) {
+final int[] num = counterMap.get(key);
+
+if (num == null) {
+counterMap.put(key, new int[] { value });
+} else {
+num[0] += value;
+}
 }
 
-
 /**
- * @return Returns a map with the file types. The map keys
- * are file type names and the map values
- * are integers with the number of resources matching
- * the file type.
+ * Returns the counts for the counter.
+ * @param documentType the document type to get the counter for.
+ * @return Returns the number of files with approved licenses.
  */
-public Map getDocumentCategoryMap() {
-return documentCategoryMap;
+public int getCounter(Document.Type documentType) {
+int[] count = documentCategoryMap.get(documentType);
+return count == null ? 0 : count[0];
 }
 
-/**
- * @return Returns a map with the license family codes. The map
- * keys are license family category names,
- * the map values are integers with the number of resources
- * matching the license family code.
- */
-public Map getLicenseFamilyCodeMap() {
-return licenseFamilyCodeMap;
+public void incCounter(Document.Type documentType, int value) {
+final int[] num = documentCategoryMap.get(documentType);
+
+if (num == null) {
+documentCategoryMap.put(documentType, new int[] { value });
+} else {
+num[0] += value;

Review Comment:
   AtomicInteger to be threadsafe?






Re: [PR] RAT-54: Tika based document analyzer [creadur-rat]

2024-04-27 Thread via GitHub


ottlinger commented on code in PR #240:
URL: https://github.com/apache/creadur-rat/pull/240#discussion_r1581906606


##
apache-rat-core/src/main/java/org/apache/rat/report/claim/ClaimStatistic.java:
##
@@ -57,45 +58,71 @@ public int getCounter(Counter counter) {
 return count == null ? 0 : count[0];
 }
 
-/**
- * @return Returns a map with the file types. The map keys
- * are file type names and the map values
- * are integers with the number of resources matching
- * the file type.
- */
-public Map getCounterMap() {
-return counterMap;
+public void incCounter(Counter key, int value) {
+final int[] num = counterMap.get(key);
+
+if (num == null) {
+counterMap.put(key, new int[] { value });
+} else {
+num[0] += value;

Review Comment:
   Are we running into problems here when we do multithreaded analysis? Do we 
need AtomicInteger here?
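
A minimal sketch of what a thread-safe variant of these counters could look like, assuming the maps are switched to ConcurrentHashMap with AtomicInteger values. The class below is hypothetical and generic; it is not the ClaimStatistic implementation from the PR.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

final class ThreadSafeCounters<K> {

    private final Map<K, AtomicInteger> counts = new ConcurrentHashMap<>();

    /** Adds value to the counter for key, creating the counter on first use. */
    void increment(K key, int value) {
        counts.computeIfAbsent(key, k -> new AtomicInteger()).addAndGet(value);
    }

    /** Returns the current count for key, or 0 if it was never incremented. */
    int get(K key) {
        AtomicInteger count = counts.get(key);
        return count == null ? 0 : count.get();
    }

    public static void main(String[] args) throws InterruptedException {
        ThreadSafeCounters<String> stats = new ThreadSafeCounters<>();
        Runnable work = () -> {
            for (int i = 0; i < 10_000; i++) {
                stats.increment("APPROVED", 1);
            }
        };
        Thread a = new Thread(work);
        Thread b = new Thread(work);
        a.start();
        b.start();
        a.join();
        b.join();
        System.out.println(stats.get("APPROVED"));
    }
}
```

With the int[] holders in the diff, both the get-then-put on a missing key and the num[0] += value update can lose increments under concurrent access; computeIfAbsent and addAndGet make each step atomic.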






Re: [PR] RAT-54: Tika based document analyzer [creadur-rat]

2024-04-27 Thread via GitHub


ottlinger commented on code in PR #240:
URL: https://github.com/apache/creadur-rat/pull/240#discussion_r1581906451


##
apache-rat-core/src/main/java/org/apache/rat/ReportConfiguration.java:
##
@@ -179,31 +177,31 @@ public boolean isDryRun() {
 /**
  * @return The filename filter for the potential input files.
  */
-public FilenameFilter getInputFileFilter() {
-return inputFileFilter;
+public FilenameFilter getFilesToIgnore() {
+return filesToIgnore;
 }
 
 /**
- * @param inputFileFilter the filename filter to filter the input files.
+ * @param filesToIgnore the filename filter to filter the input files.
  */
-public void setInputFileFilter(FilenameFilter inputFileFilter) {
-this.inputFileFilter = inputFileFilter;
+public void setFilesToIgnore(FilenameFilter filesToIgnore) {
+this.filesToIgnore = filesToIgnore;
 }
 
-public IOFileFilter getDirectoryFilter() {
-return directoryFilter;
+public IOFileFilter getDirectoriesToIgnore() {
+return directoriesToIgnore;
 }
 
-public void setDirectoryFilter(IOFileFilter directoryFilter) {
-if (directoryFilter == null) {
-this.directoryFilter = FalseFileFilter.FALSE;
+public void setDirectoriesToIgnore(IOFileFilter directoriesToIgnore) {

Review Comment:
   ScanDefault could have boolean hasFiles()/hasDirectories() methods to not 
have to handle null values while interacting with the scan configuration
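
One way to avoid null handling, sketched with only java.io types: normalise a null filter to a shared "matches nothing" instance in the setter, as the removed setDirectoryFilter lines above did with FalseFileFilter.FALSE, and expose a has...() query next to it. ScanSettings and its method names are hypothetical.

```java
import java.io.File;
import java.io.FileFilter;

final class ScanSettings {

    /** Null object: a filter that never matches, so nothing is ignored. */
    private static final FileFilter IGNORE_NOTHING = file -> false;

    private FileFilter directoriesToIgnore = IGNORE_NOTHING;

    /** Stores the filter; null is normalised so callers never see a null filter. */
    void setDirectoriesToIgnore(FileFilter filter) {
        this.directoriesToIgnore = filter == null ? IGNORE_NOTHING : filter;
    }

    FileFilter getDirectoriesToIgnore() {
        return directoriesToIgnore;
    }

    /** True only if a real filter has been configured. */
    boolean hasDirectoriesToIgnore() {
        return directoriesToIgnore != IGNORE_NOTHING;
    }

    public static void main(String[] args) {
        ScanSettings settings = new ScanSettings();
        settings.setDirectoriesToIgnore(null);
        System.out.println(settings.hasDirectoriesToIgnore());
        settings.setDirectoriesToIgnore(File::isHidden);
        System.out.println(settings.hasDirectoriesToIgnore());
    }
}
```

Callers such as a directory walker can then always call getDirectoriesToIgnore().accept(dir) without a null check.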






Re: [PR] RAT-54: Tika based document analyzer [creadur-rat]

2024-04-27 Thread via GitHub


ottlinger commented on code in PR #240:
URL: https://github.com/apache/creadur-rat/pull/240#discussion_r1581906153


##
apache-rat-core/src/main/java/org/apache/rat/Report.java:
##
@@ -452,11 +452,11 @@ private static IReportable getDirectory(String 
baseDirectory, ReportConfiguratio
 }
 
 if (base.isDirectory()) {
-return new DirectoryWalker(base, config.getInputFileFilter(), 
config.getDirectoryFilter());
+return new DirectoryWalker(base, config.getFilesToIgnore(), 
config.getDirectoriesToIgnore());

Review Comment:
   DirectoryWalker would also benefit from a separate ScanDefault class. 
WDYT?






Re: [PR] RAT-54: Tika based document analyzer [creadur-rat]

2024-04-27 Thread via GitHub


ottlinger commented on code in PR #240:
URL: https://github.com/apache/creadur-rat/pull/240#discussion_r1581906050


##
apache-rat-core/src/main/java/org/apache/rat/Defaults.java:
##
@@ -57,6 +62,10 @@ public class Defaults {
 public static final String UNAPPROVED_LICENSES_STYLESHEET = 
"org/apache/rat/unapproved-licenses.xsl";
 
 private final LicenseSetFactory setFactory;
+
+private final FilenameFilter filesToIgnore = 
WildcardFileFilter.builder().setWildcards("*.json").setIoCase(IOCase.INSENSITIVE).get();
+
+private final IOFileFilter directoriesToIgnore = 
NameBasedHiddenFileFilter.HIDDEN;

Review Comment:
   Should we create a new class for these 2 members? A ScanDefault that contains 
the files and directories to ignore?
   
   ScanDefault {
   List filesToIgnore; // wildcards
   List directoriesToIgnore; // not sure if IOFileFilter is the 
correct superclass
   }
   
   WDYT?
   
   This would allow a static version of this configuration to be set in 
Defaults.java, such as
   
   public static ScanDefault RAT_DEFAULT_SCAN = new 
ScanDefault(List.of("*.json"), List.of(NameBasedHiddenFileFilter.HIDDEN));
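
A compilable sketch of the proposal, under stated assumptions: both members are modelled as lists of wildcard strings so the example needs only the JDK, and the ".*" pattern standing in for hidden directories is an assumption, not the behaviour of NameBasedHiddenFileFilter.HIDDEN. The element types in the real class might differ.

```java
import java.util.List;

final class ScanDefault {

    /** Hypothetical default; "*.json" mirrors the wildcard in the diff above. */
    static final ScanDefault RAT_DEFAULT_SCAN =
            new ScanDefault(List.of("*.json"), List.of(".*"));

    private final List<String> filesToIgnore;        // wildcard patterns
    private final List<String> directoriesToIgnore;  // wildcard patterns (assumed)

    ScanDefault(List<String> filesToIgnore, List<String> directoriesToIgnore) {
        // List.copyOf yields unmodifiable lists, so shared defaults cannot be mutated.
        this.filesToIgnore = List.copyOf(filesToIgnore);
        this.directoriesToIgnore = List.copyOf(directoriesToIgnore);
    }

    List<String> getFilesToIgnore() {
        return filesToIgnore;
    }

    List<String> getDirectoriesToIgnore() {
        return directoriesToIgnore;
    }

    public static void main(String[] args) {
        System.out.println("files:       " + RAT_DEFAULT_SCAN.getFilesToIgnore());
        System.out.println("directories: " + RAT_DEFAULT_SCAN.getDirectoriesToIgnore());
    }
}
```

Keeping the holder immutable is what makes a single static instance safe to share between the CLI, Ant, and Maven front ends.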






Re: [PR] RAT-54: Tika based document analyzer [creadur-rat]

2024-04-27 Thread via GitHub


ottlinger commented on code in PR #240:
URL: https://github.com/apache/creadur-rat/pull/240#discussion_r1581905166


##
apache-rat-core/src/main/java/org/apache/rat/Defaults.java:
##
@@ -57,6 +62,10 @@ public class Defaults {
 public static final String UNAPPROVED_LICENSES_STYLESHEET = 
"org/apache/rat/unapproved-licenses.xsl";
 
 private final LicenseSetFactory setFactory;
+
+private final FilenameFilter filesToIgnore = 
WildcardFileFilter.builder().setWildcards("*.json").setIoCase(IOCase.INSENSITIVE).get();
+
+private final IOFileFilter directoriesToIgnore = 
NameBasedHiddenFileFilter.HIDDEN;

Review Comment:
   Would it make sense to add these 2 special cases to the documentation?
   Didn't you (or JB) add a configuration option to scan for hidden files?


