LiJie20190102 commented on PR #10014:
URL: https://github.com/apache/seatunnel/pull/10014#issuecomment-3491147147

   > /home/ftp_test/test_json/json[^/]*/.*.json
   
   
   
   > Thank you for your contribution.
   > 
   > The following tasks need to be done before merged:
   > 
   > 1. Modify the code to maintain backward compatibility:
   > 
   > * Support both path matching and filename matching
   > * Add detailed code comments to explain the behavior
   > 
   > 2. Add the related test cases
   > 3. Please update the documentation and the configuration description of `FILE_FILTER_PATTERN`
   > 
   > Here is an example code you can refer to.
   > 
   > ```
   > protected boolean filterFileByPattern(FileStatus fileStatus) {
   >     if (Objects.nonNull(pattern)) {
   >         String fullPath = fileStatus.getPath().toUri().getPath();
   >         String fileName = fileStatus.getPath().getName();
   >         
   >         // Match against both full path and file name for maximum compatibility.
   >         // This allows users to use either path-based patterns (e.g., "/path/to/dir/.*.json")
   >         // or name-based patterns (e.g., "e2e_filter.*")
   >         boolean matches = pattern.matcher(fullPath).matches() || 
   >                          pattern.matcher(fileName).matches();
   >         
   >         if (log.isDebugEnabled()) {
   >             log.debug("Filtering file: fullPath={}, fileName={}, pattern={}, matches={}",
   >                      fullPath, fileName, pattern.pattern(), matches);
   >         }
   >         
   >         return matches;
   >     }
   >     return true;
   > }
   > ```
   
   I think using `pattern.matcher(fullPath).matches() || pattern.matcher(fileName).matches()` is unreasonable, because it can return files that the user does not want. For example, take the following files:

   * `/home/ftp_test/test_json/reporttxt`
   * `/home/ftp_test/json/report.txt`
   * `/home/ftp_test/json/abch202410.json`
   * `/home/ftp_test/json/abcg202410.json`
   * `/home/ftp_test/txt/old_data.csv`
   * `/home/ftp_test/txt/aa.json`

   If the user wants to match third-level folders whose names start with `json` and files ending with `.json`, the regular expression would be `/home/ftp_test/json[^/]*/.*.json`. With the combined match, the `aa.json` file would also pass the filter, which I don't think is what the user wants.
   
   In summary, my idea is this: if the user's pattern starts with an absolute path, that is, with the value of the `path` parameter, they want to filter on both the directory and the file, so the pattern should be matched against `fullPath`. If the pattern is just a regular expression for file names, it should be matched against `fileName`. I'm not sure I described this clearly. What do you think, @davidzollo?
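
To make the proposal concrete, here is a minimal sketch of the selection rule, not the actual SeaTunnel implementation. The class `FileFilterSketch`, the method `accepts`, and the `basePath` argument (standing in for the `path` option) are hypothetical names used only for illustration, and treating "the regex text starts with the base path" as the test for a path-based pattern is an assumption.

```java
import java.util.regex.Pattern;

public class FileFilterSketch {

    /**
     * Hypothetical helper: picks the string the pattern is matched against.
     * basePath stands in for the connector's 'path' option (an assumption).
     */
    static boolean accepts(String basePath, String regex, String fullPath) {
        Pattern pattern = Pattern.compile(regex);
        // The file name is whatever follows the last '/' in the full path.
        String fileName = fullPath.substring(fullPath.lastIndexOf('/') + 1);
        // Assumption: a regex that begins with the base path is path-based,
        // anything else is treated as a file-name pattern.
        String target = regex.startsWith(basePath) ? fullPath : fileName;
        // matches() requires the whole target string to match the pattern.
        return pattern.matcher(target).matches();
    }

    public static void main(String[] args) {
        String basePath = "/home/ftp_test";
        String[] files = {
            "/home/ftp_test/json/report.txt",
            "/home/ftp_test/json/abch202410.json",
            "/home/ftp_test/json/abcg202410.json",
            "/home/ftp_test/txt/old_data.csv",
            "/home/ftp_test/txt/aa.json"
        };
        String pathRegex = "/home/ftp_test/json[^/]*/.*.json"; // path-based
        String nameRegex = "abc.*\\.json";                     // name-based (illustrative)
        for (String file : files) {
            System.out.printf("%s path-based=%b name-based=%b%n",
                    file,
                    accepts(basePath, pathRegex, file),
                    accepts(basePath, nameRegex, file));
        }
    }
}
```

With this rule each pattern is evaluated against exactly one string, so a path-based pattern can never be satisfied by the file name alone. A plain `startsWith` check is a simplification: it would need more care if the base path itself contains regex metacharacters.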

