LiJie20190102 commented on PR #10014:
URL: https://github.com/apache/seatunnel/pull/10014#issuecomment-3491147147
> /home/ftp_test/test_json/json[^/]*/.*.json
> Thank you for your contribution.
>
> The following tasks need to be done before merged:
>
> 1. Modify the code to maintain backward compatibility:
>
> * Support both path matching and filename matching
> * Add detailed code comments to explain the behavior
>
> 2. Add the related test cases
> 3. Please update the documentation and the configuration description of `FILE_FILTER_PATTERN`
>
> Here is an example code you can refer to.
>
> ```
> protected boolean filterFileByPattern(FileStatus fileStatus) {
>     if (Objects.nonNull(pattern)) {
>         String fullPath = fileStatus.getPath().toUri().getPath();
>         String fileName = fileStatus.getPath().getName();
>
>         // Match against both full path and file name for maximum compatibility
>         // This allows users to use either path-based patterns (e.g., "/path/to/dir/.*.json")
>         // or name-based patterns (e.g., "e2e_filter.*")
>         boolean matches = pattern.matcher(fullPath).matches()
>                 || pattern.matcher(fileName).matches();
>
>         if (log.isDebugEnabled()) {
>             log.debug("Filtering file: fullPath={}, fileName={}, pattern={}, matches={}",
>                     fullPath, fileName, pattern.pattern(), matches);
>         }
>
>         return matches;
>     }
>     return true;
> }
> ```
I think using `pattern.matcher(fullPath).matches() || pattern.matcher(fileName).matches()`
is unreasonable, because it can return files that the user does not want. For
example, given these files:
`/home/ftp_test/test_json/reporttxt`
`/home/ftp_test/json/report.txt`
`/home/ftp_test/json/abch202410.json`
`/home/ftp_test/json/abcg202410.json`
`/home/ftp_test/txt/old_data.csv`
`/home/ftp_test/txt/aa.json`
Suppose the user wants to match third-level folders starting with test_json and
files ending with .json, using the regular expression
`/home/ftp_test/json[^/]*/.*.json`. In that case the `aa.json` file would also
be selected, but I don't think that is what the user wants.
In summary, my idea is this: if the user's pattern starts with an absolute path,
that is, with the value of the `path` parameter, they want to filter on both the
directory and the file, so the pattern should be matched against `fullPath`. If
the pattern is only a regular expression for file names, it should be matched
against `fileName`. I'm not sure I have described this clearly. What do you
think, @davidzollo?
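
To make the proposal concrete, here is a minimal sketch of the selection rule described above. It is not the SeaTunnel implementation: the class `FileFilterSketch` and the method `matchesByPatternKind` are hypothetical names, the check "pattern text starts with `/`" is only one assumed way to detect a path-based pattern, and plain strings stand in for Hadoop's `FileStatus`.

```java
import java.util.regex.Pattern;

// Hypothetical helper for illustration only; not part of SeaTunnel.
final class FileFilterSketch {

    private FileFilterSketch() {}

    /**
     * Assumption: a pattern whose text begins with "/" is a path-based
     * pattern and is matched against the full path. Any other pattern is
     * treated as name-based and matched against the file name only.
     */
    static boolean matchesByPatternKind(Pattern pattern, String fullPath) {
        if (pattern == null) {
            return true; // no pattern configured: keep every file
        }
        boolean pathBased = pattern.pattern().startsWith("/");
        // The file name is everything after the last "/" in the path.
        String fileName = fullPath.substring(fullPath.lastIndexOf('/') + 1);
        String target = pathBased ? fullPath : fileName;
        // matches() requires the whole target to match, not a substring.
        return pattern.matcher(target).matches();
    }
}
```

With this rule, a pattern such as `/home/ftp_test/json[^/]*/.*\.json` is only ever compared with full paths, and a pattern such as `abc.*\.json` is only ever compared with file names. One limitation of the assumed detection is that a path pattern written with a leading anchor or group, for example `^/home/...`, would be treated as name-based, so a real implementation would probably need a more robust check or an explicit option.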