[ https://issues.apache.org/jira/browse/VALIDATOR-501?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18018592#comment-18018592 ]
Kushal Dixit edited comment on VALIDATOR-501 at 9/7/25 5:29 AM: ---------------------------------------------------------------- Hi [~ggregory] , I see you have added test cases in revision [319400a .|https://github.com/apache/commons-validator/commit/319400a5ab8a694dd8ebc6131ae3fd20c9db0edd]. I'm not sure if a fix is already available, but I'd like to share a potential solution I've been working on. *RCA:* Validation for leading and trailing hyphens (and related patterns) is already implemented. The issue arises with domains containing special characters, such as {*}-tést.fr{*}. These get converted to punycode (e.g., {*}{{xn--tst-cpa.fr}}{*}). In the punycode form, there are no visible leading or trailing hyphens, so the domain appears valid even though the original form had an invalid leading hyphen. While the issue could potentially be solved by restructuring the existing regex validation and {{unicodeToASCII}} conversion, I believe adding an efficient pre-validation utility method would be a cleaner approach. This method would check for RFC-violating hyphens *before* the IDN conversion takes place, preventing the conversion from masking the original violations. {code:java} public boolean isValid(final String domain) { if (domain == null) { return false; } // Early checking for hyphens (-) if (hasLeadingOrTrailingHyphensInLabel(domain)) { return false; } // ... existing code } private boolean hasLeadingOrTrailingHyphensInLabel(final String domain) { // early exit final int n = domain.length(); if (n == 0) { return false; } // remove leading white spaces. int start = 0; while (start < n && Character.isWhitespace(domain.charAt(start))) { start++; } if (start == n) { return false; } // remove trailing white spaces. int end = n - 1; while (end >= start && Character.isWhitespace(domain.charAt(end))) { end--; } // check first and last char is (-) just for early exit if (domain.charAt(start) == '-' || domain.charAt(end) == '-') { return true; } for (int i = start + 1; i <= end; i++) { if (domain.charAt(i) == '.') { // Find last non-whitespace before (.) int prevPos = i - 1; while (prevPos >= start && Character.isWhitespace(domain.charAt(prevPos))) { prevPos--; } if (prevPos >= start && domain.charAt(prevPos) == '-') { return true; } // Find first non-whitespace after (.) int nextPos = i + 1; while (nextPos < end && Character.isWhitespace(domain.charAt(nextPos))) { nextPos++; } if (nextPos < end && domain.charAt(nextPos) == '-') { return true; } } } // finally return false if (-) is found no-where. return false; }{code} This is most efficient version of {{hasLeadingOrTrailingHyphensInLabel.}} If code readability is prioritised over performance, simpler implementations using {{{}regex{}}}, {{trim()}} and {{split()}} would also work, but the core principle remains the same: {*}validate hyphen placement before IDN conversion{*}. We can add an early length check in the utility method to quickly reject obviously invalid domains. This way, the complexity stays bounded by *O(MAX_DOMAIN_LENGTH)* rather than scaling with input size. {code:java} if (domain.length() > MAX_DOMAIN_LENGTH) { return true; // invalid domain } {code} Let me know your thoughts on this approach! {*}Note{*}: I have tested with different scenarios and it seems to be working for most of the cases, I found white spaces were creating some problem for ex. {{www. -google.com}} hence I handled it through different while loops. was (Author: JIRAUSER310903): Hi [~ggregory] , I see you have added test cases in revision [319400a .|https://github.com/apache/commons-validator/commit/319400a5ab8a694dd8ebc6131ae3fd20c9db0edd]. I'm not sure if a fix is already available, but I'd like to share a potential solution I've been working on. *RCA:* Validation for leading and trailing hyphens (and related patterns) is already implemented. The issue arises with domains containing special characters, such as {{{}-tést.fr{}}}. These get converted to punycode (e.g., {{{}xn---tst-cpa.fr{}}}). In the punycode form, there are no visible leading or trailing hyphens, so the domain appears valid even though the original form had an invalid leading hyphen. While the issue could potentially be solved by restructuring the existing regex validation and {{unicodeToASCII}} conversion, I believe adding an efficient pre-validation utility method would be a cleaner approach. This method would check for RFC-violating hyphens *before* the IDN conversion takes place, preventing the conversion from masking the original violations. {code:java} public boolean isValid(final String domain) { if (domain == null) { return false; } // Early checking for hyphens (-) if (hasLeadingOrTrailingHyphensInLabel(domain)) { return false; } // ... existing code } private boolean hasLeadingOrTrailingHyphensInLabel(final String domain) { // early exit final int n = domain.length(); if (n == 0) { return false; } // remove leading white spaces. int start = 0; while (start < n && Character.isWhitespace(domain.charAt(start))) { start++; } if (start == n) { return false; } // remove trailing white spaces. int end = n - 1; while (end >= start && Character.isWhitespace(domain.charAt(end))) { end--; } // check first and last char is (-) just for early exit if (domain.charAt(start) == '-' || domain.charAt(end) == '-') { return true; } for (int i = start + 1; i <= end; i++) { if (domain.charAt(i) == '.') { // Find last non-whitespace before (.) int prevPos = i - 1; while (prevPos >= start && Character.isWhitespace(domain.charAt(prevPos))) { prevPos--; } if (prevPos >= start && domain.charAt(prevPos) == '-') { return true; } // Find first non-whitespace after (.) int nextPos = i + 1; while (nextPos < end && Character.isWhitespace(domain.charAt(nextPos))) { nextPos++; } if (nextPos < end && domain.charAt(nextPos) == '-') { return true; } } } // finally return false if (-) is found no-where. return false; }{code} This is most efficient version of {{hasLeadingOrTrailingHyphensInLabel.}} If code readability is prioritised over performance, simpler implementations using {{{}regex{}}}, {{trim()}} and {{split()}} would also work, but the core principle remains the same: {*}validate hyphen placement before IDN conversion{*}. We can add an early length check in the utility method to quickly reject obviously invalid domains. This way, the complexity stays bounded by *O(MAX_DOMAIN_LENGTH)* rather than scaling with input size. {code:java} if (domain.length() > MAX_DOMAIN_LENGTH) { return true; // invalid domain } {code} Let me know your thoughts on this approach! {*}Note{*}: I have tested with different scenarios and it seems to be working for most of the cases, I found white spaces were creating some problem for ex. {{www. -google.com}} hence I handled it through different while loops. > DomainValidator accepts hyphens at start/end of domain name with Unicode > characters > ----------------------------------------------------------------------------------- > > Key: VALIDATOR-501 > URL: https://issues.apache.org/jira/browse/VALIDATOR-501 > Project: Commons Validator > Issue Type: Bug > Reporter: Victoria Dimitrova > Priority: Major > Labels: BUG, validator > > The method `DomainValidator.getInstance().isValid()` returns `true` for > invalid domain names that start or end with a hyphen when the domain name > contains Unicode characters. > This behavior is incorrect according to the domain name specifications (RFC > 1035, RFC 5890), which do not allow: > - Domain labels starting or ending with a hyphen (`-`) > Test case that should pass but fails on version 1.10.0: > {code:java} > @Test > void shouldBeInvalid() > { > assertAll(() -> > assertFalse(DomainValidator.getInstance().isValid("-test.fr")), > () -> assertFalse(DomainValidator.getInstance().isValid("-tést.fr")), > () -> assertFalse(DomainValidator.getInstance().isValid("test-.fr")), > () -> assertFalse(DomainValidator.getInstance().isValid("tést-.fr"))); > } > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010)