On Apr 23, 8:01 am, krishnaposti...@gmail.com wrote:
> My quick attempt is below:
> obj = re.compile(r'\b[0-9|a-zA-Z]+[\w-]+')

1. Provided the remainder of the pattern is greedy and it will be used only for findall, the \b seems pointless.
2. What is the "|" for? Inside a character class, | has no special meaning, and will match a literal "|" character (which isn't part of your stated requirement).
3. \w will match underscore "_" ... not in your requirement.
4. Re [\w-] : manual says "If you want to include a ']' or a '-' inside a set, precede it with a backslash, or place it as the first character" which IIRC is the usual advice given with just about any regex package -- actually, placing it at the end works but relying on undocumented behaviour when there are alternatives that are as easy to use and are documented is not a good habit to get into :-)
5. You have used "+" twice; does this mean a minimum length of 2 is part of your requirement?

> >>> re.findall(obj, 'TestThis;1234;Test123AB-x')
>
> ['TestThis', '1234', 'Test123AB-x']
>
> This is not working.
>
> Requirements:
> The text must contain a combination of numbers, alphabets and hyphen
> with at least two of the three elements present.

Unfortunately(?), regular expressions can't express complicated conditions like that.

> I can use it to set
> min length using ) {}

I presume that you mean enforcing a minimum length of (say) 4 by using {4,} in the pattern ...

You are already faced with the necessity of filtering out unwanted matches programmatically. You might as well leave the length check until then.

So: first let's establish what the pattern should be, ignoring the "2 or more out of 3 classes" rule and the length rule.

First character: Digits? Maybe not. Hyphen? Probably not.
Last character: Hyphen? Probably not.
Other characters: Any of (ASCII) letters, digits, hyphen.

So based on my guesses for answers to the above questions, the pattern should be

    r"[A-Za-z][-A-Za-z0-9]*[A-Za-z0-9]"

Note: this assumes that your data is impeccably clean, and there isn't any such data outside textbooks.
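
To make points 2 and 3 concrete, and to show what the suggested pattern picks out of your sample string, here is a small sketch. The test strings 'ab|12' and 'foo_bar-1' are made up for illustration, and the pattern still rests on my guesses about first and last characters:

```python
import re

# Point 2: inside a character class, "|" is just a literal character,
# so it gets swallowed into the match.
pipe_matches = re.findall(r'[0-9|a-z]+', 'ab|12')
print(pipe_matches)         # ['ab|12']

# Point 3: \w also matches the underscore.
underscore_matches = re.findall(r'[\w-]+', 'foo_bar-1')
print(underscore_matches)   # ['foo_bar-1']

# The suggested pattern: a letter first, then letters/digits/hyphens,
# ending in a letter or digit. Hyphen is placed first in the class.
pattern = re.compile(r'[A-Za-z][-A-Za-z0-9]*[A-Za-z0-9]')
candidates = pattern.findall('TestThis;1234;Test123AB-x')
print(candidates)           # ['TestThis', 'Test123AB-x']
```

'1234' drops out because the pattern insists on a leading letter. 'TestThis' is still returned, since the "2 out of 3 classes" rule has to be applied afterwards.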

You may wish to make the pattern less restrictive, so that you can pick up probable mistakes like "A123- 456" instead of "A123-456".

Checking a candidate returned by findall could be done something like this:

    # initial setup:
    import string
    alpha_set = set(string.ascii_letters)
    digit_set = set('1234567890')
    min_len = 4 # for example

    # each candidate (cand is one string returned by findall):
    cand_set = set(cand)
    # count how many of the three classes are present; need at least 2
    ok = len(cand) >= min_len and (
        bool(cand_set & alpha_set)
        + bool(cand_set & digit_set)
        + bool('-' in cand_set)
        ) >= 2

HTH,
John
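
P.S. Putting the pattern and the per-candidate check together, a complete run might look like the sketch below. The function name find_valid and the default min_len=4 are only placeholders of mine, and the pattern still rests on my guesses about which characters may start and end a token:

```python
import re
import string

ALPHA_SET = set(string.ascii_letters)
DIGIT_SET = set(string.digits)

# Assumed pattern: leading letter, trailing letter or digit.
PATTERN = re.compile(r'[A-Za-z][-A-Za-z0-9]*[A-Za-z0-9]')

def find_valid(text, min_len=4):
    """Return candidates that are long enough and contain at least
    two of the three classes: letters, digits, hyphen."""
    results = []
    for cand in PATTERN.findall(text):
        cand_set = set(cand)
        # Booleans add up as 0/1, giving the number of classes present.
        classes_present = (
            bool(cand_set & ALPHA_SET)
            + bool(cand_set & DIGIT_SET)
            + ('-' in cand_set)
        )
        if len(cand) >= min_len and classes_present >= 2:
            results.append(cand)
    return results

print(find_valid('TestThis;1234;Test123AB-x'))   # ['Test123AB-x']
```

Keeping the class count and the length check in Python rather than in the pattern means either rule can be changed without touching the regex.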