On Tuesday, 3 December 2019 23:48:21 UTC+8, Peter Otten wrote:
> A S wrote:
>
> > On Tuesday, 3 December 2019 01:01:25 UTC+8, Peter Otten wrote:
> >> A S wrote:
> >>
> >> I think I've seen this question before ;)
> >>
> >> > I am trying to extract all strings in nested parentheses (along with
> >> > the parentheses themselves) in my .txt file. Please see the sample .txt
> >> > file that I have used in this example here:
> >> > (https://drive.google.com/open?id=1UKc0ZgY9Fsz5O1rSeBCLqt5dwZkMaQgr).
> >> >
> >> > I have tried and done up three different codes but none of them seems
> >> > to be able to extract all the nested parentheses. They can only extract
> >> > a portion of the nested parentheses. Any advice on what I've done wrong
> >> > could really help!
> >> >
> >> > Here are the three codes I have done so far:
> >> >
> >> > 1st attempt:
> >> >
> >> > import re
> >> > from os.path import join
> >> >
> >> > def balanced_braces(args):
> >> >     parts = []
> >> >     for arg in args:
> >> >         if '(' not in arg:
> >> >             continue
> >>
> >> There could still be a ")" that you miss
> >>
> >> >         chars = []
> >> >         n = 0
> >> >         for c in arg:
> >> >             if c == '(':
> >> >                 if n > 0:
> >> >                     chars.append(c)
> >> >                 n += 1
> >> >             elif c == ')':
> >> >                 n -= 1
> >> >                 if n > 0:
> >> >                     chars.append(c)
> >> >                 elif n == 0:
> >> >                     parts.append(''.join(chars).lstrip().rstrip())
> >> >                     chars = []
> >> >             elif n > 0:
> >> >                 chars.append(c)
> >> >     return parts
> >>
> >> It's probably easier to understand and implement when you process the
> >> complete text at once. Then arbitrary splits don't get in the way of your
> >> quest for ( and ). You just have to remember the position of the first
> >> opening ( and number of opening parens that have to be closed before you
> >> take the complete expression:
> >>
> >> level: 00011112222100
> >> text:  abc(def(gh))ij
> >>   when we are here^
> >>    we need^
> >>
> >> A tentative implementation:
> >>
> >> $ cat parse.py
> >> import re
> >>
> >> NOT_SET = object()
> >>
> >> def scan(text):
> >>     level = 0
> >>     start = NOT_SET
> >>     for m in re.compile("[()]").finditer(text):
> >>         if m.group() == ")":
> >>             level -= 1
> >>             if level < 0:
> >>                 raise ValueError("underflow: more closing than opening parens")
> >>             if level == 0:
> >>                 # outermost closing parenthesis:
> >>                 # deliver enclosed string including parens.
> >>                 yield text[start:m.end()]
> >>                 start = NOT_SET
> >>         elif m.group() == "(":
> >>             if level == 0:
> >>                 # outermost opening parenthesis: remember position.
> >>                 assert start is NOT_SET
> >>                 start = m.start()
> >>             level += 1
> >>         else:
> >>             assert False
> >>     if level > 0:
> >>         raise ValueError("unclosed parens remain")
> >>
> >>
> >> if __name__ == "__main__":
> >>     with open("lan sample text file.txt") as instream:
> >>         text = instream.read()
> >>     for chunk in scan(text):
> >>         print(chunk)
> >> $ python3 parse.py
> >> ("xE'", PUT(xx.xxxx.),"'")
> >> ("TRUuuuth")
> >
> > Hello Peter! I tried this on my actual working files and it returned this
> > error: "unclosed parens remain". In this case, how can I continue to parse
> > through my text files by only extracting those with balanced parentheses
> > and ignore those that are incomplete?
>
> import sys
>
> filenames = ...
> for filename in filenames:
>     with open(filename) as instream:
>         text = instream.read()
>     try:
>         chunks = list(scan(text))
>     except ValueError as err:
>         print(f"{err} in file {filename!r}", file=sys.stderr)
>     else:
>         for chunk in chunks:
>             print(chunk)

hey Peter, it works! Thank you :)
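
A note on the loop quoted above: it reports and skips a whole file as soon as scan() raises, so balanced groups in that file are lost too. If the aim is to keep every balanced group and drop only the incomplete ones, one possible variant is sketched below. It is not from the thread: the name scan_tolerant and the recovery policy (ignore a stray ")" and ignore any "(" that is never closed, but still keep balanced groups nested inside it) are assumptions made for illustration.

```python
import re

PAREN = re.compile(r"[()]")


def scan_tolerant(text):
    """Return the outermost balanced parenthesised substrings of text.

    Hypothetical helper: unmatched "(" and ")" are skipped, not reported.
    """
    open_positions = []  # indices of "(" still waiting for their ")"
    spans = []           # (start, end) of outermost balanced groups so far
    for m in PAREN.finditer(text):
        if m.group() == "(":
            open_positions.append(m.start())
        elif open_positions:
            start = open_positions.pop()
            # Groups that opened after `start` are nested in the new group,
            # so the new, larger group replaces them.
            while spans and spans[-1][0] > start:
                spans.pop()
            spans.append((start, m.end()))
        # else: a ")" with no matching "(" is ignored
    return [text[start:end] for start, end in spans]


if __name__ == "__main__":
    print(scan_tolerant("abc(def(gh))ij"))   # ['(def(gh))']
    print(scan_tolerant("x) (a) (b (c) d"))  # ['(a)', '(c)']
```

Unlike scan(), this returns a list rather than yielding lazily, because whether a group is outermost is only known once the whole text has been read. It also stays silent about malformed input, so the try/except version is still the better choice when unbalanced files should be flagged.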