Re: Extract sentences in nested parentheses using Python

DL Neil via Python-list Mon, 02 Dec 2019 14:27:06 -0800

On 3/12/19 6:00 AM, Peter Otten wrote:

A S wrote:
I think I've seen this question before ;)

In addition to 'other reasons' for @Peter's comment, it is a commonComSc worked-problem or assignment. (in which case, we'd appreciatebeing told that you/OP is asking for help with "homework")

I am trying to extract all strings in nested parentheses (along with the
parentheses itself) in my .txt file. Please see the sample .txt file that
I have used in this example here:
(https://drive.google.com/open?id=1UKc0ZgY9Fsz5O1rSeBCLqt5dwZkMaQgr).

I have tried and done up three different codes but none of them seems to
be able to extract all the nested parentheses. They can only extract a
portion of the nested parentheses. Any advice on what I've done wrong
could really help!

One approach is to research in the hope that there are already existingtools or techniques which may help/save you from 'reinventing the wheel'- when you think about it, a re-statement of open-source objectives.

How does the Python interpreter break-down Python (text) code into itsconstituent parts ("tokens") *including* parentheses? Such are madeavailable in (perhaps) a lesser-known corner of the PSL (Python StandardLibrary). Might you be able to use one such tool?

The ComSc technique which sprang to mind involves "stacks" (a LIFO datastructure) and "RPN" (Reverse Polish Notation). Whereas we like peopleto take their turn when it comes to limited resources, eg to form a"queue" to purchase/pay for goods at the local store, which is "FIFO"(first-in, first-out); a "stack"/LIFO (last-in, first-out) can beproblematic to put into practical application. There are plenty ofPython implementations or you can 'roll your own' with a list. Again,I'd likely employ a "deque" from the PSL's Collections library (althoughas a "stack" rather than as a "double-ended queue"), because theoptimisation comes "free". (to my laziness, but after some kind soulsweated-bullets to make it fast (in both senses) for 'the rest of us'!)

It's probably easier to understand and implement when you process the
complete text at once. Then arbitrary splits don't get in the way of your
quest for ( and ). You just have to remember the position of the first
opening ( and number of opening parens that have to be closed before you
take the complete expression:


+1
but as a 'silver surfer', I don't like to be asked to "remember" anything!

level:  00011112222100
     we need^



Consider:

original_text (the contents of the .txt file - add buffering if volumesare huge)

current_text (the characters we have processed/"recognised" so-far)

stack (what an original name for such a data-structure! Which willcontain each of the partial parenthetical expressions found - but yet tobe proven/made complete)


set current_text to NULSTRING
for each current_character in original_text:
        if current_character is LEFT_PARENTHESIS:
                push current_text to stack
                set current_text to LEFT_PARENTHESIS
        concatenate current_character with current_text
        if current_character is RIGHT_PARENTHESIS:
                # current_text is a parenthetical expression
                # do with it what you will
                pop the stack
                set current_text to the ex-stack string \
                        concat current_text's p-expn

Once working: cover 'special cases' (after above loop), eg original_textwhich doesn't begin and/or end with parentheses; and error cases, egpop-ping a NULSTRING, or thinking things are finished but the stack isnot yet empty - likely events from unbalanced parentheses!


original text = "abc(def(gh))ij"

event 1: in-turn, concatenate characters "abc" as current_text
event 2: locate (first) left-parenthesis, push current_text to stack(&)
event 3: concatenate "(def"
event 4: push, likewise
event 5: concatenate "(gh"

event 6: locate (first) right-parenthesis (matches to left-parenthesisbegining the current_string!)

result?: ?print current_text?
event 7: pop stack and redefine current_text as "(def(gh)"
event 8: repeat, per event 6
event 9: current_text will now become "(def(gh))"
event 10: (special case) run-out of input at "(def(gh)ij"
event 11: (special case) pop (!any) stack and report "abc(def(gh)"

NB not sure of need for a "level" number; but if required, you can inferthat at any time, from the depth/len() of the stack!

NBB being a simple-boy, my preference is for a 'single layer' of code,cf @Peter's generator. Regardless the processes are "tokenisation" and"recognition".

At the back of my mind, was the notion that you may (eventually) berequired to work with more than parentheses, eg pair-wisesquare-brackets and/or quotation-marks. In which case, you will need toalso 'push' the token and check for token-pairs when 'pop-ping', as wellas (perhaps) recognising lists of tokens to tokenise instead of the twoparenthesis characters alone. In which case, I'd take a serious look atthe Python Language Services rather than taking a 'roll your own' approach!

Contrarily, if this spec is 'it', then you might consider optimising thesearch processes which 'trigger' the two stack operations, by re-workingthe for-loop and utilising string.find() - prioritising whicheverparenthesis is left-most/comes-first - assuming LtR text. (apologies ifyou have already tried this in one of your previous approaches)Unfortunately, such likely results in 'layers' of code, and a generatormight well become the tool-of-choice (I say this before @Peter comesback and (quite deservedly) flays me alive!).



WebRefs:
Python Language Services: https://docs.python.org/3/library/language.html

collections — Container datatypes:https://docs.python.org/3/library/collections.html


See also your ComSc text/reference materials.
--
Regards =dn
--
https://mail.python.org/mailman/listinfo/python-list

Re: Extract sentences in nested parentheses using Python

Reply via email to