On 3/12/19 6:00 AM, Peter Otten wrote:
A S wrote:
I think I've seen this question before ;)

In addition to 'other reasons' for @Peter's comment, it is a common ComSc worked-problem or assignment. (in which case, we'd appreciate being told that you/OP is asking for help with "homework")


I am trying to extract all strings in nested parentheses (along with the
parentheses itself) in my .txt file. Please see the sample .txt file that
I have used in this example here:
(https://drive.google.com/open?id=1UKc0ZgY9Fsz5O1rSeBCLqt5dwZkMaQgr).

I have tried and done up three different codes but none of them seems to
be able to extract all the nested parentheses. They can only extract a
portion of the nested parentheses. Any advice on what I've done wrong
could really help!

One approach is to research in the hope that there are already existing tools or techniques which may help/save you from 'reinventing the wheel' - when you think about it, a re-statement of open-source objectives.

How does the Python interpreter break-down Python (text) code into its constituent parts ("tokens") *including* parentheses? Such are made available in (perhaps) a lesser-known corner of the PSL (Python Standard Library). Might you be able to use one such tool?

The ComSc technique which sprang to mind involves "stacks" (a LIFO data structure) and "RPN" (Reverse Polish Notation). Whereas we like people to take their turn when it comes to limited resources, eg to form a "queue" to purchase/pay for goods at the local store, which is "FIFO" (first-in, first-out); a "stack"/LIFO (last-in, first-out) can be problematic to put into practical application. There are plenty of Python implementations or you can 'roll your own' with a list. Again, I'd likely employ a "deque" from the PSL's Collections library (although as a "stack" rather than as a "double-ended queue"), because the optimisation comes "free". (to my laziness, but after some kind soul sweated-bullets to make it fast (in both senses) for 'the rest of us'!)


It's probably easier to understand and implement when you process the
complete text at once. Then arbitrary splits don't get in the way of your
quest for ( and ). You just have to remember the position of the first
opening ( and number of opening parens that have to be closed before you
take the complete expression:

+1
but as a 'silver surfer', I don't like to be asked to "remember" anything!


level:  00011112222100
     we need^


Consider:
original_text (the contents of the .txt file - add buffering if volumes are huge)
current_text (the characters we have processed/"recognised" so-far)
stack (what an original name for such a data-structure! Which will contain each of the partial parenthetical expressions found - but yet to be proven/made complete)

set current_text to NULSTRING
for each current_character in original_text:
        if current_character is LEFT_PARENTHESIS:
                push current_text to stack
                set current_text to LEFT_PARENTHESIS
        concatenate current_character with current_text
        if current_character is RIGHT_PARENTHESIS:
                # current_text is a parenthetical expression
                # do with it what you will
                pop the stack
                set current_text to the ex-stack string \
                        concat current_text's p-expn

Once working: cover 'special cases' (after above loop), eg original_text which doesn't begin and/or end with parentheses; and error cases, eg pop-ping a NULSTRING, or thinking things are finished but the stack is not yet empty - likely events from unbalanced parentheses!

original text = "abc(def(gh))ij"

event 1: in-turn, concatenate characters "abc" as current_text
event 2: locate (first) left-parenthesis, push current_text to stack(&)
event 3: concatenate "(def"
event 4: push, likewise
event 5: concatenate "(gh"
event 6: locate (first) right-parenthesis (matches to left-parenthesis begining the current_string!)
result?: ?print current_text?
event 7: pop stack and redefine current_text as "(def(gh)"
event 8: repeat, per event 6
event 9: current_text will now become "(def(gh))"
event 10: (special case) run-out of input at "(def(gh)ij"
event 11: (special case) pop (!any) stack and report "abc(def(gh)"


NB not sure of need for a "level" number; but if required, you can infer that at any time, from the depth/len() of the stack!

NBB being a simple-boy, my preference is for a 'single layer' of code, cf @Peter's generator. Regardless the processes are "tokenisation" and "recognition".

At the back of my mind, was the notion that you may (eventually) be required to work with more than parentheses, eg pair-wise square-brackets and/or quotation-marks. In which case, you will need to also 'push' the token and check for token-pairs when 'pop-ping', as well as (perhaps) recognising lists of tokens to tokenise instead of the two parenthesis characters alone. In which case, I'd take a serious look at the Python Language Services rather than taking a 'roll your own' approach!

Contrarily, if this spec is 'it', then you might consider optimising the search processes which 'trigger' the two stack operations, by re-working the for-loop and utilising string.find() - prioritising whichever parenthesis is left-most/comes-first - assuming LtR text. (apologies if you have already tried this in one of your previous approaches) Unfortunately, such likely results in 'layers' of code, and a generator might well become the tool-of-choice (I say this before @Peter comes back and (quite deservedly) flays me alive!).


WebRefs:
Python Language Services: https://docs.python.org/3/library/language.html
collections — Container datatypes: https://docs.python.org/3/library/collections.html

See also your ComSc text/reference materials.
--
Regards =dn
--
https://mail.python.org/mailman/listinfo/python-list

Reply via email to