Re: File Chopping Algorithm

John H. Robinson, IV Tue, 12 Dec 2006 16:03:53 -0800

Todd Walton wrote:
> 
> What I can assume about these files is that each will have three
> pre-defined blocks of text, enclosed by HTML style tags.  The tags are
> on their own line.  There may or may not be text outside of these
> three blocks.  There may or may not be blank lines between the blocks.
> The blocks may or may not be in a given order.  Etc.


If you know that the tags are of the <::LABEL::> </::LABEL::> variety,
where ::LABEL:: is a unique string that matches from open to close, then
this is easy, assuming, of course, that no other tags exist. If you know
the ::LABEL::s in advance, it is easier still.

If you know that the three-section delimiters are the only sgml/html
style tags, you are okay. If there could be others, then there could
very easily be problems.
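
If the labels are known in advance, one way to sidestep stray tags is to
react only to tag lines that name a known label. A rough, untested-in-anger
Python sketch of that idea follows; the function name chop_known and the
labels Sec2 and Sec3 are made up for illustration, and it still assumes
tags sit on their own line and never nest.

```python
def chop_known(lines, labels):
    """Collect the body of each block whose label is in `labels`.

    Assumes tags sit alone on a line and blocks do not nest.
    """
    blocks = {}
    current = None  # label of the block we are inside, if any
    for raw in lines:
        line = raw.rstrip("\r\n")
        if current is None:
            # Only an exact opening tag for a known label starts a block;
            # any other tag or text outside a block is ignored.
            if line.startswith("<") and line.endswith(">") and line[1:-1] in labels:
                current = line[1:-1]
                blocks[current] = []
        elif line == f"</{current}>":
            current = None
        else:
            # Inside a block every line is kept, unknown tags included.
            blocks[current].append(line)
    return blocks


if __name__ == "__main__":
    sample = ["<b>", "<Sec2>", "two", "</Sec2>", "</b>", "<Sec1>", "one", "</Sec1>"]
    print(chop_known(sample, {"Sec1", "Sec2", "Sec3"}))
```

Unknown tags outside a block are skipped, and unknown tags inside a block
are kept as ordinary text.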

In HTML, tags can nest, so you might see something like:

<Sec1>
This is section one. In it, we have text like
<Sec1>
to indicate to the end user what the tag looks like, so when the section
is over, we would enter the closing tag on a line by itself:
</Sec1>

Easy, isn't it?
</Sec1>

Okay, contrived, but it gets the point across. If you can make a
guarantee that you will never see that, then it's rather simple:

set SECTION to 1; no LABEL is known yet
for each line do:
  is LABEL known?
    yes: does line match the regular expression ^</LABEL>$ ?
      yes: forget LABEL
           increment SECTION
      no: write line to SECTION
    no:
      does line match the regular expression ^<([^/>][^>]*)>$ ?
        no: skip line (text or blank line outside a block)
        yes: remember the bit inside the <> for later. call it LABEL
end.


I did not desk check this algorithm; flavour to suit your intended
language. It will break if the assumptions I made above are false.
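
For what it's worth, here is roughly how the loop might look in Python.
Treat it as a sketch under the same assumptions (tags alone on a line, no
nesting, no other tags); the names chop and OPEN_TAG are invented for the
example. It returns the sections in a list instead of writing files, and it
skips lines that fall outside a block rather than stopping, since Todd's
files may have text and blank lines between blocks.

```python
import re

# An opening tag alone on a line; the first character after "<" may not be
# "/", so a stray closing tag is never mistaken for an opening one.
OPEN_TAG = re.compile(r"^<([^/>][^>]*)>$")


def chop(lines):
    """Return a list of (label, body_lines) pairs, one per tagged block.

    Assumes tags sit alone on a line and blocks do not nest.
    """
    sections = []
    label = None  # LABEL in the pseudocode; None means "not in a block"
    body = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if label is not None:
            if line == f"</{label}>":
                # Closing tag: store the finished section, forget LABEL.
                sections.append((label, body))
                label, body = None, []
            else:
                body.append(line)
        else:
            match = OPEN_TAG.match(line)
            if match:
                label = match.group(1)
            # Any other line is outside a block and is skipped.
    # A block still open at end of input is dropped, not returned.
    return sections


if __name__ == "__main__":
    text = ["junk", "<Sec1>", "one", "</Sec1>", "", "<Sec2>", "two", "</Sec2>"]
    for name, body in chop(text):
        print(name, body)
```

Fed the contrived nested example above, it closes the section at the first
</Sec1> and silently drops the rest, which is exactly the failure that
example warns about.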

-john


