RFC[02] - nZyme C++ code segment

George Makrydakis Tue, 21 Mar 2006 14:00:09 -0800

Hi guys, sorry for top-posting. I know you are all busy with jhalfs; this is 
an interim update towards a more functional version of the C++ parsing codebase.

This is a more recent version of the project, upon which the conversion to OOP is to be based. It is becoming evident that the next stage of development will have two separate "structures" working together: the tokenizer and the element parser. So far, effort has been concentrated on finding the least amount of code necessary using the C++ STL string, without coding a "classical" FSM.
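
To make the tokenizer / element parser split more concrete, here is a rough sketch of how the two stages could be separated. It is not the posted code: the names tokenize, classify and TokenKind are invented for this example, it assumes C++11 (for enum class), and it works on one whole buffer instead of line by line, so an unterminated tag is treated as an error rather than carried over to the next line.

```cpp
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical stage 1: split a buffer into "<...>" markup tokens and the
// character data between them, rejecting spurious '<' and '>' characters.
std::vector<std::string> tokenize(const std::string& input)
{
    std::vector<std::string> tokens;
    std::string::size_type pos = 0;

    while (pos < input.size())
    {
        const std::string::size_type open = input.find('<', pos);
        const std::string::size_type close = input.find('>', pos);

        if (open == std::string::npos)
        {
            // No markup left: a '>' in the remaining text is spurious.
            if (close != std::string::npos)
                throw std::runtime_error("spurious > sign");
            tokens.push_back(input.substr(pos));
            break;
        }
        if (close == std::string::npos)
            throw std::runtime_error("unterminated markup");
        if (close < open)
            throw std::runtime_error("spurious > sign");

        if (open > pos)
            tokens.push_back(input.substr(pos, open - pos)); // text before the tag

        const std::string tag = input.substr(open, close - open + 1);
        if (tag.find('<', 1) != std::string::npos)
            throw std::runtime_error("spurious < sign");
        tokens.push_back(tag);
        pos = close + 1;
    }
    return tokens;
}

// Hypothetical stage 2: classify one token produced by tokenize().
enum class TokenKind { Text, Opening, Closing, Empty };

TokenKind classify(const std::string& token)
{
    if (token.empty() || token[0] != '<')
        return TokenKind::Text;
    if (token.compare(0, 2, "</") == 0)
        return TokenKind::Closing;
    if (token.size() >= 2 && token.compare(token.size() - 2, 2, "/>") == 0)
        return TokenKind::Empty;
    return TokenKind::Opening;
}
```

With this split, calling tokenize on "<a>text<b/></a>" gives four tokens, and classify reports them as opening, text, empty and closing in that order. Attribute handling would live in the second stage.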

The current version will raise a "fatality" if an element and/or attribute definition is incorrect. It will not check whether the element name is valid (for now; this is just another statement to add, which I deliberately avoided). The most important thing about this version is that it has error control for spurious < and > characters within the document, and that it is strict when parsing element names and attributes (whitespace and double quotes are respected, otherwise it raises an error saying the document is incorrect). This is from the March 17th branch; the current branch needs fixing before posting for evaluation purposes (it includes the entity data structures and the rest, and needs more testing). I think I should keep things up to date with where this is heading; that is the reason for this posting.
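
The strictness rule for attributes can be illustrated with a small standalone sketch. Again this is not the posted code: parseAttributes and Attribute are names made up for the example, and exceptions stand in for the "FATALITY" messages. It assumes the input is the part of a tag that follows the element name, with the closing > or /> already removed.

```cpp
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

typedef std::pair<std::string, std::string> Attribute; // (name, value)

// Hypothetical helper: parse name="value" pairs, rejecting anything that is
// not whitespace, a clean name, '=', and a double-quoted value.
std::vector<Attribute> parseAttributes(const std::string& raw)
{
    const std::string ws = " \t\n\r\v";
    std::vector<Attribute> result;
    std::string::size_type pos = raw.find_first_not_of(ws);

    while (pos != std::string::npos)
    {
        const std::string::size_type eq = raw.find('=', pos);
        if (eq == std::string::npos)
            throw std::runtime_error("attribute without '='");

        // Name: everything before '=', with trailing whitespace trimmed.
        std::string name = raw.substr(pos, eq - pos);
        name.erase(name.find_last_not_of(ws) + 1);
        if (name.empty() ||
            name.find_first_of(" \t\n\r\v\"'/<>();") != std::string::npos)
            throw std::runtime_error("malformed attribute name");

        // Only whitespace may sit between '=' and the opening double quote.
        const std::string::size_type open = raw.find_first_not_of(ws, eq + 1);
        if (open == std::string::npos || raw[open] != '"')
            throw std::runtime_error("attribute value must be double-quoted");

        const std::string::size_type close = raw.find('"', open + 1);
        if (close == std::string::npos)
            throw std::runtime_error("unterminated attribute value");

        result.push_back(Attribute(name, raw.substr(open + 1, close - open - 1)));
        pos = raw.find_first_not_of(ws, close + 1);
    }
    return result;
}
```

Under these assumptions, an input such as id="x" class = "a b" yields two pairs, while id=x (no quotes) or a b="1" (whitespace inside the name) is rejected.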

It is possible to trim everything down even more, so with the addition of the entity data structures and the XPointer features it should be able to provide one of the most compact solutions ever.


background info:

http://linuxfromscratch.org/pipermail/alfs-discuss/2006-March/007760.html


Note: CDATA sections are not supported for the time being (easy to add; the 
LFS book simply does not have them).
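
For reference, one way CDATA support might be added is to recognise the section before the < / > scanning starts and pass its content through verbatim. The helper below is only a sketch under that assumption; readCdata and its parameters are hypothetical and do not appear in the posted code.

```cpp
#include <string>

// Hypothetical helper: if a CDATA section starts at 'pos', copy its raw
// content into 'content', set 'next' to the index just past "]]>", and
// return true. Returns false if no complete section starts at 'pos'.
bool readCdata(const std::string& input, std::string::size_type pos,
               std::string& content, std::string::size_type& next)
{
    static const std::string opener = "<![CDATA[";
    static const std::string closer = "]]>";

    if (pos > input.size())
        return false;                    // compare() would throw otherwise
    if (input.compare(pos, opener.size(), opener) != 0)
        return false;                    // no CDATA section starts here

    const std::string::size_type start = pos + opener.size();
    const std::string::size_type end = input.find(closer, start);
    if (end == std::string::npos)
        return false;                    // unterminated section

    content = input.substr(start, end - start);
    next = end + closer.size();
    return true;
}
```

A tokenizer could call this whenever it meets a '<' and, on success, emit the content as a text token without checking it for spurious < or > characters.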

Thank you. Comments are welcome.

George Makrydakis

---------------------------------------------------------




        // nZyme "parsing" project
        // author: George Makrydakis gmakmail a/t gmail d0t com
        // license: BSD
        // release: revision A5 POC code base
        // scheduling: conversion to OOP class / template structures.
        // labdate:     March 17th 2006

        // WATCH THE LINES WHILE CUT/PASTE
        
        #include <cstdio>   // printf
        #include <iostream>
        #include <fstream>
        #include <string>
        #include <vector>

        using namespace std;

        int main (int argc, char **argv)
        {

                // despite some inherent inefficiencies of the C++ STL string
                // template, we will be using it for parsing XML
                // keeping in mind that we can actually create a more efficient
                // "string" than the STL itself.

                // version: revision A5 - error control implemented, any
                // incorrect syntax should trigger fatal events.
                // TODO: complete validation and incorporate <! full support
                // with the necessary data structure for entity dereferencing.
        
        vector<string> xmlVECT; // vector containing separate raw line segments from the XML file
        vector<string> xmlITEM; // vector containing formatted string tokens as out of the tokenizer

        string xmlBUFF; // holds a buffered string
        string dtdROOT; // holds the root element name
        string xmlLINE; // holds a single member of the xmlVECT vector
        string dtdBUFF; // holds a buffered string while processing DTD
        string xmlTOKN; // holds a formatted XML string token out of the xmlLINE string
        string xmlCOMM; // buffer holding the comment-free part of an XML string when comments are met
        string tryme;       // holds the element name being reported
        string tempobuffer; // holds the text between '=' and the opening quote of an attribute

        int lnct;               // holds a line counter variable
        int cTAG;               // within a given string, index of the first '>' character
        int oTAG;               // within a given string, index of the first '<' character
        int sTAG;               // within a given string, index of the first non-whitespace character
        int lnct2;              // holds a second line counter, used while scanning the DOCTYPE

        if (argc != 2)
        {
                // note the difference; not always working but 90% of the time, getting close to 100%!
                printf("Usage: %s [XML FILE]\n", argv[0]);
                return(-1);
        }
        ifstream xmlFILE(argv[1]);

        if ( xmlFILE.is_open() )
        {
                while (getline(xmlFILE,xmlBUFF,'\n'))
                {
                        xmlVECT.push_back(xmlBUFF.erase(0,xmlBUFF.find_first_not_of(" \t\n\r\v")));
                }
                xmlFILE.close();
                xmlBUFF.clear();
        }
        else
        {
                cout << "file not found!" << endl;
                return -1;
        }
        // two portions within one program justify some ahead-planning:
        // tokenizing != parsing, so you kind of get the idea how
        // to create the OOP structure correctly from the POC code
        for (lnct = 0; lnct < xmlVECT.size(); lnct++)
        {
                xmlLINE = xmlVECT.at(lnct);
                
                if (xmlLINE.find("<!DOCTYPE") != string::npos)
                {
                        xmlLINE = xmlLINE.substr(xmlLINE.find("<!DOCTYPE") + 9);
                        xmlLINE = xmlLINE.erase(0, xmlLINE.find_first_not_of(" \t\n\r\v"));
                        lnct2 = lnct;
                        while (dtdROOT.empty())
                        {
                                sTAG = xmlLINE.find_first_not_of(" \t\n\r\v");
                                if (sTAG != string::npos)
                                {
                                        dtdBUFF = xmlLINE.substr(sTAG, xmlLINE.find_first_of(" \t\n\r\v"));
                                        dtdBUFF = dtdBUFF.erase(0, dtdBUFF.find_first_not_of(" \t\n\r\v"));
                                        xmlLINE = xmlLINE.substr(sTAG + dtdBUFF.size(), xmlLINE.find_first_of(" \t\n\r\v"));
                                        if ((dtdBUFF.find_first_of("</[]\\'\"&;>:") == string::npos))
                                        {
                                                if (!(( dtdBUFF == "PUBLIC") || ( dtdBUFF == "SYSTEM")))
                                                {
                                                        while (xmlLINE.find("<" + dtdBUFF) == string::npos)
                                                        {
                                                                xmlLINE = xmlLINE + "  " + xmlVECT.at(lnct2);
                                                                if ( xmlLINE.find("<!--") != string::npos )
                                                                {
                                                                        xmlCOMM = xmlLINE.substr(0, xmlLINE.find("<!--"));
                                                                        while (xmlLINE.find("-->") == string::npos)
                                                                        {
                                                                                lnct2++;
                                                                                xmlLINE = xmlVECT.at(lnct2);
                                                                        }
                                                                        xmlLINE = xmlLINE.substr(xmlLINE.find("-->") + 3);
                                                                        xmlLINE= xmlCOMM + xmlLINE;
                                                                        xmlCOMM.clear();
                                                                }
                                                                lnct2++;
                                                        }
                                                        lnct = lnct2;
                                                        //xmlITEM.push_back("<!DOCTYPE " + xmlLINE.substr(0, xmlLINE.find("<" + dtdBUFF)));
                                                        xmlLINE = xmlLINE.substr(xmlLINE.find("<" + dtdBUFF)) + xmlVECT.at(lnct);
                                                        dtdROOT = dtdBUFF;
                                                }
                                                else if (( dtdBUFF == "PUBLIC") || ( dtdBUFF == "SYSTEM") || (dtdBUFF.find_first_of("</[]\\'\"&;>:") == string::npos))
                                                {
                                                        cout << "FATALITY: root element not declared within DOCTYPE statement!" << endl;
                                                        return 1;
                                                }
                                        }
                                }
                                lnct2++;
                                if (dtdROOT.empty()) {xmlLINE = xmlVECT.at(lnct2);}
                        }
                }
                        while (!xmlLINE.empty())
                        {
                                if (!xmlCOMM.empty()) {xmlLINE = xmlCOMM + xmlLINE; xmlCOMM.clear();}
                                if (!xmlBUFF.empty()) {xmlLINE = xmlBUFF + " " + xmlLINE; xmlBUFF.clear();}
                                if ( xmlLINE.find("<!--") != string::npos )
                                {
                                        xmlCOMM = xmlLINE.substr(0, xmlLINE.find("<!--"));
                                        while (xmlLINE.find("-->") == string::npos)
                                        {
                                                lnct++;
                                                xmlLINE = xmlVECT.at(lnct);
                                        }
                                        xmlLINE = xmlLINE.substr(xmlLINE.find("-->") + 3);
                                        xmlCOMM = xmlCOMM + xmlLINE;
                                        xmlLINE.clear();
                                }

                                cTAG = xmlLINE.find(">");
                                oTAG = xmlLINE.find("<");
                                if ((oTAG == string::npos) || (cTAG == string::npos))
                                {
                                        if ((oTAG == string::npos) && (cTAG == string::npos) && !xmlLINE.empty())
                                        {
                                                xmlITEM.push_back(xmlLINE);
                                                xmlLINE.clear();
                                                break;
                                        }
                                        else if ((oTAG != string::npos) && (cTAG == string::npos))
                                        {
                                                xmlBUFF = xmlLINE.substr(oTAG);
                                                xmlITEM.push_back(xmlLINE.substr(0, oTAG));
                                                xmlLINE.clear();
                                                break;
                                        }
                                        else if ((oTAG == string::npos) && (cTAG != string::npos))
                                        {
                                                cout << "FATALITY: a spurious > sign has been found!" << endl;
                                                return (-1);
                                        }
                                }
                                else
                                {       
                                        if ((cTAG - oTAG) > 0)
                                        {
                                                xmlTOKN = xmlLINE.substr(0, oTAG);
                                                if (!xmlTOKN.empty()){xmlITEM.push_back(xmlTOKN);}
                                                xmlTOKN = xmlLINE.substr(oTAG, cTAG + 1 - oTAG);
                                                if (!(xmlTOKN.find("<") == xmlTOKN.find_last_of("<")))
                                                {
                                                        cout << "FATALITY: A spurious < sign has been found!" << endl;
                                                        return(-1);
                                                }
                                                if (!xmlTOKN.empty()){xmlITEM.push_back(xmlTOKN);}
                                                xmlLINE = xmlLINE.substr(cTAG + 1);
                                        }
                                        else
                                        {
                                                cout << "FATALITY: a spurious > sign has been found!" << endl;
                                                return (-1);
                                        }
                                }
                        }
        }
        // this section will be a separate structure...
        for (lnct = 0; lnct < xmlITEM.size(); lnct++)
        {
                        xmlLINE = xmlITEM.at(lnct);
                        if (xmlLINE.find("<") != string::npos)
                        {
                                // ok we have a semantically important structure, we now need to classify it
                                //
                                // 1. element closure
                                // 2. element without attributes
                                // 3. element with attributes
                                // 4. element EMPTY, no attributes
                                // 5. element EMPTY, with attributes

                                // if no whitespaces are found, then either open / close / empty, no attributes
                                if (xmlLINE.find(" ") == string::npos)
                                {
                                        if (xmlLINE.find("</") != string::npos)
                                        {
                                                tryme = xmlLINE.substr(2, xmlLINE.size() - 3);
                                                cout << "CLOSING ELEMENT:" + tryme << endl;
                                        }
                                        else if (xmlLINE.find("/>") != string::npos)
                                        {
                                                tryme = xmlLINE.substr(1, xmlLINE.size() - 3);
                                                cout << "EMPTY ELEMENT:" + tryme << endl;
                                        }
                                        else
                                        {
                                                tryme = xmlLINE.substr(1, xmlLINE.size() - 2);
                                                cout << "OPENING ELEMENT:" + tryme << endl;
                                        }
                                }
                                else // whitespaces are found, so we have attributes contained!
                                {
                                        if (xmlLINE.find("/>") != string::npos)
                                        {
                                                // substr takes a length, so stop one short of the first space to get the bare name
                                                tryme = xmlLINE.substr(1, xmlLINE.find(" ") - 1);
                                                cout << "OPENING EMPTY ELEMENT WITH ATTRIBUTES:" + tryme << endl;
                                                xmlLINE = xmlLINE.substr(0, xmlLINE.find("/>"));
                                        }
                                        else
                                        {
                                                // substr takes a length, so stop one short of the first space to get the bare name
                                                tryme = xmlLINE.substr(1, xmlLINE.find(" ") - 1);
                                                cout << "OPENING ELEMENT WITH ATTRIBUTES:" + tryme << endl;
                                                xmlLINE = xmlLINE.substr(0, xmlLINE.find(">"));
                                        }
                                        string rawseq;
                                        rawseq = xmlLINE.substr(xmlLINE.find(tryme) + tryme.size());
                                        int startQUOTE;
                                        int stopsQUOTE;
                                        string attributeNAME;
                                        string attributeVALUE;
                                        while (!rawseq.empty())
                                        {
                                                rawseq = rawseq.erase(0, rawseq.find_first_not_of(" \t\n\r\v"));
                                                startQUOTE = rawseq.find("\"");
                                                if (startQUOTE != string::npos)
                                                {
                                                        stopsQUOTE = rawseq.find("\"", startQUOTE + 1);
                                                        if (stopsQUOTE != string::npos)
                                                        {
                                                                attributeNAME = rawseq.substr(0, startQUOTE);
                                                                // lets "validate" the name shall we...
                                                                // find the = character
                                                                int attrpos = attributeNAME.find("=");
                                                                if ((attrpos != string::npos) && (attrpos == attributeNAME.find_last_of("=")))
                                                                {
                                                                        
                                                                        attributeNAME = attributeNAME.substr(0, attrpos);
                                                                        // now do a preventive trim of trailing whitespace...
                                                                        attributeNAME.erase(attributeNAME.find_last_not_of(" \t") + 1);
                                                                        // '/' needs no backslash in a string literal ("\/" is not a valid escape)
                                                                        if (attributeNAME.find_first_of(" \t/\'();") != string::npos)
                                                                        {
                                                                                cout << "FATALITY: irregularities met during element parsing" << endl;
                                                                                return(-1);
                                                                        }
                                                                        tempobuffer = rawseq.substr(attrpos + 1, startQUOTE - attrpos - 1 );
                                                                        if (tempobuffer.find_first_not_of(" \t") != string::npos)
                                                                        {
                                                                                cout << "FATALITY: irregularities met during element parsing" << endl;
                                                                                return(-1);
                                                                        }
                                                                        attributeVALUE = rawseq.substr(startQUOTE + 1, stopsQUOTE - startQUOTE - 1);
                                                                        rawseq = rawseq.substr(stopsQUOTE + 1);
                                                                        cout << "\tNAME:" + attributeNAME + "\t VALUE:" + attributeVALUE << endl;
                                                                }
                                                                else
                                                                {
                                                                        cout << "FATALITY: irregularities met during element parsing" << endl;
                                                                        return(-1);
                                                                }
                                                        }
                                                        else
                                                        {
                                                                cout << "FATALITY: irregularities met during element parsing" << endl;
                                                                return(-1);
                                                        }
                                                }
                                                else
                                                {
                                                        cout << "FATALITY: irregularities met during element parsing" << endl;
                                                        return(-1);
                                                }
                                                // <? ?> command - stuff has not been corrected yet, in process.
                                                if (rawseq == "?") rawseq.clear();
                                                // the above part of the code is put simply to avoid the subtlety for now...
                                                //loop ends
                                        }
                                        
                                        
                                }
                        }
        }

        xmlVECT.clear();
        xmlITEM.clear();
        return 0;
        }

        

--
http://linuxfromscratch.org/mailman/listinfo/alfs-discuss
FAQ: http://www.linuxfromscratch.org/faq/
Unsubscribe: See the above information page
