DO NOT REPLY TO THIS EMAIL, BUT PLEASE POST YOUR BUG-RELATED
COMMENTS THROUGH THE WEB INTERFACE AVAILABLE AT
<http://issues.apache.org/bugzilla/show_bug.cgi?id=37382>.
ANY REPLY MADE TO THIS MESSAGE WILL NOT BE COLLECTED AND
INSERTED IN THE BUG DATABASE.

http://issues.apache.org/bugzilla/show_bug.cgi?id=37382

           Summary: stack over flow while using a Regex
           Product: ORO
           Version: 2.0.7
          Platform: Other
        OS/Version: other
            Status: NEW
          Severity: normal
          Priority: P2
         Component: Main
        AssignedTo: [email protected]
        ReportedBy: [EMAIL PROTECTED]
                CC: [EMAIL PROTECTED]


Hi,

I am using the ORO regex API, version 2.0.7, and my objective is to extract
tagged data from HTML source. For example, I want to get the source code of
every form found in an HTML page. So I wrote my regex like this:

Regex formReg = new Regex("(?i)(<form(.|\\s)*?>(.|\\s)*?</form>)");

because the following one did not work:

Regex formReg = new Regex("(?i)(<form.*?>.*?</form>)");

The . matches any character except a newline, so this pattern fails on any
form that spans more than one line.
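
To illustrate the point about . and newlines, here is a minimal sketch. It
uses java.util.regex from the JDK rather than ORO, purely so that it runs
without extra jars; the class name DotNewlineDemo, the helper firstMatch and
the sample HTML are made up for this illustration. Pattern.DOTALL is the JDK
flag that lets . match line breaks. ORO appears to offer
Perl5Compiler.SINGLELINE_MASK for the same purpose, but that is an assumption
I have not verified against 2.0.7.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class DotNewlineDemo {
    // Returns the first match of regex in input, or null if there is none.
    static String firstMatch(String regex, int flags, String input) {
        Matcher m = Pattern.compile(regex, flags).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        // A small form whose opening tag and body both contain line breaks.
        String html = "<form name=\"a\"\nmethod=\"post\">\n<input>\n</form>";
        String regex = "<form.*?>.*?</form>";

        // Without DOTALL, '.' cannot cross the line break, so there is no match.
        System.out.println(firstMatch(regex, Pattern.CASE_INSENSITIVE, html));

        // With DOTALL, '.' also matches line breaks and the whole form is found.
        System.out.println(
                firstMatch(regex, Pattern.CASE_INSENSITIVE | Pattern.DOTALL, html));
    }
}
```

With a flag like this, the (.|\s) alternation is no longer needed just to get
across line breaks.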

The first regex worked well, and I was able to get the complete form source,
from <form ...> to </form>.

BUT

when the form was big, for example around 400 lines and 30K bytes, the match
failed with a StackOverflowError. The program output and the stack trace are
pasted below:

Matched <form name="param" action="http://www/parametric/ProductParametric"; 
method="post">
<input name="sterm" type="hidden">
</form>
matcher.getMatch().endOffset(1) 4480
Matched <form name="cross" action="http://www/crossref/search.jsp"; 
method="post">
<input name="partNumber" type="hidden">
</form>
matcher.getMatch().endOffset(1) 127
java.lang.StackOverflowError
        at org.apache.oro.text.regex.Perl5Matcher.__match(Unknown Source)
        at org.apache.oro.text.regex.Perl5Matcher.__match(Unknown Source)
        at org.apache.oro.text.regex.Perl5Matcher.__match(Unknown Source)
        at org.apache.oro.text.regex.Perl5Matcher.__match(Unknown Source)
        at org.apache.oro.text.regex.Perl5Matcher.__match(Unknown Source)
        at org.apache.oro.text.regex.Perl5Matcher.__match(Unknown Source)
        at org.apache.oro.text.regex.Perl5Matcher.__match(Unknown Source)
        at org.apache.oro.text.regex.Perl5Matcher.__match(Unknown Source)
        at org.apache.oro.text.regex.Perl5Matcher.__match(Unknown Source)
        at org.apache.oro.text.regex.Perl5Matcher.__match(Unknown Source)
        at org.apache.oro.text.regex.Perl5Matcher.__match(Unknown Source)
        at org.apache.oro.text.regex.Perl5Matcher.__match(Unknown Source)
        at org.apache.oro.text.regex.Perl5Matcher.__match(Unknown Source)
        at org.apache.oro.text.regex.Perl5Matcher.__match(Unknown Source)
        at org.apache.oro.text.regex.Perl5Matcher.__match(Unknown Source)
        at org.apache.oro.text.regex.Perl5Matcher.__match(Unknown Source)
        at org.apache.oro.text.regex.Perl5Matcher.__match(Unknown Source)
        at org.apache.oro.text.regex.Perl5Matcher.__match(Unknown Source)
        at org.apache.oro.text.regex.Perl5Matcher.__match(Unknown Source)
        at org.apache.oro.text.regex.Perl5Matcher.__match(Unknown Source)
        at org.apache.oro.text.regex.Perl5Matcher.__match(Unknown Source)
        at org.apache.oro.text.regex.Perl5Matcher.__match(Unknown Source)
        at org.apache.oro.text.regex.Perl5Matcher.__match(Unknown Source)
        at org.apache.oro.text.regex.Perl5Matcher.__match(Unknown Source)
        at org.apache.oro.text.regex.Perl5Matcher.__match(Unknown Source)
        at org.apache.oro.text.regex.Perl5Matcher.__match(Unknown Source)
        at org.apache.oro.text.regex.Perl5Matcher.__match(Unknown Source)
        at org.apache.oro.text.regex.Perl5Matcher.__match(Unknown Source)
        at org.apache.oro.text.regex.Perl5Matcher.__match(Unknown Source)
        at org.apache.oro.text.regex.Perl5Matcher.__match(Unknown Source)
        at org.apache.oro.text.regex.Perl5Matcher.__match(Unknown Source)
        at org.apache.oro.text.regex.Perl5Matcher.__match(Unknown Source)
        at org.apache.oro.text.regex.Perl5Matcher.__match(Unknown Source)
        at org.apache.oro.text.regex.Perl5Matcher.__match(Unknown Source)
        at org.apache.oro.text.regex.Perl5Matcher.__match(Unknown Source)
        at org.apache.oro.text.regex.Perl5Matcher.__match(Unknown Source)
        at org.apache.oro.text.regex.Perl5Matcher.__match(Unknown Source)
        at org.apache.oro.text.regex.Perl5Matcher.__match(Unknown Source)
        at org.apache.oro.text.regex.Perl5Matcher.__match(Unknown Source)
        at org.apache.oro.text.regex.Perl5Matcher.__match(Unknown Source)
        at org.apache.oro.text.regex.Perl5Matcher.__match(Unknown Source)


I am also pasting the method I wrote for the extraction; it can be called
directly from a main method and run:

----------------------------------------------------------------------------

import org.apache.oro.text.regex.*;

public static void testRegOro() {
        try {
                String html = IoUtils.readFile("file.txt");
                // String html = "all work and no play makes jack a dull boy";
                Perl5Compiler compiler = new Perl5Compiler();
                Perl5Pattern pattern = (Perl5Pattern) compiler.compile(
                        "(<form(.|\\s)*?>(.|\\s)*?</form>)",
                        Perl5Compiler.CASE_INSENSITIVE_MASK | Perl5Compiler.READ_ONLY_MASK);
                PatternMatcher matcher = new Perl5Matcher();
                int i = 0;
                // Report at most three forms, searching the remaining text each time.
                while (matcher.contains(html, pattern) && i++ < 3) {
                        System.out.println("Matched " + matcher.getMatch().group(1));
                        System.out.println("matcher.getMatch().endOffset(1) "
                                + matcher.getMatch().endOffset(1));
                        // Drop everything up to the end of this match before searching again.
                        html = html.substring(matcher.getMatch().endOffset(1));
                        //System.out.println("html " + html);
                }
        } catch (Throwable e) {
                // Throwable rather than Exception, so the StackOverflowError is printed too.
                e.printStackTrace();
        }
}

------------------------------------------------------------------------------
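
For comparison, here is a minimal sketch of the same extraction loop written
against java.util.regex from the JDK instead of ORO, so that it runs without
extra jars. It relies on Pattern.DOTALL instead of the (.|\s) alternation and
walks the input with Matcher.find() rather than calling substring after each
match. The class name FormExtractor, the helpers extractForms and bigForm, and
the generated test form are all hypothetical; the real file.txt is not
reproduced here. Whether ORO's matcher behaves the same way on an equivalent
pattern is an assumption, not something this sketch shows.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FormExtractor {
    // DOTALL lets '.' match line breaks, so no (.|\s) alternation is needed.
    private static final Pattern FORM = Pattern.compile(
            "<form.*?>.*?</form>", Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Collects every <form>...</form> block in document order.
    static List<String> extractForms(String html) {
        List<String> forms = new ArrayList<>();
        Matcher m = FORM.matcher(html);
        while (m.find()) {          // find() resumes after the previous match
            forms.add(m.group());
        }
        return forms;
    }

    // Builds a synthetic multi-line form (made-up test data, one input per line).
    static String bigForm(int lines) {
        StringBuilder sb = new StringBuilder("<FORM name=\"big\" method=\"post\">\n");
        for (int i = 0; i < lines; i++) {
            sb.append("<input name=\"field").append(i)
              .append("\" type=\"hidden\" value=\"0123456789\">\n");
        }
        return sb.append("</FORM>").toString();
    }

    public static void main(String[] args) {
        // One large form of 400 lines followed by one small form.
        String html = "<p>intro</p>\n" + bigForm(400)
                + "\n<form name=\"small\">\n</form>\n";
        List<String> forms = extractForms(html);
        System.out.println("forms found: " + forms.size());
        for (String form : forms) {
            System.out.println("form length: " + form.length());
        }
    }
}
```

The pattern is compiled once and reused, and no intermediate strings are
created between matches.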

As the code shows, I am reading a file named file.txt; I am attaching that
file to this bug as well.

I would really appreciate it if you could look into this and shed some light
on what is going wrong, and on whether it can be improved.

Thanks in Advance!
Regards,
Pushpesh Kr. Rajwanshi

-- 
Configure bugmail: http://issues.apache.org/bugzilla/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.

---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
