I am afraid my example was not clear at all! Consider a similar example to the one from the documentation: (1234|3789)ABC If the input is given in two buffers: "ABC123" and "789ABC", pcre_dfa_exec() will return PCRE_PARTIAL for the first buffer and no match for the second even if PCRE_DFA_RESTART is used. This is somehow problematic, since (in run time) the caller gets no indication that a match was missed - "3789ABC".
The suggestion was to return PCRE_PARTIAL_ALTERNATIVE in this case. This will allow the caller (when the second buffer arrives) to construct a temp buffer "123789" (note that in this point the caller knows how much to copy from the end of the first buffer and how much to copy from the second buffer since the length of the alternative part is also known). pcre_dfa_exec() will return PCRE_PARTIAL when it is fed with the temp buffer. The caller can then continue to scan the second buffer from right place (index=3 after the part that was copied to the temp buffer) and the PCRE_DFA_RESTART option. In such a case a match will be returned. If for the same regular expression the input is given in the following buffers: "ABC123789" and "ABC" - i.e. the first buffer ends after the alternative part of the re, then pcre_dfa_exec() should return PCRE_PARTIAL and the caller can continue to the second buffer using PCRE_DFA_RESTART and find the match. The idea is to give an indication to the caller of pcre_dfa_exec() that matches might have been missed and give her a chance to "fix" it. Without such an indication, the Multi Segment Matching is useless (at least when the re contains alternatives) for applications that must not lose matches. Thanks, E. On Sat, Oct 17, 2009 at 6:44 PM, Philip Hazel <[email protected]> wrote: > On Fri, 16 Oct 2009, Eldar Kleiner wrote: > > > For example, suppose the subject is "xxxxmmmmaaammmxxxx" where xxxx > > does not match the re, mmm matches the non alternative part and aaa > > matches the alternative part of the re. If the subject is given in two > > buffers: "xxxxmmm" and "maaammmxxxx" then PCRE_PARTIAL will be > > returned for the first buffer and the caller can continue to the > > second buffer using the PCRE_DFA_RESTART options. Alternatively, if > > the subject is given in the following two buffers: "xxxxmmmma" and > > "aammmxxxx", the caller will get the PCRE_PARTIAL_ALTERNATIVE return > > code. In such a case the caller can construct a temp buffer "mmmmaaam" > > (by copying the end of the first buffer and the start of the second) > > as the longest part of the alternative part of the re is known. For > > the temp buffer PCRE_PARTIAL will be returned and then the caller can > > continue scanning the second buffer from the right offset (3 in our > > example) and find the match. This will limit the amount of copying in > > envs where the techniques you demonstrated is not possible. > > I am sorry, I'm not sure what you mean. Do you perhaps mean "optional" > when you say "alternative"? Otherwise I don't understand why matching > "xxxxmmmma" should give a partial return at all - shouldn't "mmmm" give > a complete match? Perhaps you can give me a real example. > > If you do mean "optional", let me try to understand more. Suppose we > consider the pattern "abcd*". If the first buffer is "xxxabcd" and the > second buffer is "ddddxxxx" then, with PCRE 7.9 you will get a complete > match on the first buffer, and miss out on the longest match. Is this > the problem? > > If so, the problem is supposed to be solved in 8.00 by the addition of > the PCRE_PARTIAL_HARD option, which prefers a partial match over a > complete match. However, you have found a bug! In the -RC1 code, neither > pcre_exec() nor pcre_dfa_exec() returns PCRE_PARTIAL for the pattern > "abcd*" when applied to "xxxabcd". I have spotted the error in the code > in pcre_exec() and will then try to find out what's wrong in > pcre_dfa_exec(). > > Philip > > -- > Philip Hazel > -- ## List details at http://lists.exim.org/mailman/listinfo/pcre-dev
