Re: [Development] QRegularExpression -- first round of API review

2012-01-17 Thread Thiago Macieira
On Tuesday, 17 de January de 2012 02.43.33, Giuseppe D'Angelo wrote:
 3) Need a couple of better names for partial matching:
 - partial match preferring a complete match
 - partial match returning whatever match (partial or complete) is found
 first

 PCRE uses Soft and Hard respectively (which are quite meaningless to me).

You have a method with two option flags:
   QRegularExpressionMatch match(const QString subject,
  int offset= 0,
  MatchType matchType   = NormalMatch,
  MatchOptions matchOptions = NoMatchOption)
const;

Why can't you fold the partial match into the matchOptions as
AllowPartialMatchOption ?


--
Thiago Macieira - thiago.macieira (AT) intel.com
  Software Architect - Intel Open Source Technology Center
 Intel Sweden AB - Registration Number: 556189-6027
 Knarrarnäsgatan 15, 164 40 Kista, Stockholm, Sweden


signature.asc
Description: This is a digitally signed message part.
___
Development mailing list
Development@qt-project.org
http://lists.qt-project.org/mailman/listinfo/development


Re: [Development] QRegularExpression -- first round of API review

2012-01-17 Thread Thiago Macieira
On Tuesday, 17 de January de 2012 02.43.33, Giuseppe D'Angelo wrote:
 2) Support for named capturing groups with duplicated names. The point
 is that this may require a couple of additional accessors inside
 QRegularExpressionMatch to extract all the substrings captured by a
 given name.
 The long story is that with default options, having a name occurring
 in different named capturing groups is illegal, unless either
 - the proper pattern option is set;
 - the named capturing groups with the same name have the same number too*;
 - the (?J) option appears in the pattern string enabling duplicated names.
 * (a named capturing group still follows the ordinary, progressive
 numbering; a branch group (?|...) that resets the number in each
 branch allows to have capturing groups with the same name and
 number).

 Opinions for this feature?

Keep it disabled for the moment. Having the same name in more than one group
would be undefined behaviour until we define it. You can even write that in the
documentation.

--
Thiago Macieira - thiago.macieira (AT) intel.com
  Software Architect - Intel Open Source Technology Center
 Intel Sweden AB - Registration Number: 556189-6027
 Knarrarnäsgatan 15, 164 40 Kista, Stockholm, Sweden


signature.asc
Description: This is a digitally signed message part.
___
Development mailing list
Development@qt-project.org
http://lists.qt-project.org/mailman/listinfo/development


Re: [Development] QRegularExpression -- first round of API review

2012-01-17 Thread Thiago Macieira
On Tuesday, 17 de January de 2012 02.43.33, Giuseppe D'Angelo wrote:
 1) Support for more pattern options and match options. Right now you
 can see them commented in the .h (and there's no code for them), but
 adding them is trivial. Any opinions for just enabling them or only
 some? In particular, InvertedGreedinessOption for giving a (more or
 less) direct replacement of QRegExp::setMinimal, and
 AllowDuplicatedNamesOption (see the next point) for duplicated names
 in named capturing groups.

Call it NonGreedyOption and I think that should be enabled.

UseUnicodePropertiesOption: can we reduce the size of the library in certain
scenarios if we don't have this option?

DontCaptureOption: is this a memory saving mechanism? What else does it do,
besides making the capture groups return empty or invalid? If it's just a
memory saving option, I'd say let it in.

MatchFirstLineOnlyOption and DollarMatchesOnlyAtEndOption: I can't tell from
the name only. How do they interact with the MultilineOption? My guess is to
leave them out for now. We can always introduce them later.
--
Thiago Macieira - thiago.macieira (AT) intel.com
  Software Architect - Intel Open Source Technology Center
 Intel Sweden AB - Registration Number: 556189-6027
 Knarrarnäsgatan 15, 164 40 Kista, Stockholm, Sweden


signature.asc
Description: This is a digitally signed message part.
___
Development mailing list
Development@qt-project.org
http://lists.qt-project.org/mailman/listinfo/development


Re: [Development] QRegularExpression -- first round of API review

2012-01-17 Thread Charley Bay

  3) Need a couple of better names for partial matching:
  - partial match preferring a complete match
  - partial match returning whatever match (partial or complete) is found
  first
 
  PCRE uses Soft and Hard respectively (which are quite meaningless to
  me).

 PreferCompleteMatch / PreferFirstMatch?


I so very much prefer generic-to-specific in names for better
auto-completion and grouping of like-minded, like:

  - PreferMatchComplete
  - PreferMatchFirst

--charley
___
Development mailing list
Development@qt-project.org
http://lists.qt-project.org/mailman/listinfo/development


Re: [Development] QRegularExpression -- first round of API review

2012-01-17 Thread Giuseppe D'Angelo
2012/1/17 Thiago Macieira thiago.macie...@intel.com:
 On Tuesday, 17 de January de 2012 02.43.33, Giuseppe D'Angelo wrote:
 1) Support for more pattern options and match options. Right now you
 can see them commented in the .h (and there's no code for them), but
 adding them is trivial. Any opinions for just enabling them or only
 some? In particular, InvertedGreedinessOption for giving a (more or
 less) direct replacement of QRegExp::setMinimal, and
 AllowDuplicatedNamesOption (see the next point) for duplicated names
 in named capturing groups.

 Call it NonGreedyOption and I think that should be enabled.

The name InvertedGreedinessOption came from the fact that it inverts
the greediness of the quantifiers: * becomes ungreedy, *? becomes
greedy (the same applies for all other quantifiers). I'll add it
though.

 UseUnicodePropertiesOption: can we reduce the size of the library in certain
 scenarios if we don't have this option?

I don't think so, I think we still need PCRE to have its Unicode
tables for having case insensitive matches.

 DontCaptureOption: is this a memory saving mechanism? What else does it do,
 besides making the capture groups return empty or invalid? If it's just a
 memory saving option, I'd say let it in.

It simply disables the non-named capturing groups, so they simply
group but not capture substrings any more. Named capturing groups
still work, though. I see it as both an optimizer and syntactic sugar,
avoiding you to write lots of (?:...) groups if you don't really need
to capture anything but only group.

 MatchFirstLineOnlyOption and DollarMatchesOnlyAtEndOption: I can't tell from
 the name only. How do they interact with the MultilineOption? My guess is to
 leave them out for now. We can always introduce them later.

Great.

Thanks,
-- 
Giuseppe D'Angelo
___
Development mailing list
Development@qt-project.org
http://lists.qt-project.org/mailman/listinfo/development


Re: [Development] QRegularExpression -- first round of API review

2012-01-17 Thread Thiago Macieira
On Tuesday, 17 de January de 2012 15.15.15, Giuseppe D'Angelo wrote:
 2012/1/17 Thiago Macieira thiago.macie...@intel.com:
  On Tuesday, 17 de January de 2012 02.43.33, Giuseppe D'Angelo wrote:
  1) Support for more pattern options and match options. Right now you
  can see them commented in the .h (and there's no code for them), but
  adding them is trivial. Any opinions for just enabling them or only
  some? In particular, InvertedGreedinessOption for giving a (more or
  less) direct replacement of QRegExp::setMinimal, and
  AllowDuplicatedNamesOption (see the next point) for duplicated names
  in named capturing groups.
 
  Call it NonGreedyOption and I think that should be enabled.

 The name InvertedGreedinessOption came from the fact that it inverts
 the greediness of the quantifiers: * becomes ungreedy, *? becomes
 greedy (the same applies for all other quantifiers). I'll add it
 though.

Keep it as InvertedGreedinessOption then. I thought it only affected * by
making it be like *?.

  UseUnicodePropertiesOption: can we reduce the size of the library in
  certain scenarios if we don't have this option?

 I don't think so, I think we still need PCRE to have its Unicode
 tables for having case insensitive matches.

If there's no gain, then we should let it in.

What does this option cause the engine to do?

  DontCaptureOption: is this a memory saving mechanism? What else does it
  do,
  besides making the capture groups return empty or invalid? If it's just a
  memory saving option, I'd say let it in.

 It simply disables the non-named capturing groups, so they simply
 group but not capture substrings any more. Named capturing groups
 still work, though. I see it as both an optimizer and syntactic sugar,
 avoiding you to write lots of (?:...) groups if you don't really need
 to capture anything but only group.

But what's the benefit? I've done a lot of () grouping without the ?: sugar
to make it not capture. Does it make the matching happen faster? I doubt it.
Does it give a memory benefit by not capturing?

  MatchFirstLineOnlyOption and DollarMatchesOnlyAtEndOption: I can't tell
  from the name only. How do they interact with the MultilineOption? My
  guess is to leave them out for now. We can always introduce them later.

 Great.

But how do they interact with the MultilineOption? Maybe that will affect the
option that is there.

--
Thiago Macieira - thiago.macieira (AT) intel.com
  Software Architect - Intel Open Source Technology Center
 Intel Sweden AB - Registration Number: 556189-6027
 Knarrarnäsgatan 15, 164 40 Kista, Stockholm, Sweden


signature.asc
Description: This is a digitally signed message part.
___
Development mailing list
Development@qt-project.org
http://lists.qt-project.org/mailman/listinfo/development


Re: [Development] QRegularExpression -- first round of API review

2012-01-11 Thread Giuseppe D'Angelo
Up?
Nobody wants to discuss this?

Cheers,
--
Giuseppe D'Angelo
___
Development mailing list
Development@qt-project.org
http://lists.qt-project.org/mailman/listinfo/development


Re: [Development] QRegularExpression -- first round of API review

2012-01-11 Thread Thiago Macieira
On Wednesday, 11 de January de 2012 12.49.27, Giuseppe D'Angelo wrote:
 Up?
 Nobody wants to discuss this?

It's above the bikeshed threshold :-)

First, you'll need to get a Nokian to import the PCRE sources. You cannot
submit them to Gerrit (not even to the commit you made) because that violates
the CLA. You're not the author. Please don't submit the PCRE code again --
just pretend it's there.

As for me, sorry I didn't review. It fell through the cracks of weekend is
coming.

The API looks now a lot more digestible. There are still a few methods that I
will need the documentation for, as I can't guess what they are from their
name (subject is probably an RE term that I don't know). The API around the
captured texts may need a few more rounds of discussion. The name cap
appeared in Qt 3 and if we're not able to keep source compatibility with Qt 4
anyway, maybe it's time to fix it too. The iterating methods, which are the
cool thing about this API, seem to be lost. I don't see how to get the
contents of that match.

Specific questions:

 *   QRegularExpressionMatch::captureCount returns actually the highest
 index of a capturing group that matched something. Ideas?
 (lastCapturedIndex?)

It seems that they are the same thing. captureCount looks fine if the other
methods also have capture in the name. Does this return the number of named
captures too? E.g. imagine I have two named captures in my RE and nothing
else. If they match, will that return 2?

If my RE has a capture that is optional and fails to match, how do I find out?
Imagine:

rx = /(foo)?(bar)/
rx =~ bar

In this case, the first capture failed to match anything. How do I know that in
the API?

 *   What should QRegularExpressionMatch::subjectOffset return when one
 advances a match (f.i. by using
 QRegularExpressionMatch::operator++)? The offset at which
 logically the match is re-attempted (which is the ending position
 of the current match + 1) or the one at which it is REALLY
 attempted, which could be one or two charaters ahead, if the old
 match matched an empty string? (Cf. the discussion in my last mail
 about attempting /g matches against patterns that can match an empty
 string)

I don't get the question because I don't know what a subject is, so I don't
know what a subject offset is supposed to be. Still, think about the use-case:
would someone need this offset? If so, why do they need it? What do they need
it for? Hopefully, that will help you answer the question.

 *   Should endPos(n) return the offset AT the end of the n-th capturing
 group, thus enforcing the invariant matchedLength(n) = endPos(n) -
 startPos(n) + 1 and implying that a capturing group of of length 0
 returns endPos(n) = startPos(n) - 1 (which could seem strange on a
 first look)? Or do you prefer endPos(n) to return the offset plus
 one (i.e. immediately after the end of substring captured by the
 n-th group), having then matchedLength(n) = endPos(n) -
 startPos(n)? (3rd option: remove endPos(n) entirely)

How is this even a problem? Under which circumstances is the triad start,
length, end not holding?

endPos should be one after the last character matched, so that in all
circumstances
end = start + length

This holds for all containers, like QString, QByteArray, QVector, etc. If this
is difficult to visualise in the API, remove the end methods and keep only
start and length.

 Does a string exactly match a pattern?

   Version 1
  QString str(a string);
  bool matches = str.contains(QRegularExpression(\\Aa str\\w+\\z));
  // matches == true

A non-initiated like me might write ^a str\\w+$. I'd expect that to work
and, by default, ^ is the beginning of the string and $ the end. Note I did
not set MultilineOption.

  for (QRegularExpressionMatch match = re.match(str);
 match.hasMatch(); ++match)
  substrings  match.cap();

This one mixes STL-style methods (operator++) with Java-style ones. Either we
do:

for (match = re.match(); match != re.end(); ++match)

or we do:

match = re.match();
while (match.hasNext()) {
/* whatever */
match.next();
}

--
Thiago Macieira - thiago.macieira (AT) intel.com
  Software Architect - Intel Open Source Technology Center
 Intel Sweden AB - Registration Number: 556189-6027
 Knarrarnäsgatan 15, 164 40 Kista, Stockholm, Sweden


signature.asc
Description: This is a digitally signed message part.
___
Development mailing list
Development@qt-project.org
http://lists.qt-project.org/mailman/listinfo/development


Re: [Development] QRegularExpression -- first round of API review

2012-01-11 Thread Giuseppe D'Angelo
2012/1/11 Thiago Macieira thiago.macie...@intel.com:

Hi, thanks for the reply. :)

 First, you'll need to get a Nokian to import the PCRE sources. You cannot
 submit them to Gerrit (not even to the commit you made) because that violates
 the CLA. You're not the author. Please don't submit the PCRE code again --
 just pretend it's there.

No problem at all. We can wait until the 8.30 release and import that.
As a future-proofing question: a Nokian is needed to upgrade PCRE as
well, right?

Btw, I'm pushing more updated stuff on gitorius, including some method
documentation (I'm happier to push there since I can push even small
bugfixes without causing emails to be sent, sanity checks, etc)
https://qt.gitorious.org/~peppe/qt/peppes-qtbase/blobs/pcreregexp/src/corelib/tools/qregularexpression.h
https://qt.gitorious.org/~peppe/qt/peppes-qtbase/blobs/pcreregexp/src/corelib/tools/qregularexpression.cpp

 The API looks now a lot more digestible. There are still a few methods that I
 will need the documentation for, as I can't guess what they are from their
 name (subject is probably an RE term that I don't know).

Subject simply means the string you're matching your regexp
against, so there's no confusion when one talks about strings
(there are the pattern string and the subject string).

 The API around the
 captured texts may need a few more rounds of discussion. The name cap
 appeared in Qt 3 and if we're not able to keep source compatibility with Qt 4
 anyway, maybe it's time to fix it too.

Any suggestion is welcome (a verbose captured(int nth))? I could also rename
startPos / endPos to startPosition / endPosition.

 The iterating methods, which are the
 cool thing about this API, seem to be lost. I don't see how to get the
 contents of that match.

What do you mean?

 Specific questions:

     *   QRegularExpressionMatch::captureCount returns actually the highest
         index of a capturing group that matched something. Ideas?
         (lastCapturedIndex?)

 It seems that they are the same thing. captureCount looks fine if the other
 methods also have capture in the name. Does this return the number of named
 captures too? E.g. imagine I have two named captures in my RE and nothing
 else. If they match, will that return 2?

Yes, but because you get *three* capturing groups: the implicit group
0 (the whole match) and the two named ones. Therefore, the highest
index that captured is 2. If the second didn't capture, then
captureCount() would return 1. But if the first didn't capture and the
second did, then the returned value is still 2! The idea is that you
can use this value know to extract all the captured groups.

The long explaination is that, in Perl regexps, a named capturing
group has also the ordinary integer number, so you can get the two
captures with cap(1) and cap(2) as well as cap(first) and
cap(second).  This has interesting consequences with the branch
operator (|...) and/or with duplicated names.

My doubt was a bit more general: a method called captureCount
(although on the *match* class) could mislead people into thinking
that that's the number of capturing groups in the regular expression
itself, and not related about what's the index of the last capturing
group that matched.

 If my RE has a capture that is optional and fails to match, how do I find out?
 Imagine:

        rx = /(foo)?(bar)/
        rx =~ bar

 In this case, the first capture failed to match anything. How do I know that 
 in
 the API?

In that case cap(1) will return a null (not an empty!) QString. I can
add some convenience for that (hasCaptured()?).

 How is this even a problem? Under which circumstances is the triad start,
 length, end not holding?

The question was which triad should be holding?

 endPos should be one after the last character matched, so that in all
 circumstances
        end = start + length

 This holds for all containers, like QString, QByteArray, QVector, etc. If this
 is difficult to visualise in the API, remove the end methods and keep only
 start and length.

Ok, then I'll do it.

My point was that end = start + length implies that the actual ending
is one codepoint less than what the end value is.
F.i.: abc =~ /abc/
 - start = 0
 - length = 3
 - end = 3 (But the actual ending offset is 2. string.at(3) is even
out of bounds)

 Does a string exactly match a pattern?

   Version 1
      QString str(a string);
      bool matches = str.contains(QRegularExpression(\\Aa str\\w+\\z));
      // matches == true

 A non-initiated like me might write ^a str\\w+$. I'd expect that to work
 and, by default, ^ is the beginning of the string and $ the end. Note I did
 not set MultilineOption.

You're right about multiline -- without it, \A and ^ are the same
thing. In the general case you would use \A. You still need \z and not
$ though, because $ matches even a newline before the ending (that is,
a string\n matches ^a str\\w+$).

 This one mixes STL-style methods (operator++) with Java-style ones. Either 

Re: [Development] QRegularExpression -- first round of API review

2012-01-11 Thread Thiago Macieira
On Wednesday, 11 de January de 2012 17.53.17, Giuseppe D'Angelo wrote:
 2012/1/11 Thiago Macieira thiago.macie...@intel.com:

 Hi, thanks for the reply. :)

  First, you'll need to get a Nokian to import the PCRE sources. You cannot
  submit them to Gerrit (not even to the commit you made) because that
  violates the CLA. You're not the author. Please don't submit the PCRE
  code again -- just pretend it's there.

 No problem at all. We can wait until the 8.30 release and import that.
 As a future-proofing question: a Nokian is needed to upgrade PCRE as
 well, right?

You cannot submit any changes you have not authored / don't have the copyright
on. The exactly language of the CLA you can find in the CLA :-)

  The API looks now a lot more digestible. There are still a few methods
  that I will need the documentation for, as I can't guess what they are
  from their name (subject is probably an RE term that I don't know).

 Subject simply means the string you're matching your regexp
 against, so there's no confusion when one talks about strings
 (there are the pattern string and the subject string).

Ok. But then see below...

  The API around the
  captured texts may need a few more rounds of discussion. The name cap
  appeared in Qt 3 and if we're not able to keep source compatibility with
  Qt 4 anyway, maybe it's time to fix it too.

 Any suggestion is welcome (a verbose captured(int nth))? I could also
 rename startPos / endPos to startPosition / endPosition.

I'd also add captured somewhere in the names of those too.

  The iterating methods, which are the
  cool thing about this API, seem to be lost. I don't see how to get the
  contents of that match.

 What do you mean?

The are at the bottom, seemingly useless. They're lost in the sea of other
methods. I could not find other methods that operated on the iterated
captures. I only found the random-access captures (cap  family) and the
global result ones.

  Specific questions:
  *   QRegularExpressionMatch::captureCount returns actually the
  highest
  index of a capturing group that matched something. Ideas?
  (lastCapturedIndex?)
 
  It seems that they are the same thing. captureCount looks fine if the
  other
  methods also have capture in the name. Does this return the number of
  named captures too? E.g. imagine I have two named captures in my RE and
  nothing else. If they match, will that return 2?

 Yes, but because you get *three* capturing groups: the implicit group
 0 (the whole match) and the two named ones. Therefore, the highest
 index that captured is 2. If the second didn't capture, then
 captureCount() would return 1. But if the first didn't capture and the
 second did, then the returned value is still 2! The idea is that you
 can use this value know to extract all the captured groups.

Then call it lastCaptureIndex, I guess. The count would have to be 3.

 My doubt was a bit more general: a method called captureCount
 (although on the *match* class) could mislead people into thinking
 that that's the number of capturing groups in the regular expression
 itself, and not related about what's the index of the last capturing
 group that matched.

Keep the expression class with captureCount, as it's the number of groups in
the expression, not the number of captures succeeded.

 In that case cap(1) will return a null (not an empty!) QString. I can
 add some convenience for that (hasCaptured()?).

Convenience would be nice. Name TBD.

  endPos should be one after the last character matched, so that in all
  circumstances
 end = start + length
 
  This holds for all containers, like QString, QByteArray, QVector, etc. If
  this is difficult to visualise in the API, remove the end methods and
  keep only start and length.

 Ok, then I'll do it.

 My point was that end = start + length implies that the actual ending
 is one codepoint less than what the end value is.
 F.i.: abc =~ /abc/
  - start = 0
  - length = 3
  - end = 3 (But the actual ending offset is 2. string.at(3) is even
 out of bounds)

That is correct. end is one past the last last valid. It's the first that isn't
included.

 You're right about multiline -- without it, \A and ^ are the same
 thing. In the general case you would use \A. You still need \z and not
 $ though, because $ matches even a newline before the ending (that is,
 a string\n matches ^a str\\w+$).

But does it match a string\nfoo ?

 Right. I'll devise something then. I'm not particulary happy with
 either of them, because I can't imagine a good idea of having an
 operator== / operator!= on a match result, and hasNext is non
 trivial -- it actually requires doing the match. As of now this last
 one can be rewritten as

 match = re.match();
 while (match.hasMatch())  {
  /* ... */
 match.advanceMatch();
  }

 which is indeed not consistent with Java iterators naming.

I agree. The result being also iteratable is weird. If it's not too complex,
having a 

Re: [Development] QRegularExpression -- first round of API review

2012-01-11 Thread Thiago Macieira
On Wednesday, 11 de January de 2012 16.43.11, Thiago Macieira wrote:
  Subject simply means the string you're matching your regexp
  against, so there's no confusion when one talks about strings
  (there are the pattern string and the subject string).

 Ok. But then see below...

It didn't come up.

It was about subjectOffset. If the subject is the string being matched against,
then there is no offset.

--
Thiago Macieira - thiago.macieira (AT) intel.com
  Software Architect - Intel Open Source Technology Center
 Intel Sweden AB - Registration Number: 556189-6027
 Knarrarnäsgatan 15, 164 40 Kista, Stockholm, Sweden


signature.asc
Description: This is a digitally signed message part.
___
Development mailing list
Development@qt-project.org
http://lists.qt-project.org/mailman/listinfo/development


Re: [Development] QRegularExpression -- first round of API review

2011-12-16 Thread Oswald Buddenhagen
On Fri, Dec 16, 2011 at 12:07:26AM +, ext Giuseppe D'Angelo wrote:
 On 15 December 2011 19:45, Oswald Buddenhagen oswald.buddenha...@nokia.com 
 wrote:
      void setPatternOptions(PatternOptions options, bool on = true);
 
  i don't like that too much, because *un*setting options en masse sucks -
  you need to make a block of options to set, and a block of options
  to delete.
 I copied the setXXX(flags, bool on = true) from
 QPainter::setRenderHints, I was quite sure it's also used somewhere
 else. So are you suggesting a simple setPatternOptions(options) that
 simply sets (as in *assigns*) the argument as the current option set?
 
that's not what i meant. the problem with straight setters is that they
are not very forward-compatible: if something else sets some flags we
don't know about yet, we'll inadvertently reset them. therefore, you
need some api to un-/set flags selectively, either with a bool parameter
of a whole mask.
but then, this is not relevant for a regular expression matcher, so i
agree with andreS and thiago that it should be a simple setter.

regards
___
Development mailing list
Development@qt-project.org
http://lists.qt-project.org/mailman/listinfo/development


[Development] QRegularExpression -- first round of API review

2011-12-15 Thread Giuseppe D'Angelo
Hi everybody. Sorry for the length of thjis message, but doing API
reviews by mail is hard, and I needed to explain many decisions here and
there (and, of course, the API itself). :-(

Attached to this mail (and also here: http://pastebin.com/KzsGFXJC --
if you don't want to download and open a file) you can find a stripped
down version of the QRegularExpression API I'm working on. It's
deliberately stripped down because at this stage I'd like to have
feedbacks about the API and not about the code itself. We're still free
of changing all parts of it, so, even if you dislike the tiniest bit of
something, please go ahead and say it.

And it's deliberaltely without comments, so the reader's exercise now is
reading it and finding out if it's understandable at a first glance.
Once you're done, I'll start commenting the non-obvious points.

General
As discussed, two implicitly shared, copy on write, reentrant value
classes: the regexp itself (pattern + options) and the result of a match
against a subject string (fixes the T7 in
http://developer.qt.nokia.com/wiki/Regexp_engine_in_Qt5 ). The regexp
also stores the compiled form of itself as an hidden optimization --
most const methods actually do modify the shared instance. The backend
is PCRE so the feature set is quite big (fixing T1-T4). I have already
planned to expand the enums to support even more options.

QRegularExpression
  PatternOptions
The PatternOptions flag contains the pattern options, i.e. modify the
characters that the pattern should match.

*   CaseInsensitiveOption is the /i flag (case Insensitive Match). In
QRegExp it was set by a separate method (setCaseSensitivity),
holding a Qt::CaseSensitivity value. Here IMHO it just clutters the
API to have different getters and setters for the options.

*   InvertedGreedinessOption is the /U flag (not present in Perl)
inverts the greediness of the quantifiers: they became not greedy by
default, and greedy if followed by ?. See also the T3 in the list.

*   DotMatchesEverythingOption is the /s flag (Single line match): the
dot (.) metacharacter matches everything including newlines.

*   MultilineMatchOption is the /m flag (Multi line match): the caret
(^) and the dollar ($) metacharacters are allowed to match at (and
just before) any newline inside the string (as well as the ordinary
begin and end of the string)

*   ExtendedPatternSyntaxOption is the /x flag (eXtended pattern
syntax): whitespace data inside the regexp is ignored, and an
unescaped # outside a character class starts a comment that goes
till the next newline (inside the pattern!). It's probably not very
useful in C++, where one can use the default rules for string
literals, but it may be handy if one stores complicated regexps in
an external file.

*   UseUnicodePropertiesOption enables the use Unicode properties for
character classes such as \w, \s, \d, etc. (and obviously their
counterparts \W, \S, etc.), instead of making them just match the
standard ASCII ranges

*   AllowDuplicatedNamesOption allows duplicated named capturing
patterns (that is (?NAME ... ) patterns)

*   DontCaptureOption disables the use of unnamed capturing parenthesis
-- they all behave as if they all were (?:...) parenthesis. Named
ones continue to work.

*   MatchFirstlineOnlyOption restricts an unanchored pattern to start to
match only before the first newline (although the match can continue
over it).

*   DollarMatchesOnlyAtEndOption changes the meaning of the dollar ($)
metacharacter to match only at the very end of the string, while
usually it matches at the end AND at a newline right before the end

  CompileHints
The CompileHints are hints to the regexp engine itself. Basically they
control the optimization of the compiled regexp (the PCRE API is
pcre_study).

*   StudyPatternCompileHint tells PCRE engine to study the pattern,
which basically means performing very simple optimization tasks
(f.i., the engine records the minimum size of a string to get a
successful match, and if the subject string is shorter the match
aborts immediately)

*   JitCompileHint implies StudyPatternCompileHint, and enables the JIT
compilation of the pattern. Of course this has a cost, and there's a
tradeoff to consider between spending time compilating the regexp
instead of just matching against a string.

  MatchOptions
The MatchOptions control the match itself.

*   AnchoredMatch forces the match to start at the first matching point
inside the subject string, even if the regexp doesn't have carets or
other metacharacters that normally force such a 

Re: [Development] QRegularExpression -- first round of API review

2011-12-15 Thread joao.abecasis
Hi Giuseppe,

I'll start by saying tl;dr. But I didn't stop because of your e-mail, I'm 
actually referring to the API.

I started looking at it and it seems too cluttered. Specially this early in the 
process. It's hard to review something that is trying to be everything or maybe 
it's just exposing too many things.

I would like to challenge you to do the opposite: give us the minimum API that 
can offer the most important features. To get there start from short 
self-contained code examples showing the features in use (I'm interested in 
seeing those) -- API design is about how it gets used, not so much about the 
number of features.

For instance, in Perl, regular expressions are essentially a string, a couple 
of operators and some magic variables for the captures (and lots of magic 
everywhere...). (Now, I'm not saying we should do regular expressions in Qt as 
they are done in Perl)

The API, as it stands seems too hard-wired to the implementation or feature set 
of the engine giving it little opportunity for evolving. Another issue is that 
it is hard for me to see if (or where) the API itself is imposing performance 
penalties on the implementation.

Finally, note that it is usually ok to add API as time goes by, but our BC 
promises mean we have to maintain most of what we add for a long time.

Cheers,


João


From: development-bounces+joao.abecasis=nokia@qt-project.org 
[development-bounces+joao.abecasis=nokia@qt-project.org] on behalf of ext 
Giuseppe D'Angelo [dange...@gmail.com]
Sent: 15 December 2011 17:43
To: development@qt-project.org
Subject: [Development] QRegularExpression -- first round of API review

Hi everybody. Sorry for the length of thjis message, but doing API
reviews by mail is hard, and I needed to explain many decisions here and
there (and, of course, the API itself). :-(

Attached to this mail (and also here: http://pastebin.com/KzsGFXJC --
if you don't want to download and open a file) you can find a stripped
down version of the QRegularExpression API I'm working on. It's
deliberately stripped down because at this stage I'd like to have
feedbacks about the API and not about the code itself. We're still free
of changing all parts of it, so, even if you dislike the tiniest bit of
something, please go ahead and say it.

And it's deliberaltely without comments, so the reader's exercise now is
reading it and finding out if it's understandable at a first glance.
Once you're done, I'll start commenting the non-obvious points.

General
As discussed, two implicitly shared, copy on write, reentrant value
classes: the regexp itself (pattern + options) and the result of a match
against a subject string (fixes the T7 in
http://developer.qt.nokia.com/wiki/Regexp_engine_in_Qt5 ). The regexp
also stores the compiled form of itself as an hidden optimization --
most const methods actually do modify the shared instance. The backend
is PCRE so the feature set is quite big (fixing T1-T4). I have already
planned to expand the enums to support even more options.

QRegularExpression
  PatternOptions
The PatternOptions flag contains the pattern options, i.e. modify the
characters that the pattern should match.

*   CaseInsensitiveOption is the /i flag (case Insensitive Match). In
QRegExp it was set by a separate method (setCaseSensitivity),
holding a Qt::CaseSensitivity value. Here IMHO it just clutters the
API to have different getters and setters for the options.

*   InvertedGreedinessOption is the /U flag (not present in Perl)
inverts the greediness of the quantifiers: they became not greedy by
default, and greedy if followed by ?. See also the T3 in the list.

*   DotMatchesEverythingOption is the /s flag (Single line match): the
dot (.) metacharacter matches everything including newlines.

*   MultilineMatchOption is the /m flag (Multi line match): the caret
(^) and the dollar ($) metacharacters are allowed to match at (and
just before) any newline inside the string (as well as the ordinary
begin and end of the string)

*   ExtendedPatternSyntaxOption is the /x flag (eXtended pattern
syntax): whitespace data inside the regexp is ignored, and an
unescaped # outside a character class starts a comment that goes
till the next newline (inside the pattern!). It's probably not very
useful in C++, where one can use the default rules for string
literals, but it may be handy if one stores complicated regexps in
an external file.

*   UseUnicodePropertiesOption enables the use Unicode properties for
character classes such as \w, \s, \d, etc. (and obviously their
counterparts \W, \S, etc.), instead of making them just match the
standard ASCII ranges

*   AllowDuplicatedNamesOption allows

Re: [Development] QRegularExpression -- first round of API review

2011-12-15 Thread Giuseppe D'Angelo
On 15 December 2011 19:45, Oswald Buddenhagen
oswald.buddenha...@nokia.com wrote:
 On Thu, Dec 15, 2011 at 04:43:49PM +, ext Giuseppe D'Angelo wrote:
    pos, matchedLength, endPos

 inconsistent naming

Well, pos and matchedLength come straight from QRegExp and I kept
them. But please, any suggestion is welcome! (I'm not even a native
English speaker). (In case it isn't clear: endPos() == pos() +
matchedLength() - 1)

     void setPatternOptions(PatternOptions options, bool on = true);

 i don't like that too much, because *un*setting options en masse sucks -
 you need to make a block of options to set, and a block of options
 to delete.
I copied the setXXX(flags, bool on = true) from
QPainter::setRenderHints, I was quite sure it's also used somewhere
else. So are you suggesting a simple setPatternOptions(options) that
simply sets (as in *assigns*) the argument as the current option set?

 that's half as bad if all options are off by default, but
 then one has to wonder whether a way to unset them actually makes sense
 in the first place.

Yes, all options are off by default. I don't see any particular
problem with turning an option on and then off again. What do you
think about this?

 fwiw, the usual elegant solution is having a value and a mask parameter.
 the mask could have two magic values meaning un-/set all asserted in
 vlaue to mean effectively what your bool means. the default argument
 would be the magic value for setting, so the standard syntax would set
 all flags named in the first argument.

I'm not sure I understand this point completely -- isn't that what the
method does? I.e.
  rx.setPatternOptions(CaseInsensitiveOption | DotMatchesEverythingOption)
turns on those two options, and leaves the others unchanged.

Thanks for the feedback.

Cheers,
-- 
Giuseppe D'Angelo
___
Development mailing list
Development@qt-project.org
http://lists.qt-project.org/mailman/listinfo/development


Re: [Development] QRegularExpression -- first round of API review

2011-12-15 Thread Thiago Macieira
On Thursday, 15 de December de 2011 22.53.19, joao.abeca...@nokia.com wrote:
 Hi Giuseppe,

 I'll start by saying tl;dr. But I didn't stop because of your e-mail, I'm
 actually referring to the API.

Hi as well Giuseppe

I did read most of your email :-) Thanks for the effort so far.

I'd like to start by saying I agree with Ossi: the test/set way of setting
flags is un-Qt-ish. I know it exists in a few places, but they are the vast
minority. I'd prefer a regular pair of getter and setter on the QFlags type.

 I started looking at it and it seems too cluttered. Specially this early in
 the process. It's hard to review something that is trying to be everything
 or maybe it's just exposing too many things.

João is also right: it seems you're trying to provide all the power of PCRE to
the user in the first go. It's  good that you're exploring everything, so we
know which way we may go in the future, but I think we can trim down on the
features at the first iteration of the API.

The next thing that I find weird is the set of match functions. My first
reaction was to ask you to call them indexIn, but since they don't return an
index but a match object, the name is fine. But still, do we need a match,
partialMatch and exactMatch? Also note that the boolean in partialMatch is
also un-Qt-ish, so you'll need to replace it with an enum as well. If you
do, you may as well merge all three functions.

--
Thiago Macieira - thiago.macieira (AT) intel.com
  Software Architect - Intel Open Source Technology Center
 Intel Sweden AB - Registration Number: 556189-6027
 Knarrarnäsgatan 15, 164 40 Kista, Stockholm, Sweden


signature.asc
Description: This is a digitally signed message part.
___
Development mailing list
Development@qt-project.org
http://lists.qt-project.org/mailman/listinfo/development


Re: [Development] QRegularExpression -- first round of API review

2011-12-15 Thread Giuseppe D'Angelo
On 15 December 2011 22:53,  joao.abeca...@nokia.com wrote:
 Hi Giuseppe,

Hi João,
thanks for the comments.

 I'll start by saying tl;dr. But I didn't stop because of your e-mail, I'm 
 actually referring to the API.

 I started looking at it and it seems too cluttered. Specially this early in 
 the process. It's hard to review something that is trying to be everything or 
 maybe it's just exposing too many things.

 I would like to challenge you to do the opposite: give us the minimum API 
 that can offer the most important features. To get there start from short 
 self-contained code examples showing the features in use (I'm interested in 
 seeing those) -- API design is about how it gets used, not so much about the 
 number of features.

Then we should discuss what those important features are. In my mail
I pointed out how the API is supposed to fix the T1-T7 issues, but I
can understand not all of them have the same priority.

(Other sub-objectives are 1) keeping the QRegExp feature set 2) keep
existing stuff working; but I guess that if QRegExp gets moved in a
stand-alone module, we simply can make other modules in Qt depend on
that one until we have a replacement, and thus not breaking anything.)

I'd like to collect feedbacks about these priorities, especially from
the intended users of the API! If you have the chance, please tell
developers to give their opinions.

Continuing: you're totally right about the missing examples. I'll
write some down and then post them.

 For instance, in Perl, regular expressions are essentially a string, a couple 
 of operators and some magic variables for the captures (and lots of magic 
 everywhere...). (Now, I'm not saying we should do regular expressions in Qt 
 as they are done in Perl)

That's basically the same here, but there are also options to control
the optimization level (you may not want to lose time optimizizing,
esp. for quick, one-shot matches; or prefer to optimize in case you
have to filter a model of 100k+ strings) and the match behaviour (some
options are actually implemented in Perl -- see what I wrote about the
/g algorithm -- but they are not avaiable to the user).

At high level:
- QRegularExpression is the pattern+options, as in Perl's /pattern/options.
- QRegularExpressionMatch is the rough equivalent of Perl's capturing variables:
  - $ - cap(0)
  - $1...$n - cap(n)
  - $+[n] - pos(n)
  - $-[n] - endPos(n)
  - $+{str} - capturedText(str)
  - $-{str} - capturedTextsForName(str)
 etc.

Plus, the match object is used to implement /g, that is, to repeatedly
match on a string (as convenience API).

 The API, as it stands seems too hard-wired to the implementation or feature 
 set of the engine giving it little opportunity for evolving.

Where do you feel I could improve this? F.i. removing some enum is
easy, but I guess the problem isn't there.

 Another issue is that it is hard for me to see if (or where) the API itself 
 is imposing performance penalties on the implementation.

The biggest hit as of now is obviously the conversions all around the
place from/to UTF-8, which have a memory and time impact. This will
eventually be solved when PCRE ships with native UTF-16 support (see
Stephen Kelly's mail about this). Apart from that the implementation
is almost straightforward -- just call the method in PCRE that does
the work and grab the result.

The other hit I talked about is inside the implementation of iterating
backwards (and lastMatch). I cannot see any good general way to
implement it, apart from iterating forward from the beginning (and
eventually caching such results. A match result in terms of captured
strings is quite small -- we just keep the offsets inside the
subject). The fact is that matching backwards is simply something a
bit outside the regexp world.

Cheers,
--
Giuseppe D'Angelo
___
Development mailing list
Development@qt-project.org
http://lists.qt-project.org/mailman/listinfo/development


Re: [Development] QRegularExpression -- first round of API review

2011-12-15 Thread Giuseppe D'Angelo
2011/12/16 Thiago Macieira thiago.macie...@intel.com:
 I did read most of your email :-) Thanks for the effort so far.

Hero :-) Thank you for reading!

 I'd like to start by saying I agree with Ossi: the test/set way of setting
 flags is un-Qt-ish. I know it exists in a few places, but they are the vast
 minority. I'd prefer a regular pair of getter and setter on the QFlags type.

No problem. Will do.

 I started looking at it and it seems too cluttered. Specially this early in
 the process. It's hard to review something that is trying to be everything
 or maybe it's just exposing too many things.

 João is also right: it seems you're trying to provide all the power of PCRE to
 the user in the first go. It's  good that you're exploring everything, so we
 know which way we may go in the future, but I think we can trim down on the
 features at the first iteration of the API.

Ok, see the other message about this.

 The next thing that I find weird is the set of match functions. My first
 reaction was to ask you to call them indexIn, but since they don't return an
 index but a match object, the name is fine. But still, do we need a match,
 partialMatch and exactMatch?

* match: I think it should stay :-)

* exactMatch: it's convenience, and it's there... because QRegExp
offered it (although there it was handy to perform partial matching).
As I wrote, an user can get the very same result by wrapping a pattern
between \A and \z (and that's more or less how it would be
implemented). It's just a matter of deciding if we want it or not. Up
to the consensus.

* lastMatch (again, being there because QRegExp offered it) this is
tricky to implement correctly and efficiently, and needs the same
discussion about iterating backwards.

* partialMatch: the hard version may be removed, for now, in favour
of adding, later, a more general way to perform incremental matching
over non contiguous chunks of data (and iterating, as in /g, over such
matches). The soft version might stay there for validators, if you
feel that its behaviour is the right one.

 Also note that the boolean in partialMatch is
 also un-Qt-ish, so you'll need to replace it with an enum as well. If you
 do, you may as well merge all three functions.

Surely I can merge match, exactMatch and partialMatch in only one
method, with an enum to control the behaviour.
OTOH lastMatch may require a different offset (from the *end* of the
string), so I'm not sure if merging it as well.

Cheers,
-- 
Giuseppe D'Angelo
___
Development mailing list
Development@qt-project.org
http://lists.qt-project.org/mailman/listinfo/development


Re: [Development] QRegularExpression -- first round of API review

2011-12-15 Thread Andre Somers
Op 16-12-2011 1:07, Giuseppe D'Angelo schreef:

 fwiw, the usual elegant solution is having a value and a mask parameter.
 the mask could have two magic values meaning un-/set all asserted in
 vlaue to mean effectively what your bool means. the default argument
 would be the magic value for setting, so the standard syntax would set
 all flags named in the first argument.
 I'm not sure I understand this point completely -- isn't that what the
 method does? I.e.
rx.setPatternOptions(CaseInsensitiveOption | DotMatchesEverythingOption)
 turns on those two options, and leaves the others unchanged.

Really? Now to me _that_ is unexpected behaviour. I would have expected 
that this method call would set the options to exactly 
(CaseInsensitiveOption | DotMatchesEverythingOption), no matter what the 
previous value of the matchingOptions value was.

André
___
Development mailing list
Development@qt-project.org
http://lists.qt-project.org/mailman/listinfo/development