OK I think likely this is a bug in RAS. And we are just seeing the difference in how Oracle's & IBM's JREs handle an unpaired surrogate...
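To make the JRE difference concrete, here is a minimal sketch (not the patch itself). `StrictUtf8` and `encodeStrict` are names made up for this example. As far as I know, the javadoc for `String.getBytes(String)` leaves the behaviour unspecified when the string cannot be encoded, which would explain why one JRE substitutes `?` (63) and the other drops the char. A `CharsetEncoder` set to `CodingErrorAction.REPORT` should instead reject the unpaired surrogate on any JRE.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class StrictUtf8 {
    // Hypothetical helper: encodes s as UTF-8 and throws instead of silently
    // dropping or replacing malformed input such as an unpaired surrogate.
    static byte[] encodeStrict(String s) throws CharacterCodingException {
        CharsetEncoder encoder = StandardCharsets.UTF_8.newEncoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        ByteBuffer buf = encoder.encode(CharBuffer.wrap(s));
        byte[] out = new byte[buf.remaining()];
        buf.get(out);
        return out;
    }

    public static void main(String[] args) throws Exception {
        // Lowercase l, the unpaired high surrogate U+D9FF, uppercase E.
        String bad = "l\uD9FFE";

        // Lenient path: the result depends on the JRE, so no output is claimed here.
        System.out.println(Arrays.toString(bad.getBytes("UTF-8")));

        // Strict path: the malformed input is reported as an exception.
        try {
            encodeStrict(bad);
            System.out.println("encoded without error");
        } catch (CharacterCodingException e) {
            System.out.println("rejected: " + e);
        }
    }
}
```

If the test encoded its strings this way, a bad string would fail loudly at the point of encoding rather than as a confusing automaton mismatch later on.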
Lemme work out a patch...

Mike

On Mon, Jul 26, 2010 at 4:13 PM, Michael McCandless <luc...@mikemccandless.com> wrote:
> Yeah that char is a high surrogate which is unpaired, which is no good
> -- it's invalid. Cool, though, that Google puts us first when you
> search on this character :)
>
> Can you figure out how that bad string was created? That "if
> (random.nextBoolean())" either creates the string randomly (which
> should never return unpaired surrogate), or, calls
> RandomAcceptedString.getRandomAcceptedString... maybe the bug is in
> RAS.
>
> Mike
>
> On Mon, Jul 26, 2010 at 3:41 PM, Shai Erera <ser...@gmail.com> wrote:
>> From here: http://www.fileformat.info/info/unicode/char/d9ff/index.htm
>>
>> Looks like that character is not a valid Unicode character, and perhaps the
>> IBM's JVM behaves correctly? Robert - you're the Unicode expert :).
>>
>> Shai
>>
>> On Mon, Jul 26, 2010 at 10:40 PM, Shai Erera <ser...@gmail.com> wrote:
>>>
>>> I don't know what was the thing w/ the strings generated before, but now I
>>> ran the test again w/ the same seed and it generates the same strings. So at
>>> least it seems there are no problems w/ the Random class :).
>>>
>>> However, the string l.E fails w/ the IBM JVM and succeeds w/ SUN's. Any
>>> ideas why? What does the test check anyway?
>>>
>>> I ran TRR2, and set the regexp to always be "l.E" and the test passes. The
>>> failure comes from
>>>
>>> junit.framework.AssertionFailedError: expected:<true> but was:<false>
>>>     at org.apache.lucene.util.automaton.TestUTF32ToUTF8.assertAutomaton(TestUTF32ToUTF8.java:199)
>>>     at org.apache.lucene.util.automaton.TestUTF32ToUTF8.testRandomRegexes(TestUTF32ToUTF8.java:171)
>>>
>>> I've set regexp to "l.E", and also 'string' inside assertAutomaton to
>>> "\u006C\uD9FF\u0045". The byte[] returned from string.getBytes("UTF-8") are
>>> [108, 69]. It just ignores the middle character. Perhaps that's why the test
>>> fails?
>>>
>>> When I run this w/ SUN's JVM, the bytes returned are [108, 63, 69].
>>>
>>> If I manually set the bytes, using IBM's, to [108, 63, 69], then the test
>>> passes.
>>>
>>> Interestingly, Googling for \uD9FF brings back LUCENE-2019 as the first
>>> result :). I'll dig some more into this character, and why the IBM and SUN
>>> JVMs return different byte[] representation for the same sequence of
>>> characters. If you already spot the problem, please let me know.
>>>
>>> BTW, the test calls _TestUtil.getRandomMultiplier on every iteration loop,
>>> which goes and checks a system property. Perhaps we can extract it to a
>>> variable, or include a static constant in LuceneTestCase(J4) or something?
>>>
>>> Shai
>>>
>>> On Mon, Jul 26, 2010 at 9:22 PM, Robert Muir <rcm...@gmail.com> wrote:
>>>>
>>>> maybe there is a bug in ibm's random generator :)
>>>>
>>>> On Mon, Jul 26, 2010 at 11:50 AM, Michael McCandless
>>>> <luc...@mikemccandless.com> wrote:
>>>>>
>>>>> That's VERY spooky that w/ a fixed seed you see different random
>>>>> regexps being made.
>>>>>
>>>>> Mike
>>>>>
>>>>> On Mon, Jul 26, 2010 at 11:40 AM, Shai Erera <ser...@gmail.com> wrote:
>>>>> > Ok I've dug deeper into the test. I set the random seed to
>>>>> > -9029631602016965389L in setUp(), and discovered that on the 4th iteration
>>>>> > it breaks. For some reason though, AutomatonTestUtil.randomRegex generates
>>>>> > different strings every time I run the test, even though it uses the same
>>>>> > Random object w/ the same seed ...
>>>>> >
>>>>> > Anyway, one of the regexes that failed was this "l.E" (w/o the quotes) and I
>>>>> > think it's a lowercase L, '.' (dot) and 'E' (uppercase). Hope this helps.
>>>>> >
>>>>> > Shai
>>>>> >
>>>>> > On Mon, Jul 26, 2010 at 6:23 PM, Robert Muir <rcm...@gmail.com> wrote:
>>>>> >>
>>>>> >> sounds nasty... it's good you are running the tests with this different
>>>>> >> jvm...
>>>>> >>
>>>>> >> On Mon, Jul 26, 2010 at 11:21 AM, Shai Erera <ser...@gmail.com> wrote:
>>>>> >>>
>>>>> >>> Tried to run it w/ SUN JRE6 and it succeeds! I've tried several times
>>>>> >>> and it succeeds every time. However, when I revert back to IBM's, it
>>>>> >>> fails immediately.
>>>>> >>>
>>>>> >>> I can help w/ the debug, if you give me a hint where to look :).
>>>>> >>>
>>>>> >>> Shai
>>>>> >>>
>>>>> >>> On Mon, Jul 26, 2010 at 5:57 PM, Shai Erera <ser...@gmail.com> wrote:
>>>>> >>>>
>>>>> >>>> Sorry for the delayed response.
>>>>> >>>>
>>>>> >>>> I ran it a couple more times, from Eclipse and Ant, and each time it
>>>>> >>>> fails (amazing!), w/ different seeds. More seeds that fail:
>>>>> >>>> NOTE: random seed of testcase 'testRandomRegexes' was: -4244174191361080127
>>>>> >>>> NOTE: random seed of testcase 'testRandomRegexes' was: -7059086272401721644
>>>>> >>>> NOTE: random seed of testcase 'testRandomRegexes' was: -1314734215611104147
>>>>> >>>>
>>>>> >>>> I use IBM JVM, tried w/ both 1.5 and 1.6 ...
>>>>> >>>>
>>>>> >>>> Mike, can we use LUCENE-2565 to track this, or would you prefer that I
>>>>> >>>> open a separate one?
>>>>> >>>>
>>>>> >>>> Shai
>>>>> >>>>
>>>>> >>>> On Mon, Jul 26, 2010 at 3:26 PM, Michael McCandless
>>>>> >>>> <luc...@mikemccandless.com> wrote:
>>>>> >>>>>
>>>>> >>>>> On a more general note...
>>>>> >>>>>
>>>>> >>>>> Any time any of you out there hit an "odd" test failure, please please
>>>>> >>>>> please do just what Shai did: take it to the dev list!
>>>>> >>>>>
>>>>> >>>>> Think of Lucene's unit tests like SETI :) We are desperately seeking
>>>>> >>>>> bugs, and you and your machine may just be lucky enough to find one...
>>>>> >>>>> go forth and buy expensive new power hungry computers just so you can
>>>>> >>>>> run the random tests over and over, seeking the bugs!
>>>>> >>>>>
>>>>> >>>>> But be sure to include that random seed when you do hit a failure...
>>>>> >>>>>
>>>>> >>>>> Mike
>>>>> >>>>>
>>>>> >>>>> On Mon, Jul 26, 2010 at 8:23 AM, Robert Muir <rcm...@gmail.com> wrote:
>>>>> >>>>> > I agree, Shai can you open a bug? I cannot reproduce, did you use an
>>>>> >>>>> > IBM JVM or another environment that might help us figure it out?
>>>>> >>>>> >
>>>>> >>>>> > On Mon, Jul 26, 2010 at 6:29 AM, Michael McCandless
>>>>> >>>>> > <luc...@mikemccandless.com> wrote:
>>>>> >>>>> >>
>>>>> >>>>> >> Hmmm this means a bug is lurking. This is the power of random testing
>>>>> >>>>> >> (that every time we all run tests, we're testing different "paths"
>>>>> >>>>> >> through the code)....
>>>>> >>>>> >>
>>>>> >>>>> >> It seems exceptionally unlikely that LUCENE-2537's changes would cause
>>>>> >>>>> >> this!
>>>>> >>>>> >>
>>>>> >>>>> >> But, unfortunately, when I plug that seed in I don't see it fail,
>>>>> >>>>> >> which is odd. I'll run a stress test to see if I can tickle the
>>>>> >>>>> >> bug... can you open a Jira issue so we don't lose track?
>>>>> >>>>> >>
>>>>> >>>>> >> Mike
>>>>> >>>>> >>
>>>>> >>>>> >> On Mon, Jul 26, 2010 at 2:57 AM, Shai Erera <ser...@gmail.com> wrote:
>>>>> >>>>> >> > Hi
>>>>> >>>>> >> >
>>>>> >>>>> >> > I was running tests on trunk (after merging the changes from
>>>>> >>>>> >> > LUCENE-2537) and received this error message:
>>>>> >>>>> >> >
>>>>> >>>>> >> > expected:<true> but was:<false>
>>>>> >>>>> >> >
>>>>> >>>>> >> > junit.framework.AssertionFailedError: expected:<true> but was:<false>
>>>>> >>>>> >> >     at org.apache.lucene.util.automaton.TestUTF32ToUTF8.assertAutomaton(TestUTF32ToUTF8.java:197)
>>>>> >>>>> >> >     at org.apache.lucene.util.automaton.TestUTF32ToUTF8.testRandomRegexes(TestUTF32ToUTF8.java:170)
>>>>> >>>>> >> >     at org.apache.lucene.util.LuceneTestCase.runBare(LuceneTestCase.java:285)
>>>>> >>>>> >> >
>>>>> >>>>> >> > NOTE: random seed of testcase 'testRandomRegexes' was: 3510820306304573866
>>>>> >>>>> >> >
>>>>> >>>>> >> > I'm sure it's related to my changes. Has anyone else seen this before?
>>>>> >>>>> >> >
>>>>> >>>>> >> > Shai
>>>>> >>>>> >> >
>>>>> >>>>> >>
>>>>> >>>>> >> ---------------------------------------------------------------------
>>>>> >>>>> >> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
>>>>> >>>>> >> For additional commands, e-mail: dev-h...@lucene.apache.org
>>>>> >>>>> >>
>>>>> >>>>> >
>>>>> >>>>> > --
>>>>> >>>>> > Robert Muir
>>>>> >>>>> > rcm...@gmail.com
>>>>> >>>>> >
>>>>> >>
>>>>> >> --
>>>>> >> Robert Muir
>>>>> >> rcm...@gmail.com
>>>>> >
>>>>
>>>> --
>>>> Robert Muir
>>>> rcm...@gmail.com
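
On the question raised in this thread of how the bad string was created: one way to narrow it down would be to check every generated string for unpaired surrogates before it is used. The following is only a sketch of such a check. `SurrogateCheck` and `hasUnpairedSurrogate` are hypothetical names for illustration, not existing Lucene test utilities.

```java
class SurrogateCheck {
    // Returns true if s contains a high surrogate that is not immediately
    // followed by a low surrogate, or a low surrogate that is not
    // immediately preceded by a high surrogate.
    static boolean hasUnpairedSurrogate(String s) {
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (Character.isHighSurrogate(c)) {
                if (i + 1 < s.length() && Character.isLowSurrogate(s.charAt(i + 1))) {
                    i++; // well-formed pair: skip the low surrogate as well
                } else {
                    return true; // high surrogate with no partner
                }
            } else if (Character.isLowSurrogate(c)) {
                return true; // low surrogate with no preceding high surrogate
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // The failing string from the thread: l, unpaired U+D9FF, E.
        System.out.println(hasUnpairedSurrogate("\u006C\uD9FF\u0045"));
        // A well-formed supplementary character (one high + one low surrogate).
        System.out.println(hasUnpairedSurrogate("\uD83D\uDE00"));
    }
}
```

Asserting `!hasUnpairedSurrogate(string)` right after each string is generated would show which of the two code paths produced the invalid input.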