[
https://issues.apache.org/jira/browse/LUCENE-1126?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12557988#action_12557988
]
Steven Rowe commented on LUCENE-1126:
-------------------------------------
In part my imprecise characterization of the process comes from what is likely
a misunderstanding of the Lucene-Java release process - when you said:
bq. I'm not positive, but couldn't this result in situations where a committer
using a 1.5 JVM could generate and commit a StandardTokenizerImpl.java that
would have a different behavior than if he was using 1.4 - all of which would
be completely independent of whether or not the release engineer of the next
release compiled the resulting grammar using 1.4?
I assumed you meant that during the release process, the lexical scanner source
(.java file) would be regenerated from the grammar (.jflex file). And in this
scenario, I meant to refer to "compile-time" as the entire build process - raw
source to jar assembly, *including* lexical scanner generation - undertaken
when producing a binary release.
But of course you're right :) . The JVM version being used during
source-generation-time (occurring prior to, and potentially not contiguously
with, bytecode-generation-time) determines the version of Unicode used to
define the meaning of "letter" and "digit".
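To make the point concrete, here is a minimal sketch of how one might check what a given JVM considers a "letter" or "digit". The class and method names are hypothetical and not part of Lucene or JFlex. It only shows that the answer comes from the running JVM's Character tables, which is what a scanner generator running on that JVM would presumably consult when expanding [:letter:] and [:digit:].

```java
// Illustrative sketch only: UnicodeVersionProbe and classify() are
// hypothetical names, not part of Lucene or JFlex.
class UnicodeVersionProbe {

    /** Reports how the *running* JVM classifies a single code point. */
    static String classify(int codePoint) {
        if (Character.isLetter(codePoint)) {
            return "letter";
        }
        if (Character.isDigit(codePoint)) {
            return "digit";
        }
        return "other";
    }

    public static void main(String[] args) {
        // The JVM version determines which Unicode version backs Character.
        System.out.println("java.version = " + System.getProperty("java.version"));

        int[] samples = {
            'A',     // LATIN CAPITAL LETTER A
            '7',     // DIGIT SEVEN
            0xFF21,  // FULLWIDTH LATIN CAPITAL LETTER A
            0x4E2D,  // a CJK unified ideograph
            ' '      // SPACE
        };
        for (int cp : samples) {
            System.out.printf("U+%04X -> %s%n", cp, classify(cp));
        }
    }
}
```

Running the same probe under two JVM versions, with code points assigned between the two corresponding Unicode versions, is one way to see where a scanner generated under each would differ.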
> Simplify StandardTokenizer JFlex grammar
> ----------------------------------------
>
> Key: LUCENE-1126
> URL: https://issues.apache.org/jira/browse/LUCENE-1126
> Project: Lucene - Java
> Issue Type: Improvement
> Components: Analysis
> Affects Versions: 2.2
> Reporter: Steven Rowe
> Priority: Minor
> Fix For: 2.4
>
> Attachments: LUCENE-1126.patch
>
>
> Summary of thread entitled "Fullwidth alphanumeric characters, plus a
> question on Korean ranges" begun by Daniel Noll on java-user, and carried
> over to java-dev:
> On 01/07/2008 at 5:06 PM, Daniel Noll wrote:
> > I wish the tokeniser could just use Character.isLetter and
> > Character.isDigit instead of having to know all the ranges itself, since
> > the JRE already has all this information. Character.isLetter does
> > return true for CJK characters though, so the ranges would still come in
> > handy for determining what kind of letter they are. I don't suppose
> > JFlex has a way to do this...
> The DIGIT macro could be replaced by JFlex's predefined character class
> [:digit:], which has the same semantics as java.lang.Character.isDigit().
> Although JFlex's predefined character class [:letter:] (same semantics as
> java.lang.Character.isLetter()) includes CJK characters, there is a way to
> handle this using JFlex's regex negation syntax {{!}}. From [the JFlex
> documentation|http://jflex.de/manual.html]:
> bq. [T]he expression that matches everything of {{a}} not matched by {{b}} is
> !(!{{a}}|{{b}})
> So to exclude CJ characters from the LETTER macro:
> {code}
> LETTER = ! ( ! [:letter:] | {CJ} )
> {code}
>
> Since [:letter:] includes all of the Korean ranges, there's no reason
> (AFAICT) to treat them separately; unlike Chinese and Japanese characters,
> which are individually tokenized, the Korean characters should participate in
> the same token boundary rules as all of the other letters.
> I looked at some of the differences between Unicode 3.0.0, which Java 1.4.2
> supports, and Unicode 5.0, the latest version, and there are lots of new and
> modified letter and digit ranges. This stuff gets tweaked all the time, and
> I don't think Lucene should be in the business of trying to track it, or take
> a position on which Unicode version users' data should conform to.
> Switching to using JFlex's [:letter:] and [:digit:] predefined character
> classes ties (most of) these decisions to the user's choice of JVM version,
> and this seems much more reasonable to me than the status quo.
> I will attach a patch shortly.
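The set-difference idiom quoted in the description, !(!a|b), is De Morgan's law applied to "a and not b". A rough Java sketch of the same logic follows. The CJ predicate here is a simplified, hypothetical stand-in (Hiragana, Katakana and the main CJK ideograph block only), not the actual {CJ} macro from the grammar, and the class name is made up for illustration.

```java
import java.util.function.IntPredicate;

// Illustrative sketch only: LetterMinusCj is a hypothetical name.
class LetterMinusCj {

    // Assumption: a simplified stand-in for the grammar's {CJ} macro.
    // The real macro covers more ranges than these two.
    static final IntPredicate CJ = cp ->
            (cp >= 0x3040 && cp <= 0x30FF)    // Hiragana and Katakana
         || (cp >= 0x4E00 && cp <= 0x9FFF);   // CJK Unified Ideographs

    // Same semantics as java.lang.Character.isLetter(), per the description.
    static final IntPredicate LETTER_CLASS = cp -> Character.isLetter(cp);

    // Mirrors  ! ( ! [:letter:] | {CJ} ) :
    // negate the union of "not a letter" and "CJ".
    static final IntPredicate LETTER = LETTER_CLASS.negate().or(CJ).negate();

    public static void main(String[] args) {
        int[] samples = {
            'a',     // Latin letter
            '7',     // digit, not a letter
            0x3042,  // HIRAGANA LETTER A
            0x4E2D,  // CJK unified ideograph
            0xAC00   // HANGUL SYLLABLE GA
        };
        for (int cp : samples) {
            // The direct form "letter and not CJ" gives the same answer.
            boolean direct = LETTER_CLASS.test(cp) && !CJ.test(cp);
            System.out.printf("U+%04X LETTER=%b direct=%b%n",
                    cp, LETTER.test(cp), direct);
        }
    }
}
```

Under these assumed ranges, the Hangul syllable stays in LETTER while the Hiragana and ideograph samples are excluded, which matches the intent of letting Korean share the token boundary rules of other letters.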