[ 
https://issues.apache.org/jira/browse/LUCENE-6874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14988791#comment-14988791
 ] 

David Smiley commented on LUCENE-6874:
--------------------------------------

Jack:  My use case, since you asked:  I've got a document store of XML content 
that provides various markup around mostly text.  These documents occasionally 
contain an NBSP.  I process them outside of Solr to produce the text I want 
indexed/stored -- it's not XML any more.  An NBSP entity, if present, is 
converted to the NBSP character automatically by Java's XML libraries (no 
explicit decision on my part).
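To illustrate that last point, a minimal sketch (plain JDK, no Solr; the class and element names are made up): the XML parser itself decodes a numeric NBSP entity into the literal U+00A0 character in the extracted text.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;

public class NbspDecode {
    // Parse the XML and return its text content; entity references such as
    // &#160; (NBSP, U+00A0) are decoded by the parser itself, with no
    // explicit handling on the caller's part.
    static String textContent(String xml) throws Exception {
        return DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)))
                .getDocumentElement()
                .getTextContent();
    }

    public static void main(String[] args) throws Exception {
        String text = textContent("<doc>foo&#160;bar</doc>");
        // The character between "foo" and "bar" is the literal NBSP, U+00A0.
        System.out.println(Integer.toHexString(text.charAt(3))); // prints a0
    }
}
```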

bq. Implicitly then, you're nixing ICUWhitespaceTokenizer, since it can't be in 
analyzers-common.

Right; ah well.

RE what to name the attribute:  I suggest "definition" or, even better, "rule" 
(or "ruleset").

I do think the first sentence of these whitespace tokenizers' docs should state 
which definition of whitespace is used.  And they should reference each other, 
so that anyone stumbling on one will know of the other.

RE WDF:  I prefer WhitespaceTokenizer with WDF not just for product-id data but 
also for full-text.  Full-text might contain product-ids, or words like 
"wi-fi", "thread-safe", or "co-worker" that are sometimes hyphenated, sometimes 
not; some of these might be space-separated; etc.  WDF is very flexible, but if 
you use a tokenizer like Standard* or Classic* then hyphens will already have 
been split on before WDF can do its thing, neutering part of its benefit.  I 
wish WDF kept payloads and other attributes; but it's not the only offender 
here, and likewise for the bigger issue of positionLength.  Otherwise I'm a WDF 
fan :-)  Nonetheless I like some of Jack's ideas on a better tokenizer that 
subsumes WDF.
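A contrived illustration of the pre-tokenizing problem (plain Java string splits standing in for tokenizers; no Lucene): a whitespace-only split keeps "wi-fi" intact as one token, so a downstream WDF-style filter still sees the hyphen and can split/catenate as configured, whereas a tokenizer that also breaks on '-' pre-splits it and WDF never sees the original compound.

```java
public class HyphenSplit {
    // Stand-in for WhitespaceTokenizer: break only on whitespace runs.
    static String[] whitespaceOnly(String text) {
        return text.split("\\s+");
    }

    // Stand-in for a tokenizer that also breaks on hyphens.
    static String[] whitespaceAndHyphen(String text) {
        return text.split("[\\s\\-]+");
    }

    public static void main(String[] args) {
        String text = "enable wi-fi mode";
        // "wi-fi" survives as one token for a downstream filter to work on...
        System.out.println(String.join("|", whitespaceOnly(text)));      // enable|wi-fi|mode
        // ...but a hyphen-splitting tokenizer has already discarded the compound.
        System.out.println(String.join("|", whitespaceAndHyphen(text))); // enable|wi|fi|mode
    }
}
```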

BTW, FWIW, if I had to write a WhitespaceTokenizer from scratch, I'd implement 
it as a bitset for characters below 65536 (that's 8KB of memory).  For the 
remainder I'd use an array that is scanned; though it appears there are no 
whitespace characters beyond 65536, judging by a table of these chars from a 
quick Google search.  Then a configurable definition loader could fill named 
whitespace rule sets, and it might be configurable to add or remove particular 
code points.  But no need to bother; Steve's impl is fine :-)
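The bitset idea above might look like this rough sketch (class and method names are hypothetical; a real version would seed the set from a named, configurable rule set rather than hard-coding it): 65,536 bits = 8KB, so the per-character check is a single bit test.

```java
import java.util.BitSet;

public class BitsetWhitespace {
    // One bit per BMP code point: 65,536 bits = 8KB.
    private static final BitSet WS = new BitSet(0x10000);
    static {
        // Seed from the JDK's definition...
        for (int c = 0; c < 0x10000; c++) {
            if (Character.isWhitespace(c)) WS.set(c);
        }
        // ...then add the non-breaking spaces that Character.isWhitespace
        // deliberately excludes (the subject of this issue).
        WS.set('\u00A0');
        WS.set('\u2007');
        WS.set('\u202F');
    }

    static boolean isTokenBreak(int codePoint) {
        // Supplementary code points would fall through to a scanned array,
        // but no Unicode whitespace characters appear to live outside the BMP.
        return codePoint < 0x10000 && WS.get(codePoint);
    }
}
```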

> WhitespaceTokenizer should tokenize on NBSP
> -------------------------------------------
>
>                 Key: LUCENE-6874
>                 URL: https://issues.apache.org/jira/browse/LUCENE-6874
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: modules/analysis
>            Reporter: David Smiley
>            Priority: Minor
>         Attachments: LUCENE-6874-jflex.patch, LUCENE-6874.patch
>
>
> WhitespaceTokenizer uses 
> [Character.isWhitespace|http://docs.oracle.com/javase/8/docs/api/java/lang/Character.html#isWhitespace-int-] 
> to decide what is whitespace.  Here's a pertinent excerpt:
> bq. It is a Unicode space character (SPACE_SEPARATOR, LINE_SEPARATOR, or 
> PARAGRAPH_SEPARATOR) but is not also a non-breaking space ('\u00A0', 
> '\u2007', '\u202F')
> Perhaps Character.isWhitespace should have been called 
> isLineBreakableWhitespace?
> I think WhitespaceTokenizer should tokenize on this.  I am aware it's easy to 
> work around but why leave this trap in by default?
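
The behavior the excerpt above describes is easy to verify at a JVM prompt: NBSP is a Unicode SPACE_SEPARATOR (so Character.isSpaceChar says true), yet Character.isWhitespace excludes the non-breaking spaces, which is why WhitespaceTokenizer keeps text on either side of an NBSP as one token.

```java
public class WhitespaceCheck {
    public static void main(String[] args) {
        System.out.println(Character.isWhitespace(' '));       // true
        // NBSP is a Unicode space character (category SPACE_SEPARATOR)...
        System.out.println(Character.isSpaceChar('\u00A0'));   // true
        // ...but the non-breaking spaces are excluded from isWhitespace.
        System.out.println(Character.isWhitespace('\u00A0'));  // false
        System.out.println(Character.isWhitespace('\u202F'));  // false
    }
}
```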



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
