[ 
https://issues.apache.org/jira/browse/LUCENE-1813?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12743926#action_12743926
 ] 

Paul Cowan commented on LUCENE-1813:
------------------------------------

Very minor thing, but would it make sense to choose a more suitable 
character? U+0001 is an assigned character with its own semantic meaning ("Start 
of Heading", same as ASCII character 0x01), which isn't really relevant to this 
use. It mightn't be a bad idea to (a) choose a control character which makes 
sense in context, if there is one (I can't see one, myself), (b) use a 
character from the private-use area (U+E000 to U+F8FF), or (c) my preferred 
option, use the Unicode tag characters. The tag characters are designed for 
just such a purpose: embedding contextual metadata in text fields. The general 
syntax for a tag is <TAG TYPE> followed by one or more <TAG CHARACTER>s. 
Unfortunately, only one tag type is defined in Unicode at present (language 
tag), which isn't suitable.
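
To make the three options a bit more concrete, here is a rough sketch (not part of the patch) that asks the JDK how it classifies one candidate from each group. The class name MarkerCandidates and the describe helper are made up for illustration; only java.lang.Character calls are used.

```java
import java.util.Locale;

public class MarkerCandidates {
    // Code points mentioned in the comment above.
    static final int START_OF_HEADING = 0x0001;   // current marker in the patch
    static final int PRIVATE_USE_FIRST = 0xE000;  // first private-use code point
    static final int TAG_CAPITAL_R = 0xE0052;     // TAG LATIN CAPITAL LETTER R

    // Hypothetical helper: reports the general category of a code point and
    // how many UTF-16 code units (Java chars) it occupies.
    static String describe(int codePoint) {
        String category;
        switch (Character.getType(codePoint)) {
            case Character.CONTROL:
                category = "control (Cc)";
                break;
            case Character.PRIVATE_USE:
                category = "private use (Co)";
                break;
            case Character.FORMAT:
                category = "format (Cf)";
                break;
            default:
                category = "other";
                break;
        }
        return String.format(Locale.ROOT, "U+%04X: %s, %d UTF-16 unit(s)",
                codePoint, category, Character.charCount(codePoint));
    }

    public static void main(String[] args) {
        System.out.println(describe(START_OF_HEADING));
        System.out.println(describe(PRIVATE_USE_FIRST));
        System.out.println(describe(TAG_CAPITAL_R));
    }
}
```

One practical difference this shows: the tag characters live outside the BMP, so in Java a tag marker takes two chars (a surrogate pair), whereas U+0001 and the private-use characters in U+E000 to U+F8FF each take one.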

That said, I think it makes sense (and is probably 'nicer') to pick one of the 
Unicode tag characters -- say, U+E0052 TAG LATIN CAPITAL LETTER R (for 
'reverse') and use that. This could lead to a de facto standard for Lucene 
fields, where different variations of the same token could use different 
leading tag characters. Rather than everyone just picking a character at 
random, this could lead to some sort of structure around similar situations 
(e.g. I could envisage a filter which uses U+E004E TAG LATIN CAPITAL LETTER N 
for a normalised version of the token, etc). 
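
As a rough illustration of what I mean (a standalone sketch, not the ReverseStringFilter code from the patch), marking could look something like the following. The class TaggedReverse and its method names are invented for this example; the only assumption taken from the issue is that the marker leads the reversed token.

```java
public class TaggedReverse {
    // U+E0052 TAG LATIN CAPITAL LETTER R, proposed marker for reversed tokens.
    // It is a supplementary character, so the String holds a surrogate pair.
    static final String REVERSED_TAG = new String(Character.toChars(0xE0052));

    // U+E004E TAG LATIN CAPITAL LETTER N, shown only to illustrate the
    // "one tag per variation" idea; nothing below uses it.
    static final String NORMALISED_TAG = new String(Character.toChars(0xE004E));

    // Reverse the token and put the tag in front of it.
    // StringBuilder.reverse() keeps surrogate pairs in the token intact.
    static String markReversed(String token) {
        return REVERSED_TAG + new StringBuilder(token).reverse();
    }

    // True if the term carries the leading "reversed" tag.
    static boolean isReversed(String term) {
        return term.startsWith(REVERSED_TAG);
    }

    // Strip the tag and undo the reversal; untagged terms come back unchanged.
    static String unmark(String term) {
        if (!isReversed(term)) {
            return term;
        }
        String body = term.substring(REVERSED_TAG.length());
        return new StringBuilder(body).reverse().toString();
    }

    public static void main(String[] args) {
        String marked = markReversed("lucene");
        System.out.println(isReversed(marked));   // tagged term
        System.out.println(isReversed("lucene")); // plain term
        System.out.println(unmark(marked));
    }
}
```

Because the tag is two Java chars long, anything that strips or tests the marker should work with REVERSED_TAG.length() (or code points) rather than assuming a single char, which is the main cost of this option compared to U+0001.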

Sorry, I'm really anal about Unicode. Can't help it.

> Add option to ReverseStringFilter to mark reversed tokens
> ---------------------------------------------------------
>
>                 Key: LUCENE-1813
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1813
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: contrib/analyzers
>    Affects Versions: 2.9
>            Reporter: Andrzej Bialecki 
>            Assignee: Robert Muir
>             Fix For: 2.9
>
>         Attachments: reverseMark-2.patch, reverseMark.patch
>
>
> This patch implements additional functionality in the filter to "mark" 
> reversed tokens with a special marker character (U+0001). This is 
> useful when indexing both straight and reversed tokens (e.g. to implement 
> efficient leading-wildcard search).

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


