[ https://issues.apache.org/jira/browse/LUCENE-1813?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12743926#action_12743926 ]
Paul Cowan commented on LUCENE-1813:
------------------------------------
A very minor point, but would it make more sense to choose a more suitable
character? U+0001 is an assigned character with its own semantic meaning
("Start of Heading", the same as ASCII character 0x01), which isn't really
relevant to this use. It might be better to (a) choose a control character
which makes sense in context, if there is one (I can't see one, myself),
(b) use a character from the private-use area (U+E000 to U+F8FF), or (c) my
preferred option, use the Unicode tag characters. The tag characters are
designed for just such a purpose: embedding contextual metadata in text
fields. The general syntax for a tag is <TAG TYPE> followed by one or more
<TAG CHARACTER>s. Unfortunately, only one tag type is defined in Unicode at
present (the language tag), which isn't suitable here.
That said, I think it makes sense (and is probably 'nicer') to pick one of the
Unicode tag characters, say U+E0052 TAG LATIN CAPITAL LETTER R (for
'reverse'), and use that. This could lead to a de facto standard for Lucene
fields, where different variations of the same token use different leading
tag characters. Rather than everyone picking a character at random, this
would give some structure to similar situations (e.g. I could envisage a
filter which uses U+E004E TAG LATIN CAPITAL LETTER N for a normalised version
of the token).
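
To make the suggestion concrete, here is a rough sketch of what marking with a
tag character might look like. It is not taken from the attached patches;
the class and method names (TagMarker, markReversed, isMarkedReversed) are
hypothetical. One thing it illustrates: U+E0052 lies outside the Basic
Multilingual Plane, so in Java it occupies two chars (a surrogate pair) and
has to be handled as a code point rather than as a single char.

```java
// Hypothetical helper, not part of Lucene or of the attached patches.
final class TagMarker {
    // Assumed marker: U+E0052 TAG LATIN CAPITAL LETTER R ('R' for "reverse").
    static final int REVERSE_MARKER = 0xE0052;

    // Reverses the token and prepends the marker code point.
    static String markReversed(String token) {
        StringBuilder sb = new StringBuilder(token.length() + 2);
        // appendCodePoint writes the surrogate pair for a supplementary code point.
        sb.appendCodePoint(REVERSE_MARKER);
        // StringBuilder.reverse() keeps surrogate pairs within the token intact.
        sb.append(new StringBuilder(token).reverse());
        return sb.toString();
    }

    // True if the term starts with the reverse marker.
    static boolean isMarkedReversed(String term) {
        return !term.isEmpty() && term.codePointAt(0) == REVERSE_MARKER;
    }

    public static void main(String[] args) {
        String marked = markReversed("lucene");
        // The marker takes two UTF-16 code units, so the length is 2 + 6.
        System.out.println(marked.length());
        System.out.println(isMarkedReversed(marked));
        System.out.println(isMarkedReversed("lucene"));
    }
}
```

A filter that keeps its term in a char[] buffer would need to reserve two
slots for the marker instead of one, which is the main practical difference
from using U+0001.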
Sorry, I'm really anal about Unicode. Can't help it.
> Add option to ReverseStringFilter to mark reversed tokens
> ---------------------------------------------------------
>
> Key: LUCENE-1813
> URL: https://issues.apache.org/jira/browse/LUCENE-1813
> Project: Lucene - Java
> Issue Type: Improvement
> Components: contrib/analyzers
> Affects Versions: 2.9
> Reporter: Andrzej Bialecki
> Assignee: Robert Muir
> Fix For: 2.9
>
> Attachments: reverseMark-2.patch, reverseMark.patch
>
>
> This patch implements additional functionality in the filter to "mark"
> reversed tokens with a special marker character (Unicode 0001). This is
> useful when indexing both straight and reversed tokens (e.g. to implement
> efficient leading wildcards search).