[ https://issues.apache.org/jira/browse/LUCENE-1813?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12743926#action_12743926 ]
Paul Cowan commented on LUCENE-1813:
------------------------------------

Very, very minor thing, but would it make more sense to choose a more suitable character? U+0001 is an assigned character with its own semantic meaning ("Start of Heading", same as ASCII character 0x01), which isn't really relevant to this use.

It mightn't be a bad idea to:
(a) choose a control character which makes sense in context, if there is one (I can't see one, myself);
(b) use a character from the private-use area (U+E000 to U+F8FF); or
(c) my preferred option, use the Unicode tag characters.

The tag characters are designed for just such a purpose: embedding contextual metadata in text fields. The general syntax for a tag is <TAG TYPE> followed by one or more <TAG CHARACTER>s. Unfortunately, only one tag type is defined in Unicode at present (the language tag), which isn't suitable here.

That said, I think it makes sense (and is probably 'nicer') to pick one of the Unicode tag characters, say U+E0052 TAG LATIN CAPITAL LETTER R (for 'reverse'), and use that. This could lead to a de facto standard for Lucene fields, where different variations of the same token use different leading tag characters. Rather than everyone picking a character at random, this could give some structure to similar situations (e.g. I could envisage a filter which uses U+E004E TAG LATIN CAPITAL LETTER N for a normalised version of the token, etc.).

Sorry, I'm really anal about Unicode. Can't help it.

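
For comparison, here is a minimal Java sketch of how the three kinds of candidate marker look from the JDK's point of view. It is only an illustration, not code from the patch: the class name MarkerCandidates and its methods are made up for this example. One practical consequence it shows is that the tag characters lie outside the Basic Multilingual Plane, so in a Java String they take two chars (a surrogate pair) where U+0001 or a private-use character takes one.

```java
// Illustrative sketch only; MarkerCandidates is a hypothetical name,
// not part of Lucene or of the attached patches.
final class MarkerCandidates {
    static final int START_OF_HEADING = 0x0001;   // marker used by the patch
    static final int PRIVATE_USE_FIRST = 0xE000;  // first private-use code point
    static final int TAG_CAPITAL_R = 0xE0052;     // TAG LATIN CAPITAL LETTER R

    /** Returns a short label for the Unicode general category of a code point. */
    static String category(int codePoint) {
        switch (Character.getType(codePoint)) {
            case Character.CONTROL:     return "control";
            case Character.PRIVATE_USE: return "private use";
            case Character.FORMAT:      return "format";
            default:                    return "other";
        }
    }

    /** Prepends the marker code point to a token, whether or not it is in the BMP. */
    static String mark(int marker, String token) {
        return new StringBuilder().appendCodePoint(marker).append(token).toString();
    }

    public static void main(String[] args) {
        int[] candidates = { START_OF_HEADING, PRIVATE_USE_FIRST, TAG_CAPITAL_R };
        for (int cp : candidates) {
            String marked = mark(cp, "gnitset"); // "testing", already reversed
            System.out.printf("U+%04X %-11s chars=%d codePoints=%d%n",
                    cp, category(cp), marked.length(),
                    marked.codePointCount(0, marked.length()));
        }
    }
}
```

Any code that strips or tests for the marker would therefore need to work with code points (codePointAt, Character.charCount) rather than assuming the marker is a single char.
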
> Add option to ReverseStringFilter to mark reversed tokens
> ---------------------------------------------------------
>
>                 Key: LUCENE-1813
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1813
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: contrib/analyzers
>    Affects Versions: 2.9
>            Reporter: Andrzej Bialecki
>            Assignee: Robert Muir
>             Fix For: 2.9
>
>         Attachments: reverseMark-2.patch, reverseMark.patch
>
>
> This patch implements additional functionality in the filter to "mark"
> reversed tokens with a special marker character (Unicode 0001). This is
> useful when indexing both straight and reversed tokens (e.g. to implement
> efficient leading wildcards search).
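
The marking idea in the issue description can be sketched roughly as follows. This is a standalone illustration under stated assumptions, not the patch itself and not Lucene's API: the class ReverseMarkSketch, its methods, and the use of a TreeSet as a stand-in for a sorted term dictionary are all hypothetical. It indexes each token both as-is and as marker + reversed token, then answers a leading-wildcard pattern such as "*ing" with a prefix scan over the marked terms.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.SortedSet;
import java.util.TreeSet;

// Illustrative sketch only; all names here are hypothetical.
final class ReverseMarkSketch {
    static final char MARKER = '\u0001'; // the marker named in the issue description

    /** Reverses a token and prepends the marker so it cannot collide with a straight token. */
    static String reverseAndMark(String token) {
        return MARKER + new StringBuilder(token).reverse().toString();
    }

    /** Builds a toy sorted term dictionary holding both straight and marked reversed forms. */
    static SortedSet<String> index(String... tokens) {
        SortedSet<String> terms = new TreeSet<>();
        for (String t : tokens) {
            terms.add(t);
            terms.add(reverseAndMark(t));
        }
        return terms;
    }

    /** Answers a pattern of the form "*suffix" by prefix-scanning the marked reversed terms. */
    static List<String> leadingWildcard(SortedSet<String> terms, String pattern) {
        if (pattern.length() < 2 || pattern.charAt(0) != '*') {
            throw new IllegalArgumentException("expected a pattern like *suffix");
        }
        String prefix = reverseAndMark(pattern.substring(1));
        List<String> hits = new ArrayList<>();
        for (String term : terms.tailSet(prefix)) {
            if (!term.startsWith(prefix)) {
                break; // sorted order: no later term can share the prefix
            }
            // Drop the marker and reverse again to recover the original token.
            hits.add(new StringBuilder(term.substring(1)).reverse().toString());
        }
        return hits;
    }

    public static void main(String[] args) {
        SortedSet<String> terms = index("testing", "running", "tester");
        System.out.println(leadingWildcard(terms, "*ing"));
    }
}
```

Because the marked terms all share the same first character, they sort together in the term dictionary, which is what turns the leading wildcard into an ordinary prefix lookup.
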