[ 
https://issues.apache.org/jira/browse/LUCENE-1813?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12743937#action_12743937
 ] 

Paul Cowan commented on LUCENE-1813:
------------------------------------

OK, cool. I'm taking an interest in this purely because I have some ideas for 
other token filters which would do something similar, and really like the idea 
of tagging them in the same way just with different 'headers'. It would be 
really beneficial, I think, to come up with something that can be reused and, 
more importantly, combined (so different filters don't 'clash' with their 
output). What about making it 2 characters, at least? 

U+0001 START OF HEADER
U+xxxx whatever you like to indicate 'reversing' (e.g. an 'R', or just a 0-byte 
as this is the first purpose allocated, or whatever)

This adds 2 bytes to each term, not 1, but terms generally don't take up that 
much room in the scale of a whole index and I think it's worth the flexibility. 
Hell, if you're willing to use 3 (that IS starting to seem wasteful, I admit) 
then maybe

U+0001 START OF HEADER
U+xxxx whatever
U+0002 START OF TEXT

That's at least semantically meaningful. Other ideas, just looking at the ASCII 
control characters:

U+xxxx whatever
U+001F UNIT SEPARATOR

or

U+000E SHIFT OUT
U+xxxx whatever
U+000F SHIFT IN

I don't really mind, but it's always nice to plan ahead.
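
To make the two-character variant concrete, here is a rough sketch in plain Java. It is not taken from either attached patch and does not use any Lucene API; the class name TermHeader, its method names, and the choice of 'R' as the type character are all assumptions made for illustration. It shows a header of U+0001 followed by one type character, and how a leading-wildcard suffix could be turned into a prefix over the marked, reversed terms.

```java
/**
 * Illustrative sketch only: a two-character term header made of
 * U+0001 START OF HEADER followed by a single type character.
 * All names here are hypothetical, not part of Lucene.
 */
public final class TermHeader {

    /** U+0001 START OF HEADER, the marker character discussed above. */
    public static final char START_OF_HEADER = '\u0001';

    /** Assumed type character for reversed tokens; any agreed value would do. */
    public static final char TYPE_REVERSED = 'R';

    private TermHeader() {}

    /** Prepends the two-character header (START_OF_HEADER + type) to a term. */
    public static String mark(char type, String term) {
        return new StringBuilder(term.length() + 2)
                .append(START_OF_HEADER)
                .append(type)
                .append(term)
                .toString();
    }

    /** Returns true if the term starts with the header for the given type. */
    public static boolean hasType(String term, char type) {
        return term.length() >= 2
                && term.charAt(0) == START_OF_HEADER
                && term.charAt(1) == type;
    }

    /** Removes the two-character header if present, otherwise returns the term unchanged. */
    public static String strip(String term) {
        if (term.length() >= 2 && term.charAt(0) == START_OF_HEADER) {
            return term.substring(2);
        }
        return term;
    }

    /** Reverses a token and tags it as reversed. */
    public static String reverseAndMark(String token) {
        return mark(TYPE_REVERSED, new StringBuilder(token).reverse().toString());
    }

    /**
     * Turns the suffix of a leading-wildcard query (the "ing" in "*ing")
     * into a prefix that matches marked, reversed terms.
     */
    public static String leadingWildcardPrefix(String suffix) {
        return reverseAndMark(suffix);
    }
}
```

With this layout, a filter using a different type character produces terms that never share a prefix with the reversed ones, so the two outputs cannot clash, and a query for "*ing" becomes a plain prefix lookup on the marked form of "gni".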

> Add option to ReverseStringFilter to mark reversed tokens
> ---------------------------------------------------------
>
>                 Key: LUCENE-1813
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1813
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: contrib/analyzers
>    Affects Versions: 2.9
>            Reporter: Andrzej Bialecki 
>            Assignee: Robert Muir
>             Fix For: 2.9
>
>         Attachments: reverseMark-2.patch, reverseMark.patch
>
>
> This patch implements additional functionality in the filter to "mark" 
> reversed tokens with a special marker character (Unicode 0001). This is 
> useful when indexing both straight and reversed tokens (e.g. to implement 
> efficient leading wildcards search).

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


