Re: question on custom filter

Robert Muir Mon, 20 Jul 2009 12:50:01 -0700

Obender, does the following text appear like the image in the link, or not?


שומר אחי

http://farm1.static.flickr.com/3/10445435_75b4546703.jpg?v=0
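
Independently of how the text renders on your screen, you can check the order in which it is actually stored. Below is a rough sketch in plain Java, assuming nothing about your setup. It does not use Lucene: the class LogicalOrderCheck and its methods are made-up names, and the whitespace split is only a stand-in for WhitespaceTokenizer. It prints each token as code points together with its char offsets, so the result does not depend on your display.

```java
import java.util.ArrayList;
import java.util.List;

public class LogicalOrderCheck {

    // Hypothetical stand-in for a whitespace tokenizer: walks the string in
    // storage (logical) order and returns "(code points,start,end)" per token.
    static List<String> tokens(String text) {
        List<String> out = new ArrayList<String>();
        int start = -1; // -1 means "not currently inside a token"
        for (int i = 0; i <= text.length(); i++) {
            boolean boundary = i == text.length()
                    || Character.isWhitespace(text.charAt(i));
            if (!boundary && start < 0) {
                start = i; // first char of a new token
            } else if (boundary && start >= 0) {
                out.add("(" + codePoints(text.substring(start, i))
                        + "," + start + "," + i + ")");
                start = -1;
            }
        }
        return out;
    }

    // Formats a string as space-separated U+XXXX values, in storage order.
    static String codePoints(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i += Character.charCount(s.codePointAt(i))) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("U+%04X", s.codePointAt(i)));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // The phrase from this thread, written with escapes so that the
        // source file encoding and the editor's rendering cannot interfere.
        String text = "\u05D8\u05D5\u05B9\u05D1 \u05E2\u05B6\u05E8\u05B6\u05D1";
        for (String token : tokens(text)) {
            System.out.println(token);
        }
    }
}
```

If the first line printed starts with U+05D8 (tet) and carries offsets 0,4, the data is in logical order and only the display differs.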


On Mon, Jul 20, 2009 at 3:34 PM, OBender<[email protected]> wrote:
> I've checked, and it appears to be enabled.
>
> -----Original Message-----
> From: Robert Muir [mailto:[email protected]]
> Sent: Monday, July 20, 2009 3:18 PM
> To: [email protected]
> Subject: Re: question on custom filter
>
> Obender, based on your previous comments (that you see text displayed
> in the wrong order), I again recommend that you enable support for RTL
> languages in your operating system, as I mentioned earlier. If you are
> using a Windows-based OS, this is not enabled by default!
>
> I think you are seeing things in the incorrect order, and this is
> causing confusion for you!
>
> On Mon, Jul 20, 2009 at 3:02 PM, Robert Muir<[email protected]> wrote:
>> Obender, I ran your code and it did what I expected (but not what you
>> pasted):
>>
>> First token is: (טוֹב,0,4)
>> Second token is: (עֶרֶב,5,10)
>>
>> I also loaded up your SimpleWhitespaceAnalyzer in Luke, with the same 
>> results.
>>
>> On Mon, Jul 20, 2009 at 2:53 PM, OBender<[email protected]> wrote:
>>> Here is the simple code. If you run it with English and with Hebrew, you
>>> will see that with English the tokens are returned from the left of the
>>> phrase to the right, and with Hebrew from the right to the left.
>>>
>>> Again, I'm talking about tokens here, not the individual letters.
>>>
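>>> import java.io.IOException;
>>>
>>> import org.apache.lucene.analysis.Token;
>>> import org.apache.lucene.analysis.TokenFilter;
>>> import org.apache.lucene.analysis.TokenStream;
>>>
>>> // Pass-through filter that prints each token it receives.
>>> public class XFilter extends TokenFilter
>>> {
>>>        protected XFilter( TokenStream tokenStream ) {
>>>                super( tokenStream );
>>>        }
>>>
>>>        @Override
>>>        public Token next( final Token reusableToken ) throws IOException
>>>        {
>>>                Token nextToken = input.next( reusableToken );
>>>                // Print the token, or an empty line at end of stream.
>>>                System.out.println( nextToken != null ? nextToken : "" );
>>>                return nextToken;
>>>        }
>>> }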
>>>
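>>> import java.io.Reader;
>>>
>>> import org.apache.lucene.analysis.Analyzer;
>>> import org.apache.lucene.analysis.TokenStream;
>>> import org.apache.lucene.analysis.WhitespaceTokenizer;
>>>
>>> // Splits on whitespace, then passes the tokens through XFilter.
>>> public class SimpleWhitespaceAnalyzer extends Analyzer
>>> {
>>>        @Override
>>>        public TokenStream tokenStream( final String fieldName, final Reader reader )
>>>        {
>>>                TokenStream ts = new WhitespaceTokenizer( reader );
>>>                ts = new XFilter( ts );
>>>
>>>                return ts;
>>>        }
>>> }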
>>>
>>> -----Original Message-----
>>> From: Robert Muir [mailto:[email protected]]
>>> Sent: Monday, July 20, 2009 2:26 PM
>>> To: [email protected]
>>> Subject: Re: question on custom filter
>>>
>>> Obender, I think something in your environment / display environment
>>> might be causing some confusion.
>>>
>>> Are you using Microsoft Windows? If so, please verify that support for
>>> right-to-left languages is enabled [control panel/regional and
>>> language options]. It is possible you are "seeing something different"
>>> because your rendering system is not actually rendering right-to-left
>>> text in right-to-left direction!
>>>
>>> Second, instead of using a debugger, I would recommend using Luke to
>>> look at the resulting tokens from your analyzer.
>>>
>>> On Mon, Jul 20, 2009 at 2:21 PM, OBender<[email protected]> wrote:
>>>> This is how it should be written:
>>>> http://unicode.org/cldr/utility/transform.jsp?a=name&b=%D7%A2%D6%B6%D7%A8%D6%B6%D7%91+%D7%98%D7%95%D6%B9%D7%91
>>>>
>>>> -----Original Message-----
>>>> From: Robert Muir [mailto:[email protected]]
>>>> Sent: Monday, July 20, 2009 2:07 PM
>>>> To: [email protected]
>>>> Subject: Re: question on custom filter
>>>>
>>>> Obender, this is not true.
>>>> The text you pasted is the following in Unicode:
>>>>
>>>> \N{HEBREW LETTER TET}
>>>> \N{HEBREW LETTER VAV}
>>>> \N{HEBREW POINT HOLAM}
>>>> \N{HEBREW LETTER BET}
>>>> \N{SPACE}
>>>> \N{HEBREW LETTER AYIN}
>>>> \N{HEBREW POINT SEGOL}
>>>> \N{HEBREW LETTER RESH}
>>>> \N{HEBREW POINT SEGOL}
>>>> \N{HEBREW LETTER BET}
>>>>
>>>> you can use this utility to see how your text is encoded:
>>>> http://unicode.org/cldr/utility/transform.jsp?a=name&b=%D7%98%D7%95%D6%B9%D7%91+%D7%A2%D6%B6%D7%A8%D6%B6%D7%91
>>>>
>>>> For more information on directionality in Unicode, see
>>>> http://unicode.org/reports/tr9/
>>>>
>>>> On Mon, Jul 20, 2009 at 1:59 PM, OBender<[email protected]> wrote:
>>>>> Robert,
>>>>>
>>>>> I'm not sure you are correct on this one.
>>>>>
>>>>> If I have a Hebrew phrase:
>>>>> [טוֹב עֶרֶב]
>>>>> Then the first token that the filter receives is:
>>>>> [עֶרֶב] (0,5)
>>>>> and the second is:
>>>>> [טוֹב] (6,10)
>>>>> Which means that it counts from right to left (words and indexes).
>>>>>
>>>>> Am I missing something?
>>>>>
>>>>> -----Original Message-----
>>>>> From: Robert Muir [mailto:[email protected]]
>>>>> Sent: Monday, July 20, 2009 1:43 PM
>>>>> To: [email protected]
>>>>> Subject: Re: question on custom filter
>>>>>
>>>>> Obender, I don't think it's as difficult as you think. Your filter does
>>>>> not need to be aware of this issue at all.
>>>>>
>>>>> In Unicode, right-to-left languages are encoded in the data in logical
>>>>> order. The rendering system is what converts it to display in
>>>>> right-to-left for RTL languages.
>>>>>
>>>>> For example in Arabic, "Robert 1234" displays as روبرت 1234
>>>>> To your computer monitor, this looks like 1, 2, 3, 4, space, teh, reh,
>>>>> beh, waw, reh
>>>>>
>>>>> But the Unicode text is reh, waw, beh, reh, teh, space, 1, 2, 3, 4.
>>>>>
>>>>> 2009/7/20 OBender <[email protected]>:
>>>>>> Hi All!
>>>>>>
>>>>>>
>>>>>>
>>>>>> Let's say I have a filter that produces new tokens based on the original
>>>>>> ones.
>>>>>>
>>>>>> How bad will it be if my filter sets the start of each token to 0 and the
>>>>>> end to the length of the token?
>>>>>>
>>>>>> An example (based on the phrase "How are you?"):
>>>>>>
>>>>>>
>>>>>>
>>>>>> Original token:
>>>>>>
>>>>>> [you?] (8,12)
>>>>>>
>>>>>>
>>>>>>
>>>>>> New tokens:
>>>>>>
>>>>>> [you] (0,3)
>>>>>>
>>>>>> [?] (0,1)
>>>>>>
>>>>>>
>>>>>>
>>>>>> It wouldn't be so hard to calculate the right numbers for left-to-right
>>>>>> languages, and it is a bit more challenging to do it for right-to-left
>>>>>> ones, but for mixed text it is quite hard.
>>>>>>
>>>>>>
>>>>>>
>>>>>> Thanks.
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Robert Muir
>>>>> [email protected]
>>>>>
>>>>> ---------------------------------------------------------------------
>>>>> To unsubscribe, e-mail: [email protected]
>>>>> For additional commands, e-mail: [email protected]
>>>>
>>>>
>>>>
>>>> --
>>>> Robert Muir
>>>> [email protected]
>>>
>>>
>>>
>>> --
>>> Robert Muir
>>> [email protected]
>>
>>
>>
>> --
>> Robert Muir
>> [email protected]
>>
>
>
>
> --
> Robert Muir
> [email protected]



-- 
Robert Muir
[email protected]

