Re: Any Tokenizator friendly to C++, C#, .NET, etc ?

Robert Muir Thu, 20 Aug 2009 08:20:43 -0700

Valery, oh I think there might be other ways to solve this.

But you provided some examples such as C/C++ and SAP R/3.
In these two examples you want the "/" to behave differently depending
upon context, so my first thought was that a grammar might be a good
way to ensure it does what you want.


On Thu, Aug 20, 2009 at 11:09 AM, Valery<[email protected]> wrote:
>
> Hi Robert,
>
> thanks for the hint.
>
> Indeed, a natural way to go. Especially if one builds a Tokenizer of the
> level of quality like StandardTokenizer's.
>
> OTOH, you mean that the out-of-the-box stuff is indeed not customizable for
> this task?..
>
> regards
> Valery
>
>
>
> Robert Muir wrote:
>>
>> Valery,
>>
>> One thing you could try would be to create a JFlex-based tokenizer,
>> specifying a grammar with the rules you want.
>> You could use the source code & grammar of StandardTokenizer as a
>> starting point.
>>
>>
>> On Thu, Aug 20, 2009 at 10:28 AM, Valery<[email protected]> wrote:
>>>
>>> Hi all,
>>>
>>> I am trying to tune Lucene to respect such tokens like C++, C#, .NET
>>>
>>> The task is known for Lucene community, but surprisingly I can't google
>>> out
>>> somewhat good info on it.
>>>
>>> Of course, I tried to re-use Lucene's  building blocks for Tokenizer.
>>> Here
>>> we go:
>>>
>>>  1) StandardTokenizer -- oh, this option would be just fantastic, but
>>> "C++,
>>> C#, .NET" ends up with "c c net". Too bad.
>>>
>>>  2) WhitespaceTokenizer gives me a lot of lexems that are actually should
>>> have been chopped into smaller pieces. Example: "C/C++" comes out like a
>>> single lexem. If I follow this way I end-up with "Tokenization of tokens"
>>> --
>>> that sounds a bit odd, doesn't it?
>>>
>>>  3) CharTokenizer allows me to add the '/' to be also a token-emitting
>>> char, but then '/' gets immediately lost like those whitespace chars. In
>>> result "SAP R/3" ends up with "SAP" "R" "3" and one will need to search
>>> the
>>> original char stream for the "/" char to re-build "SAP R/3" term as a
>>> whole.
>>>
>>> Do you see any other relevant building blocks missed by me?
>>>
>>> Also, people around there have meant that such problem should be solved
>>> by a
>>> synonym dictionary. However this hint sheds no light on which
>>> tokenization
>>> strategy should be more appropriate *before* the synonym step.
>>>
>>> So, it looks like I have to take the class CharTokenizer as for the
>>> starting
>>> point and write anew my own Tokenizer. This Tokenizer should also react
>>> on
>>> delimiting characters and emit the token. However, it should distinguish
>>> between delimiters like whitespaces along with ";,?" and the delimiters
>>> like
>>> "./&".
>>>
>>> Indeed, the delimiters like whitespaces and ";,?" should be thrown away
>>> from
>>> Lexem level,
>>> whereas the token emitting characters like "./&" should be kept in Lexem
>>> level.
>>>
>>> Your comments, gurus?
>>>
>>> regards,
>>> Valery
>>>
>>> --
>>> View this message in context:
>>> http://www.nabble.com/Any-Tokenizator-friendly-to-C%2B%2B%2C-C-%2C-.NET%2C-etc---tp25063175p25063175.html
>>> Sent from the Lucene - Java Users mailing list archive at Nabble.com.
>>>
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: [email protected]
>>> For additional commands, e-mail: [email protected]
>>>
>>>
>>
>>
>>
>> --
>> Robert Muir
>> [email protected]
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: [email protected]
>> For additional commands, e-mail: [email protected]
>>
>>
>>
>
> --
> View this message in context: 
> http://www.nabble.com/Any-Tokenizator-friendly-to-C%2B%2B%2C-C-%2C-.NET%2C-etc---tp25063175p25063964.html
> Sent from the Lucene - Java Users mailing list archive at Nabble.com.
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [email protected]
> For additional commands, e-mail: [email protected]
>
>



-- 
Robert Muir
[email protected]

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: Any Tokenizator friendly to C++, C#, .NET, etc ?

Reply via email to