[CLucene-dev] Proposed Changes to StandardTokenizer

Achyuth Pramod Thu, 23 Mar 2023 03:00:36 -0700

Dear Developers,

I am writing to ask for your help in reviewing a change we have made to
StandardTokenizer for our use case.
Specifically, we would like to know whether the change will behave as
intended without causing unintended side effects.
With Java Lucene 9.5, a text field containing "text&search" is tokenized
into two terms, with '&' treated as a delimiter:
1. text
2. search


In contrast, with CLucene 2.3.3.4 the same field is tokenized into a
single term:
1. text&search

As our use case requires the field to be split into two terms, we made the
following modification to StandardTokenizer.cpp.

In StandardTokenizer::ReadAlphaNum(const TCHAR prev, Token* t), the
case '&' branch (lines 278-280) was commented out.
After this change, the string above is tokenized into two terms
(text, search).
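
To make the intended behaviour concrete, here is a minimal standalone
sketch of the two tokenization modes. It is not CLucene code: tokenize()
and its ampersandJoins flag are hypothetical names used only for
illustration, and the real ReadAlphaNum logic may differ in its details.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper, NOT part of CLucene: splits `text` into runs of
// alphanumeric characters. When `ampersandJoins` is true, an '&' that sits
// between two alphanumeric characters is kept inside the token; when false,
// '&' acts as a delimiter like any other non-alphanumeric character.
std::vector<std::string> tokenize(const std::string& text, bool ampersandJoins) {
    std::vector<std::string> tokens;
    std::string current;
    for (std::size_t i = 0; i < text.size(); ++i) {
        const unsigned char ch = static_cast<unsigned char>(text[i]);
        const bool isWordChar = std::isalnum(ch) != 0;
        const bool nextIsWordChar =
            i + 1 < text.size() &&
            std::isalnum(static_cast<unsigned char>(text[i + 1])) != 0;
        const bool joins =
            ampersandJoins && ch == '&' && !current.empty() && nextIsWordChar;
        if (isWordChar || joins) {
            current.push_back(text[i]);
        } else if (!current.empty()) {
            tokens.push_back(current);  // a delimiter closes the current token
            current.clear();
        }
    }
    if (!current.empty()) {
        tokens.push_back(current);
    }
    return tokens;
}

// Checks the two behaviours described in this message.
void checkAmpersandHandling() {
    const std::vector<std::string> joined = tokenize("text&search", true);
    assert(joined.size() == 1 && joined[0] == "text&search");

    const std::vector<std::string> split = tokenize("text&search", false);
    assert(split.size() == 2 && split[0] == "text" && split[1] == "search");
}
```

In this sketch, treating '&' as a delimiter also splits a string such as
"AT&T" into "AT" and "T". Whether the same happens in CLucene after the
change is something we have not verified, so it may be worth checking.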

I would like to know whether this change is appropriate.

Please take some time to review the changes and let us know your thoughts.
If you have any concerns, suggestions, or questions, please do not hesitate
to reach out to me.
Thank you in advance for your help and expertise. We look forward to
hearing from you.

Best regards,
Achyuth Pramod

_______________________________________________
CLucene-developers mailing list
CLucene-developers@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/clucene-developers
