Buffer size for SegmentingTokenizerBase

Wang, Guan Fri, 18 Mar 2022 08:53:55 -0700

Hi,

Could someone explain why the class SegmentingTokenizerBase uses a buffer of 
only 1024 characters? A comment in the source code mentions that a token may 
be truncated if no safe stopping index can be found among the chars currently 
in the buffer.


It doesn't seem reasonable to assume that a sentence is never longer than 1024 
characters, or that a safe stopping point, such as a newline, can always be 
found within a sentence.
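
To make the concern concrete, here is a minimal Java sketch of the behaviour as 
I understand it from that comment. It is not the actual Lucene code: the class 
SafeEndSketch, the helpers isSafeEnd and usableLength, and the choice of '\n' 
and '\r' as the only safe stoppers are my own assumptions for illustration.

```java
import java.util.Arrays;

// Hypothetical sketch for illustration only; not Lucene's implementation.
final class SafeEndSketch {
    // Buffer size mentioned in the question.
    static final int BUFFER_MAX = 1024;

    // Assumption: only line breaks count as safe stopping points.
    static boolean isSafeEnd(char ch) {
        return ch == '\n' || ch == '\r';
    }

    // Returns how many chars of the filled buffer can be processed safely:
    // everything up to and including the last safe stopper. If there is no
    // safe stopper, the whole buffer is returned, so a sentence or token
    // that continues past the buffer boundary may be cut there.
    static int usableLength(char[] buffer, int length) {
        for (int i = length - 1; i >= 0; i--) {
            if (isSafeEnd(buffer[i])) {
                return i + 1;
            }
        }
        return length; // no safe stopper found: possible truncation
    }

    public static void main(String[] args) {
        // A full buffer with no line break: nothing marks a safe stop.
        char[] noBreak = new char[BUFFER_MAX];
        Arrays.fill(noBreak, 'a');
        System.out.println(usableLength(noBreak, noBreak.length));

        // A buffer containing a line break: stop right after it.
        char[] withBreak = "first line\nsecond li".toCharArray();
        System.out.println(usableLength(withBreak, withBreak.length));
    }
}
```

In the first case the whole 1024-char buffer is treated as usable even though 
the text may continue, which is the situation I am worried about.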

Thanks,

Guan

