On 27/06/2012 15:26, Reinier Olislagers wrote:
> Splitting a string with a sentence into words shouldn't be hard, should it?
>
> I've adapted the UTF8 examples on the wiki article to this: [1]
>
> I'm assuming this must have been done countless times before.
> 1) Would somebody know a good solution which I can use? E.g. in the
> syntaxhighlighter etc, but I'm a bit lazy/afraid to start looking ;)
> 2) There are probably things I've missed or done incorrectly below. For
> my edifications, hints and tips more than welcome.
>
> Thanks,
> Reinier
>
> [1]
> uses
> ...
> LazUTF8
> ...
> procedure Sentence2Words(const Sentence: UTF8String; Words: TStringList);
> // Splits words on spaces and lower ASCII characters including #10,#13 into strings
> // It will collapse/ignore multiple spaces
> // Expects UTF8 sentences; generates UTF8 words
> var
>   p: PChar;
>   CharacterIndex: integer;
>   CharLen: integer;
>   FirstCharacter: integer;
>   FoundSpace: boolean;
> begin
>   Words.Clear;
>   // Indicate we start on character 1:
>   CharacterIndex:=1;
>   FirstCharacter:=1;
>
>   FoundSpace:=false;
>   p:=PChar(Sentence);
>   repeat
>     CharLen := UTF8CharacterLength(p);
>     //todo: find out other UTF8 word delimiters!
>     case Ord(P[0]) of
>       0..32:
>         // Skip double spaces...
>         if not(FoundSpace) then
>         begin
>           // Add what we have up to now
>           FoundSpace:=true;
>           Words.Add(UTF8Copy(Sentence,FirstCharacter,CharacterIndex-FirstCharacter));
>           FirstCharacter:=CharacterIndex+1; //skip this space character
>         end;
>       else FoundSpace:=false;
>     end;
>     inc(p,CharLen); //Take # bytes into account
>     CharacterIndex:=CharacterIndex+1;
>   until (CharLen=0) or (p^ = #0);
>   // Last word if it didn't end with a space:
>   if not(FoundSpace) then
>     Words.Add(UTF8Copy(Sentence,FirstCharacter,CharacterIndex));
> end;
>
TStringList.Delimiter (with careful use of .StrictDelimiter) is your friend :)

L.

--
_______________________________________________
Lazarus mailing list
[email protected]
http://lists.lazarus.freepascal.org/mailman/listinfo/lazarus
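
For anyone finding this thread later, here is a minimal sketch of what the TStringList.Delimiter suggestion might look like. It is not code from either poster; the program name and the sample sentence are made up for illustration. It assumes that splitting on the space character alone is enough (so #10/#13 are NOT treated as separators here, unlike in the quoted routine), and it drops empty entries afterwards so that runs of spaces collapse. Since the byte $20 never occurs inside a multi-byte UTF-8 sequence, splitting on it byte-wise should be safe for UTF-8 text.

```pascal
program SplitWordsSketch; // hypothetical name, for illustration only
{$mode objfpc}{$H+}

uses
  Classes;

var
  Words: TStringList;
  i: Integer;
begin
  Words := TStringList.Create;
  try
    // Split on the space character only.
    Words.Delimiter := ' ';
    Words.StrictDelimiter := True;
    // Assigning DelimitedText performs the split.
    Words.DelimitedText := 'Splitting  a sentence'; // sample input with a double space

    // Remove any empty entries left behind by consecutive spaces.
    // Walk backwards so deleting does not shift the indexes still to visit.
    for i := Words.Count - 1 downto 0 do
      if Words[i] = '' then
        Words.Delete(i);

    for i := 0 to Words.Count - 1 do
      WriteLn(Words[i]);
  finally
    Words.Free;
  end;
end.
```

One thing to watch, and probably part of what "careful use" hints at: DelimitedText also interprets QuoteChar, so a sentence containing double quotes may not split the way you expect. Test with your real input before relying on it.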
