Undo hyphenation when indexing

Wulf Berschin Fri, 01 Apr 2011 08:50:57 -0700

Hi,

for indexing PDF files we have to undo word hyphenation. The basic idea is simply to remove the hyphen when a new line and a lowercase letter follow. Of course this approach isn't 100% foolproof, but checking against a dictionary wouldn't be either...
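
To make the rule concrete, here is a minimal sketch of it as a plain string preprocessing step, applied before the text reaches the analyzer. The class name Dehyphenator and the method undoHyphenation are hypothetical names chosen for illustration; they are not part of Lucene. The sketch also assumes that indentation (spaces or tabs) after the line break should be skipped, which the rule above does not state.

```java
import java.util.regex.Pattern;

/** Hypothetical helper for illustration only; not a Lucene class. */
public final class Dehyphenator {

    // Matches a hyphen, a line break (LF or CRLF) and optional indentation
    // (assumption: leading spaces/tabs on the next line are skipped).
    // The lookahead requires a lowercase letter but does not consume it.
    private static final Pattern LINE_END_HYPHEN =
            Pattern.compile("-\\r?\\n[ \\t]*(?=\\p{Ll})");

    private Dehyphenator() {
    }

    /** Joins words that were split by a hyphen at the end of a line. */
    public static String undoHyphenation(String text) {
        return LINE_END_HYPHEN.matcher(text).replaceAll("");
    }
}
```

With this, Dehyphenator.undoHyphenation("hyphen-\nation") yields "hyphenation", while a hyphen followed by a line break and an uppercase letter is left untouched. As noted, the rule is not foolproof: a genuine compound such as "well-\nknown" would be joined as well.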

Since we face this problem too when highlighting using HTMLCharStripper (yes, we do have hyphenation in our HTML docs...), it seems to me I have to adjust the JFlex-generated StandardTokenizerImpl.


Is this the right approach, and how would I have to modify this script?

Thanks
Wulf

PS: I see that there are changes in the brand-new 3.1.0 version; we are using 3.0.3, but as far as I understand there are no relevant changes in this respect.



