I have a long text ostensibly in utf-8, and I would like to get rid of all the lines that contain anything BUT kanji, katakana or hiragana (thus, throwing away Latin, but also digits, punctuation, etc.)
In short, I would like to do something like:
perl -ne 'if (/[^\p{Hiragana}\p{Katakana}\p{Kanji}]/){print}' webcorpus.tok > webcorpus.clean.tok
Is it possible to do something like that?
The current implementation (at least in v5.8.5; I have not had time to upgrade yet, so I don't know the status in v5.8.6) has limitations on nesting character classes inside "[...]" character classes. From "perldoc perlunicode":
· Character classes in regular expressions match characters instead of bytes and match against the character properties specified in the Unicode properties database. "\w" can be used to match a Japanese ideograph, for instance.
(However, and as a limitation of the current implementation, using "\w" or "\W" inside a "[...]" character class will still match with byte semantics.)
That means that in v5.8.5 this does not work (note that Perl has no \p{Kanji} property; the Unicode script that covers kanji is \p{Han}):
perl -CSD -ne 'print if /^[\p{Hiragana}\p{Katakana}\p{Han}]+$/' f > f-clean.tok
but replacing the [...] class with a group (?:...) does work:
perl -CSD -ne 'print if /^(?:\p{Hiragana}|\p{Katakana}|\p{Han})+$/' f > f-clean.tok
-- Paul Bijnens, Xplanation