On Thursday, 10 September 2009 11:17:45, Knittl wrote:
> the script could then just as well take care of sort and uniq itself ;)
I wrote it to generate training lists for myself automatically,
so I do like to fall back on the shell for that.
Then again...
$ date
Do 10. Sep 11:19:17 CEST 2009
$ hg ci -m 'Added "--unique" and imrpoved the docstring.'
$ hg ci -m "--unique now also sorts."
$ date
Do 10. Sep 11:29:13 CEST 2009
$ time ./wordfilter.py --unique --length 1 *txt
...
real 0m2.182s
user 0m1.970s
sys 0m0.060s
(a small personal speed test; the result is attached)
> and perl is somehow the cliché language par excellence for such
> tasks, that's why I mentioned it.
For me that's Python :)
> by the way, I'm still grepping my irc logs; that has been running for an hour
> already *g* maybe I should have used wc -l, I can well imagine that
> with such amounts of data it does make a difference whether every character
> and word has to be counted or only the line endings.
Just try it with my script. By the way, the slow part is
probably grep.
$ time grep -iowh '[uiaeosnrtdy]\+' *txt > /dev/null
real 0m36.706s
user 0m25.630s
sys 0m0.010s
$ time grep -iowh '[uiaeosnrtdy]\+' *txt | sort -u | wc
1492 1492 9769
real 0m28.982s
user 0m25.590s
sys 0m0.010s
Passing the output on via stdin is even faster than writing to /dev/null!
(The texts are from Project Gutenberg)
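For comparison, here is a rough Python sketch of what the grep pipeline above does: collect all words which consist only of the given letters, ignoring case, and keep each one once in sorted order. The function name unique_words is made up for this example and is not part of wordfilter.py. It is only an approximation: Python's \b and grep's -w do not define word boundaries identically, and sorted() orders by code point rather than by locale like sort.

```python
import re


def unique_words(text, letters="uiaeosnrtdy"):
    """Return the sorted, unique words of text that use only the given letters."""
    # \b anchors the match at word boundaries, roughly like grep -w;
    # re.IGNORECASE corresponds to grep -i.
    pattern = re.compile(r"\b[" + re.escape(letters) + r"]+\b", re.IGNORECASE)
    # findall returns only the matching parts (like grep -o);
    # set() plus sorted() plays the role of sort -u.
    return sorted(set(pattern.findall(text)))


print(unique_words("Der Hund und die Tante sind da, der Hund"))
```

"Hund" is dropped because "h" is not in the letter set, while "Der" and "der" both stay, since sort -u also treats them as different lines.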
Kind regards,
Arne
PS: One more advantage of Neo: I can write my name with the first four keys:
aenr -> Arne :)
--- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- ---
A man is threatened with a knife in the street.
Two police officers are there at once and hold up a banner in front of it.
"Illegal scene. No one is allowed to see this."
The man is robbed, stabbed and bleeds to death,
because the police officers both have their hands full.
Welcome to Germany. Censorship is beautiful.
(http://draketo.de)
--- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- ---
#!/usr/bin/env python3
"""wordfilter - extract words from a set of plain text files and keep only those which consist only of a defined set of letters.

This allows you to train with texts which are relevant to you.

usage:
    - wordfilter.py [options] [<text files>]

options:
    --letters <string of letters> - only include words which contain only these letters (default 'uiaenrtd').
    --remove <string of letters> - remove all these letters from the text before doing anything else (default ',.').
    --length <number> - output lines with <number> words per line (default 12).
    --unique - include each word only once _and_ sort the list.
    --help - print this help output.

examples:
    - wordfilter.py *txt
      Get all words from the text files which only contain letters
      on which you have your fingers in the Neo layout :)
    - wordfilter.py --unique --length 1 *txt
      Get every word only once, one word per line.
    - wordfilter.py --letters uiaenrtd --remove ",." --length 12 README *.txt
      Get all words from the text files which only contain the
      specified letters but ignore (and remove) ',' and '.'.
      Output them in lines of <number> words (default 12).
      Default is to show only words which can be typed with the basic row
      in the Neo keymap - that's what I'm writing this program for :)
    - wordfilter.py --help
      Print this help text.
"""
__copyright__ = """
wordfilter - extract words which contain only specific letters.
-----------------------------------------------------------------
© 2009 Copyright by Arne Babenhauserheide
This program is free software; you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
the Free Software Foundation; either version 2 of the License, or
(at your option) any later version.
This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
GNU General Public License for more details.
You should have received a copy of the GNU General Public License
along with this program; if not, write to the Free Software
Foundation, Inc., 51 Franklin St, Fifth Floor, Boston,
MA 02110-1301 USA
"""
# First we parse the basic command line arguments to get good reaction times
import sys
from sys import argv
if not argv[1:] or "--help" in argv:
    print(__doc__)
    # stop here, otherwise "--help" would later be read as a file name
    sys.exit(0)
# We need to be able to read the files.
def read_file(path):
    """Read a file.

    @return: the file content as a string."""
    # the with block closes the file even if reading fails
    with open(path) as f:
        return f.read()
# We also need to be able to split a text into words and filter out the words which have letters we don't want.
def get_and_filter_words(text, letters="uiaenrtd", remove=",."):
    """Split the text into words and filter out all words which contain letters that are not in letters. Before filtering, remove the letters given by remove.

    @param text: The text to parse.
    @param letters: The letters words are allowed to contain.
    @param remove: The letters which get removed before filtering.
    @return: A list of fitting words.
    """
    # First split the text by newlines and spaces
    raw_words = text.split()
    # now remove every occurrence of the letters to ignore
    # (list.remove() would only drop the first occurrence per word)
    words = []
    for word in raw_words:
        words.append("".join(letter for letter in word if letter not in remove))
    # Now filter out unwanted words
    raw_words = words
    words = []
    for word in raw_words:
        # skip words which are empty after removing letters
        if not word:
            continue
        # simply go to the next word if one of the letters in the word is not in our letters.
        valid = True
        for letter in word:
            if letter.lower() not in letters:
                valid = False
                break
        if not valid:
            continue
        words.append(word)
    # we're already done
    return words
### Command line interface
if __name__ == "__main__":
    # First read and remove the options from the argv.
    # Options are deleted by position, so a file which happens to have
    # the same name as an option value stays in argv.
    if "--letters" in argv:
        pos = argv.index("--letters")
        letters = argv[pos + 1]
        del argv[pos:pos + 2]
    else:
        letters = "uiaenrtd"
    if "--remove" in argv:
        pos = argv.index("--remove")
        remove = argv[pos + 1]
        del argv[pos:pos + 2]
    else:
        remove = ",."
    if "--unique" in argv:
        unique = True
        argv.remove("--unique")
    else:
        unique = False
    if "--length" in argv:
        pos = argv.index("--length")
        length = int(argv[pos + 1])
        del argv[pos:pos + 2]
    else:
        length = 12
    # Now read all files
    words = []
    for path in argv[1:]:
        words.extend(get_and_filter_words(read_file(path), letters, remove))
    # If we only want every word once, turn the list into a set and sort it.
    if unique:
        words = sorted(set(words))
    # and print all words in lines of length words each
    for start in range(0, len(words), length):
        print(" ".join(words[start:start + length]))
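The filtering step could also be written more compactly with str.translate and a set comparison. This is only a sketch of an alternative, not what the attached script does; the name filter_words is made up for this example. Because the unwanted characters are deleted before splitting, tokens which consist only of removed characters vanish on their own.

```python
def filter_words(text, letters="uiaenrtd", remove=",."):
    """Return the words of text that consist only of the allowed letters."""
    allowed = set(letters.lower())
    # The third argument of str.maketrans lists characters to delete,
    # so translate() drops every occurrence of them.
    cleaned = text.translate(str.maketrans("", "", remove))
    # A word fits if the set of its lowercased letters is a subset of allowed.
    return [word for word in cleaned.split() if set(word.lower()) <= allowed]


print(filter_words("Eine Tante, die rannte. Der Hund"))
```

The set comparison replaces the inner loop over the letters of each word; for the short words of normal text the difference in speed is probably small.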
