On Thursday, 10 September 2009 11:17:45, Knittl wrote:
> but then the script could also take care of sort and uniq right away ;)

I wrote it to generate training lists for myself automatically, 
so I'm quite happy to fall back on the shell for that. 

Then again... 

$ date
Do 10. Sep 11:19:17 CEST 2009

$ hg ci -m 'Added "--unique" and imrpoved the docstring.'
$ hg ci -m "--unique now also sorts."

$ date
Do 10. Sep 11:29:13 CEST 2009

$ time ./wordfilter.py --unique --length 1 *txt
...
real    0m2.182s
user    0m1.970s
sys     0m0.060s

(a small personal speed test; the result is attached)
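
For the curious: what --unique does boils down to the Python counterpart 
of `sort -u`. A minimal sketch (the function name unique_sorted is made up 
for this mail, the script does the same thing inline):

```python
def unique_sorted(words):
    """Return each word once, in sorted order (roughly like `sort -u`)."""
    # A set drops the duplicates; sorted() gives back an ordered list.
    return sorted(set(words))


if __name__ == "__main__":
    # Hypothetical sample input, not taken from the real texts.
    print(unique_sorted(["der", "an", "der", "rat", "an"]))
```

One difference to keep in mind: Python sorts strings by code point here, 
while `sort` follows the locale, so the order can differ for umlauts and 
mixed case. 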
 
> and perl is somehow the cliché language par excellence for tasks like
> this, that's why I said it.

For me that's Python :) 

> by the way, I'm still grepping my irc logs, that has been running for an
>  hour already *g* maybe I should have used wc -l, I can well imagine that
>  with this much data it does make a difference whether every character
>  and word has to be counted or only the line endings.

Why not try it with my script?  By the way, the slow part there is 
probably grep. 

$ time grep -iowh '[uiaeosnrtdy]\+' *txt  > /dev/null

real    0m36.706s
user    0m25.630s
sys     0m0.010s

$ time grep -iowh '[uiaeosnrtdy]\+' *txt | sort -u | wc
   1492    1492    9769

real    0m28.982s
user    0m25.590s
sys     0m0.010s

Piping the output on to the next command is even faster than writing to /dev/null! 

(The texts are from Project Gutenberg)
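
For comparison, the grep call above can be approximated in Python with a 
regular expression. This is only a rough sketch: I assume that \b is close 
enough to grep's -w, and the function name matching_words is made up for 
this mail. 

```python
import re

# Same letter set as in the grep call; \b stands in for grep's -w
# and re.IGNORECASE for -i.
WORD = re.compile(r"\b[uiaeosnrtdy]+\b", re.IGNORECASE)


def matching_words(text):
    """Return all whole words in text made up only of the given letters."""
    return WORD.findall(text)


if __name__ == "__main__":
    # Hypothetical sample sentence, not from the Gutenberg texts.
    sample = "Der Rat kommt an"
    print(sorted(set(matching_words(sample))))
```

Wrapping the result in sorted(set(...)) plays the role of `sort -u`, and 
len() on that list gives the count that `wc` reports. 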

Best regards, 
Arne

PS: Another advantage of Neo: I can write my name with the first four keys: 
aenr -> Arne :) 

--- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- ---
A man is threatened with a knife in the street. 
Two policemen are there at once and hold up a banner in front of it. 

        "Illegal scene. Nobody may see this."

The man is robbed, stabbed and bleeds to death, 
because both policemen have their hands full. 

Welcome to Germany. Censorship is beautiful. 
                      (http://draketo.de)
--- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- ---

#!/usr/bin/env python3

"""wordfilter - extract words from a set of normal text files and keep only those which consist only of a defined set of letters.

This allows you to train for texts which are relevant to you.  

usage:

    - wordfilter.py [options] [<text files>]

options:

    --letters <string of letters> - only include words which contain only these letters (default 'uiaenrtd').
    --remove <string of letters> - remove all these letters from the text before doing anything else (default ',.').
    --length <number> - output lines with <number> words per line (default 12).
    --unique - Include each word only once _and_ sort the list.
    --help - print this help output.

examples: 

    - wordfilter.py *txt
      Get all words from the text files which only contain letters
      on which you have your fingers in the Neo layout :) 

    - wordfilter.py --unique --length 1 *txt
      Get every word only once, one word per line. 

    - wordfilter.py --letters uiaenrtd --remove ",." --length 12 README *.txt
      Get all words from the text files which only contain the
      specified letters but ignore (and remove) ',' and '.'.
      Output them in lines of <length> words each (default 12).
      Default is to show only words which can be typed with the basic row
      in the neo keymap - that's what I'm writing this program for :) 

    - wordfilter.py --help
      print this help text. 

"""

__copyright__ = """ 
  wordfilter - extract words which contain only specific letters. 
----------------------------------------------------------------- 
© 2009 Copyright by Arne Babenhauserheide

  This program is free software; you can redistribute it and/or modify
  it under the terms of the GNU General Public License as published by
  the Free Software Foundation; either version 2 of the License, or
  (at your option) any later version.

  This program is distributed in the hope that it will be useful,
  but WITHOUT ANY WARRANTY; without even the implied warranty of
  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
  GNU General Public License for more details.

  You should have received a copy of the GNU General Public License
  along with this program; if not, write to the Free Software
  Foundation, Inc., 51 Franklin St, Fifth Floor, Boston,
  MA 02110-1301 USA

""" 

# First we parse the basic command line arguments to get good reaction times
from sys import argv, exit
if not argv[1:] or "--help" in argv:
    print(__doc__)
    # nothing else to do after showing the help
    exit(0)

# we need to be able to read the files
def read_file(path):
    """Read a file and
    @return: the file content as a string."""

    # the with block closes the file even if reading fails
    with open(path) as f:
        return f.read()

# We also need to be able to split a file into words and filter out the words which have letters we don't want.
def get_and_filter_words(text, letters="uiaenrtd", remove=",."):
    """Split the text into words and filter out all words which contain
    letters outside the allowed set.  Before filtering, remove the letters
    given by remove.
    @param text: The text to parse. 
    @param letters: The letters words are allowed to contain.
    @param remove: The letters which get removed before filtering.
    @return: A list of fitting words.
    """
    # First split the text by newlines and spaces
    raw_words = text.split()
    # now remove every occurrence of the letters to ignore
    stripped = ["".join(c for c in word if c not in remove)
                for word in raw_words]
    # Now filter out unwanted words: keep a word only if it is not empty
    # and every letter in it (lowercased) is one of our letters.
    words = [word for word in stripped
             if word and all(letter.lower() in letters for letter in word)]
    # we're already done
    return words

### Command line interface

if __name__ == "__main__":
    # First read and remove the options from the argv
    if "--letters" in argv:
        letters = argv[argv.index("--letters") + 1]
        argv.remove("--letters")
        argv.remove(letters)
    else:
        letters = "uiaenrtd"

    if "--remove" in argv:
        remove = argv[argv.index("--remove") + 1]
        argv.remove("--remove")
        argv.remove(remove)
    else:
        remove = ",."

    if "--unique" in argv:
        unique = True
        argv.remove("--unique")
    else:
        unique = False

    if "--length" in argv:
        length = argv[argv.index("--length") + 1]
        argv.remove("--length")
        argv.remove(length)
        length = int(length)
    else:
        length = 12


    # Now read all files
    word_lists = [get_and_filter_words(read_file(path), letters, remove) for path in argv[1:]]
    words = []
    for i in word_lists:
        words.extend(i)

    # If we only want every word once, turn the list into a set and back again.
    if unique: 
        words = list(set(words))
        words.sort()

    # and print all words in lines of `length` words each
    i = 0
    while i*length < len(words): 
        for word in words[i*length : (i+1)*length]:
            print(word, end=" ")
        print()
        i += 1
    
