kurt.alfred.muel...@gmail.com wrote:

> On Wednesday, August 28, 2013 1:13:36 PM UTC+2, Dave Angel wrote:
>> On 28/8/2013 04:32, Kurt Mueller wrote:
>> > For some text manipulation tasks I need a template to split lines
>> > from stdin into a list of strings the way shlex.split() does it.
>> > The encoding of the input can vary.
>
>> Does that mean it'll vary from one run of the program to the next, or
>> it'll vary from one line to the next? Your code below assumes the
>> latter. That can greatly increase the unreliability of the already
>> dubious chardet algorithm.
>
> The encoding only varies from one launch to the other.
> The reason I process each line is memory usage.
>
> Option to have a better reliability of chardet:
> I could read all of the input, save the input lines for further
> processing in a list, feed the lines into
> chardet.universaldetector.UniversalDetector.feed()/close()/result()
> and then decode and split/shlex the lines in the list.
> That way the chardet oracle would be more reliable, but
> roughly twice as much memory will be used.
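For reference, the read-everything variant described in the quoted text might look roughly like the sketch below. It assumes Python 3 and byte-string lines (for example from sys.stdin.buffer). The names detect_then_decode and fallback are made up for illustration, and detector is expected to be any object with the chardet UniversalDetector interface (feed(), close(), and the done and result attributes).

```python
def detect_then_decode(lines, detector, fallback="utf-8"):
    """Buffer all raw lines, feed them to the detector, then decode them.

    `detector` is assumed to behave like
    chardet.universaldetector.UniversalDetector.
    `fallback` is a hypothetical parameter used when detection fails.
    """
    buffered = []
    for line in lines:
        # Every line is kept for later decoding: this is the memory cost.
        buffered.append(line)
        # Stop feeding once the detector reports it is confident enough.
        if not detector.done:
            detector.feed(line)
    detector.close()
    encoding = detector.result.get("encoding") or fallback
    return encoding, [line.decode(encoding) for line in buffered]


# Possible usage, assuming chardet is installed:
#   import sys
#   from chardet.universaldetector import UniversalDetector
#   encoding, text_lines = detect_then_decode(sys.stdin.buffer, UniversalDetector())
```

The raw lines and the decoded lines both exist in memory at the end, which is where the "roughly twice as much memory" estimate comes from.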
You can compromise and read ahead a limited number of lines. Here's my demo
script (the interesting part is detect_encoding(); I got a bit distracted by
unrelated stuff...). The script does one extra decode/encode cycle -- it
should be easy to avoid that if you run into performance issues.

#!/usr/bin/env python
import sys
import shlex
import chardet

from itertools import islice, chain

def detect_encoding(instream, encoding, detect_lines):
    # Precedence: explicit encoding, then the stream's own encoding,
    # then a chardet guess based on the first detect_lines lines.
    if encoding is None:
        encoding = instream.encoding
        if encoding is None:
            # detect_lines=None makes islice() consume the whole stream.
            head = list(islice(instream, detect_lines))
            encoding = chardet.detect("".join(head))["encoding"]
            # Put the lines read ahead back in front of the rest.
            instream = chain(head, instream)
    return encoding, instream

def split_line(line, comments=True, posix=True):
    # Python 2's shlex works on byte strings, hence the UTF-8 round trip.
    parts = shlex.split(line.encode("utf-8"), comments=comments, posix=posix)
    return [part.decode("utf-8") for part in parts]

def to_int(s):
    """
    >>> to_int(" 42")
    42
    >>> to_int("-1") is None
    True
    >>> to_int(" NONE ") is None
    True
    >>> to_int("none") is None
    True
    >>> to_int(" 0x400 ")
    1024
    """
    s = s.lower().strip()
    if s in {"none", "-1"}:
        return None
    return int(s, 16 if s.startswith("0x") else 10)

def main():
    import argparse
    parser = argparse.ArgumentParser()
    parser.add_argument("-e", "--encoding")
    parser.add_argument(
        "-d", "--detect-lines", type=to_int, default=100,
        help=("number of lines used to determine encoding; "
              "'none' or -1 for whole file. (default: 100)"))
    args = parser.parse_args()

    encoding, instream = detect_encoding(
        sys.stdin,
        encoding=args.encoding,
        detect_lines=args.detect_lines)
    lines = (line.decode(encoding) for line in instream)

    for line in lines:
        try:
            parts = split_line(line)
        except ValueError as exc:
            # shlex raises ValueError, e.g. for an unclosed quotation.
            print >> sys.stderr, exc
        else:
            print parts

if __name__ == "__main__":
    main()

-- 
http://mail.python.org/mailman/listinfo/python-list
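As for avoiding the extra decode/encode cycle: on Python 3 this gets simpler, because shlex.split() accepts str directly. A minimal sketch, assuming Python 3 and raw byte-string lines; the encoding parameter is an addition for illustration and is not part of the script above:

```python
import shlex


def split_line(line, encoding, comments=True, posix=True):
    """Decode one raw line and split it shell-style.

    `line` is assumed to be a bytes object; `encoding` is whatever
    the detection step came up with (hypothetical extra parameter).
    """
    # Decode exactly once, then let shlex work on the text itself.
    text = line.decode(encoding)
    return shlex.split(text, comments=comments, posix=posix)


if __name__ == "__main__":
    print(split_line(b'one "two three" # trailing comment\n', "ascii"))
```

As in the original script, an unclosed quotation still raises ValueError, so the try/except around the call stays the same.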