On Jan 22, 4:45 pm, Roy Smith <r...@panix.com> wrote:
> In article
> <e1f0636a-195c-4fbb-931a-4d619d5f0...@g27g2000yqa.googlegroups.com>,
> Yigit Turgut <y.tur...@gmail.com> wrote:
> > Hi all,
> >
> > I have a text file approximately 20 MB in size that contains about one
> > million lines. I was doing some processing on the data, but then the
> > data rate increased and it takes a very long time to process. I import
> > it using numpy.loadtxt; here is a fragment of the data:
> >
> > 0.000006 -0.0004
> > 0.000071 0.0028
> > 0.000079 0.0044
> > 0.000086 0.0104
> > .
> > .
> > .
> >
> > The first column is the timestamp in seconds and the second column is
> > the data. The file contains 8 seconds of measurement, and I would like
> > to be able to split it into 3 parts separated at specific time
> > locations. For example, I want to divide the file into 3 parts: the
> > first part containing 3 seconds of data, the second containing 2
> > seconds, and the third containing 3 seconds.
>
> I would do this with standard unix tools:
>
> grep '^[012]' input.txt > first-three-seconds.txt
> grep '^[34]' input.txt > next-two-seconds.txt
> grep '^[567]' input.txt > next-three-seconds.txt
>
> Sure, it makes three passes over the data, but for 20 MB of data, you
> could have the whole job done in less time than it took me to type this.
>
> As a sanity check, I would run "wc -l" on each of the files and confirm
> that they add up to the original line count.

This works and is very fast, but it missed a few hundred lines unfortunately.
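One plausible cause of the missing lines, assuming the capture runs up to or slightly past the 8-second mark, is that lines whose timestamp begins with '8' match none of the three patterns. Below is a sketch of a single-pass split that compares the timestamp numerically instead of by first character; the file names and the 3.0 / 5.0 boundaries are placeholders taken from the problem description, not from the thread:

    # Split on numeric timestamp thresholds in a single pass.
    # "input.txt", the part*.txt names and the 3.0 / 5.0 limits are
    # assumptions for the 3 s + 2 s + 3 s split described above.
    boundaries = [(3.0, "part1.txt"), (5.0, "part2.txt"), (float("inf"), "part3.txt")]

    with open("input.txt") as input_file:
        line = input_file.readline()
        for limit, path in boundaries:
            with open(path, "w") as output_file:
                while line:
                    fields = line.split()
                    if fields and float(fields[0]) >= limit:
                        break  # this line starts the next section
                    output_file.write(line)
                    line = input_file.readline()

The line that first crosses a boundary is carried over and written at the start of the next section, so nothing is dropped at the split points.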
On Jan 22, 5:19 pm, MRAB <pyt...@mrabarnett.plus.com> wrote:
> On 22/01/2012 14:32, Yigit Turgut wrote:
> > [problem description snipped]
> >
> > Splitting based on file size doesn't work that accurately for this
> > specific data; some columns go missing, etc. I need to split depending
> > on the column content:
> >
> > 1 - read the file until the first character of column1 is 3 (3 seconds)
> > 2 - save this region to another file
> > 3 - read the file where the first characters of column1 are between 3
> > and 5 (2 seconds)
> > 4 - save this region to another file
> > 5 - read the file where the first characters of column1 are between 5
> > and 8 (3 seconds)
> > 6 - save this region to another file
> >
> > I need to do this exactly because numpy.loadtxt or genfromtxt doesn't
> > deal well with missing columns / rows. I even tried the invalid_raise
> > parameter of genfromtxt but no luck.
> >
> > I am sure it's a few lines of code for experienced users and I would
> > appreciate some guidance.
>
> Here's a solution in Python 3:
>
> input_path = "..."
> section_1_path = "..."
> section_2_path = "..."
> section_3_path = "..."
>
> with open(input_path) as input_file:
>     try:
>         line = next(input_file)
>
>         # Copy section 1.
>         with open(section_1_path, "w") as output_file:
>             while line[0] < "3":
>                 output_file.write(line)
>                 line = next(input_file)
>
>         # Copy section 2.
>         with open(section_2_path, "w") as output_file:
>             while line[5] < "5":
>                 output_file.write(line)
>                 line = next(input_file)
>
>         # Copy section 3.
>         with open(section_3_path, "w") as output_file:
>             while True:
>                 output_file.write(line)
>                 line = next(input_file)
>     except StopIteration:
>         pass

With the following correction:

    while line[5] < "5":

should be

    while line[0] < "5":

This works well.

On Jan 22, 5:39 pm, Arnaud Delobelle <arno...@gmail.com> wrote:
> On 22 January 2012 15:19, MRAB <pyt...@mrabarnett.plus.com> wrote:
> > Here's a solution in Python 3:
> >
> > [solution snipped]
>
> Or more succinctly (but not tested):
>
> sections = [
>     ("3", "section_1")
>     ("5", "section_2")
>     ("\xFF", "section_3")
> ]
>
> with open(input_path) as input_file:
>     lines = iter(input_file)
>     for end, path in sections:
>         with open(path, "w") as output_file:
>             for line in lines:
>                 if line >= end:
>                     break
>                 output_file.write(line)
>
> --
> Arnaud

Good idea, especially when dealing with variable numbers of sections. But somehow I got:

    ("5", "section_2")
    TypeError: 'tuple' object is not callable
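The TypeError comes from the sections literal in the untested snippet: the commas between the tuples are missing, so ("3", "section_1")("5", "section_2") parses as a call on the first tuple, which is exactly what "'tuple' object is not callable" reports. With the commas added the approach works as intended; "\xFF" acts as a sentinel that compares greater than any digit, so the last section collects everything that remains:

    # Corrected list literal: commas added between the tuples.
    sections = [
        ("3", "section_1"),
        ("5", "section_2"),
        ("\xFF", "section_3"),
    ]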