On Tue, Feb 24, 2009 at 9:48 AM, Etaoin Shrdlu <shr...@unlimitedmail.org> wrote:
> On Tuesday 24 February 2009, 15:41, Mark Knecht wrote:
>> On Tue, Feb 24, 2009 at 2:56 AM, Etaoin Shrdlu
>> <shr...@unlimitedmail.org> wrote: <SNIP>
>>
>> > So, in my understanding this is what we want to accomplish so far:
>> >
>> > given an input of the form
>> >
>> > D1,T1,a1,b1,c1,d1,...,R1
>> > D2,T2,a2,b2,c2,d2,...,R2
>> > D3,T3,a3,b3,c3,d3,...,R3
>> > D4,T4,a4,b4,c4,d4,...,R4
>> > D5,T5,a5,b5,c5,d5,...,R5
>> >
>> > (the ... means that an arbitrary number of columns can follow)
>> >
>> > You want to group lines by n at a time, keeping the D and T column
>> > from the first line of each group, and keeping the R column from the
>> > last line of the group, so for example with n=3 we would have:
>> >
>> > D1,T1,a1,b1,c1,d1,...a2,b2,c2,d2,...a3,b3,c3,d3,...R3
>> > D1,T1,a2,b2,c2,d2,...a3,b3,c3,d3,...a4,b4,c4,d4,...R4
>> > D1,T1,a3,b3,c3,d3,...a4,b4,c4,d4,...a5,b5,c5,d5,...R5
>> >
>> > (and you're right, that produces an output that is roughly n times
>> > the size of the original file)
>> >
>> > Now, in addition to that, you also want to drop an arbitrary number
>> > of columns in the a,b,c... group. So for example, you want to drop
>> > columns 2 and 3 (b and c in the example), so you'd end up with
>> > something like
>> >
>> > D1,T1,a1,d1,...a2,d2,...a3,d3,...R3
>> > D1,T1,a2,d2,...a3,d3,...a4,d4,...R4
>> > D1,T1,a3,d3,...a4,d4,...a5,d5,...R5
>> >
>> > Please confirm that my understanding is correct, so I can come up
>> > with some code to do that.
>>
>> Perfectly correct for all the data rows.
>>
>> For the header I now see we have a slightly harder job. What we'd
>> need to do is read the first line of the file, duplicate it N times,
>> and then drop the same columns we drop in the data rows. The problem
>> is that I'd then have the same header value repeated N times, which
>> won't make sense to the tool that uses this data. It would help if we
>> could read the header and automatically append the number N (or some
>> string like _N) to each duplicated name.
>
> So in the last example the header would be something like
>
> D,T,a,b,c,d,...,R
>
> in the original file, and would become
>
> D,T,a_1,b_1,c_1,d_1,...a_2,b_2,c_2,d_2,...a_3,b_3,c_3,d_3,...R
>
> and
>
> D,T,a_1,d_1,...a_2,d_2,...a_3,d_3,...R
>
> respectively for the two sample outputs above.
>
>> Maybe better would be a separate small program to do the header part
>> and then this program could read that header and make it the first
>> line of the output file. My worry is that when this data file becomes
>> very large - say 1GB or more of data - I probably cannot open the file
>> with vi to edit the header. Better if I could put the header in its
>> own file. That file would be 1 line long. I could check it for the
>> name edits, make sure it's right, and then the program you are so
>> kindly building would just read it, cut out columns, and put it at the
>> start of the new large file.
>>
>> Does that make sense?
>
> Maybe, but while we're at it my personal preference would be to just add
> header handling to the existing program. Here's the revised code (also
> cleaned up a bit and more structured):
>
> # returns a ",a1,b1,c1,.." line
> # drops unwanted columns
> # the extra "j" parameter keeps the loop variable local to the function
> function do_line(num,   j) {
>  line=""
>  for(j=3;j<NF;j++)
>    if(!((j-2) in dc)){line=line OFS $j (num>0?"_"num:"")}
>  return line
> }
>
> # dcols is a string like '2,3,4,7' with a comma separated values list of
> # the columns to drop (first data column after date/time is column 1
> # here)
>
> BEGIN {FS=OFS=","
>       t=split(dcols,tmp,/,/)
>       for(i=1;i<=t;i++)dc[tmp[i]]
>      }
>
> # process the header
> NR==1{
>  l=$1 OFS $2;
>  for(i=1;i<=n;i++)l=l do_line(i)
>  l=l OFS $NF
>  print l
> }
>
> NR>=2{
>  for(i=1;i<n;i++){
>    s[i]=s[i+1]
>    dt[i]=dt[i+1]
>  }
>
>  dt[n]=$1 OFS $2
>  s[n]=do_line(-1)
>
>  if(NR>=n+1){
>    l=dt[1]
>    for(i=1;i<=n;i++)l=l s[i]
>    l=l OFS $NF
>    print l
>  }
> }'
>
> I think two examples are enough to demonstrate how it works (blank
> lines between commands added for clarity; remember the first line is
> the header line):
>
> $ cat file.csv
> D,T,a,b,c,d,R
> D1,T1,a1,b1,c1,d1,R1
> D2,T2,a2,b2,c2,d2,R2
> D3,T3,a3,b3,c3,d3,R3
> D4,T4,a4,b4,c4,d4,R4
> D5,T5,a5,b5,c5,d5,R5
> D6,T6,a6,b6,c6,d6,R6
>
> $ awk -v n=3 -f program.awk file.csv
> D,T,a_1,b_1,c_1,d_1,a_2,b_2,c_2,d_2,a_3,b_3,c_3,d_3,R
> D1,T1,a1,b1,c1,d1,a2,b2,c2,d2,a3,b3,c3,d3,R3
> D2,T2,a2,b2,c2,d2,a3,b3,c3,d3,a4,b4,c4,d4,R4
> D3,T3,a3,b3,c3,d3,a4,b4,c4,d4,a5,b5,c5,d5,R5
> D4,T4,a4,b4,c4,d4,a5,b5,c5,d5,a6,b6,c6,d6,R6
>
> $ awk -v n=3 -v dcols='1,3' -f program.awk file.csv
> D,T,b_1,d_1,b_2,d_2,b_3,d_3,R
> D1,T1,b1,d1,b2,d2,b3,d3,R3
> D2,T2,b2,d2,b3,d3,b4,d4,R4
> D3,T3,b3,d3,b4,d4,b5,d5,R5
> D4,T4,b4,d4,b5,d5,b6,d6,R6
>
> The program still assumes that the first two columns and the last
> column are excluded from the general processing, and are just
> prepended and appended to the output lines. That can be changed
> though.
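As a sketch of that change (my own illustration, not code from the thread): the hard-coded "first two columns" could become a -v parameter, so the inner loop runs from k+1 to NF-1 instead of from 3 to NF-1:

```shell
# A sketch of generalizing the "first two columns are keys" assumption:
# pass the key-column count as -v k=... and loop from k+1 to NF-1
# instead of the hard-coded 3 to NF-1.
printf 'D1,T1,a1,b1,R1\n' |
awk -F, -v k=2 '{
  line = ""
  for (j = k + 1; j < NF; j++) line = line "," $j   # data columns only
  print "data:" line                                # -> data:,a1,b1
}'
```

Raising or lowering k then treats more or fewer leading columns as the carried-over keys, with no other change to the program.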
>
>
Excellent! It worked identically for me once I removed an extra single
quote after the last brace that somehow crept into the copy/paste in
this email. Really good stuff.

I tried it on a 40MB data file from TradeStation with 6 years of
5-minute Dow Futures data. Going 10 wide and deleting 1 column, it
generated a 292MB output file in about 1 minute on my machine. That's
more than acceptable performance; the data mining itself will likely
take hours if not days, certainly more than a few minutes.

Looks like I'm running into one more problem, and then I'm ready to
give it a try for real. Unfortunately one vendor platform puts quotes
around the names in the header row, so your _N suffix comes out as
"High"_4 instead of High_4 or "High_4". I'd like to fix that, as I'm
pretty sure the form we have now won't be acceptable, but I don't know
yet whether it's better to keep the quotes or drop them. My two target
data mining platforms are R, which is in portage, and RapidMiner, which
is available as open source from the Rapid-I web site. I'll try both
header formats with both tools and see what happens.
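If stripping the quotes turns out to be the answer, gsub() can remove them from a header field before the suffix is appended; a minimal sketch (the field name and suffix value here are just examples, not from the thread):

```shell
# Strip a vendor's quotes from a header field before adding the _N
# suffix, so "High" becomes High_4 (or "High_4" if re-quoted).
echo '"High"' | awk -v n=4 '{
  gsub(/"/, "", $0)            # drop the surrounding quotes
  printf "%s_%d\n", $0, n      # -> High_4
  printf "\"%s_%d\"\n", $0, n  # -> "High_4", if quotes are wanted
}'
```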

I had worried about checking the header of a really large file to see
whether I had cut the correct columns, but it turns out that

cat awkDataOut.csv | more

in a terminal prints the first few lines very quickly. From there I can
either just look at it, or copy/paste it into a new csv file and load
that into something like OpenOffice Calc to make sure I got the right
columns. So I don't think there's any practical need to do more with
the header beyond whatever turns out to be the right answer on the
quotes. My worry had been that when I request 5 data columns it's not
obvious what order they come out in, so I'd have to look at the file
and figure out where everything was. Turns out it's not such a big
deal.
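For the record, a quick way to check column order without pulling the header into Calc is to print it one field per line with line numbers; a small sketch (the sample file here stands in for the real awkDataOut.csv):

```shell
# Print a CSV header one column per line, numbered, to verify column
# order without loading the whole file.
printf 'D,T,a_1,b_1,a_2,b_2,R\nD1,T1,a1,b1,a2,b2,R2\n' > sample.csv
head -n 1 sample.csv | tr ',' '\n' | nl   # e.g. line 3 is a_1
```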

This has been extremely helpful. I really am embarrassed that you did
all the work. I owe you at least a beer or two (or whatever your
preferred drink might be) should we ever have a chance to exercise our
right arms together one day. I'll drop you a note back when I figure
out what makes sense about the header quotes.

Cheers,
Mark
