On Tuesday 24 February 2009, 15:41, Mark Knecht wrote:
> On Tue, Feb 24, 2009 at 2:56 AM, Etaoin Shrdlu
> <shr...@unlimitedmail.org> wrote: <SNIP>
>
> > So, in my understanding this is what we want to accomplish so far:
> >
> > given an input of the form
> >
> > D1,T1,a1,b1,c1,d1,...,R1
> > D2,T2,a2,b2,c2,d2,...,R2
> > D3,T3,a3,b3,c3,d3,...,R3
> > D4,T4,a4,b4,c4,d4,...,R4
> > D5,T5,a5,b5,c5,d5,...,R5
> >
> > (the ... means that an arbitrary number of columns can follow)
> >
> > You want to group lines by n at a time, keeping the D and T column
> > from the first line of each group, and keeping the R column from the
> > last line of the group, so for example with n=3 we would have:
> >
> > D1,T1,a1,b1,c1,d1,...a2,b2,c2,d2,...a3,b3,c3,d3,...R3
> > D1,T1,a2,b2,c2,d2,...a3,b3,c3,d3,...a4,b4,c4,d4,...R4
> > D1,T1,a3,b3,c3,d3,...a4,b4,c4,d4,...a5,b5,c5,d5,...R5
> >
> > (and you're right, that produces an output that is roughly n times
> > the size of the original file)
> >
> > Now, in addition to that, you also want to drop an arbitrary number
> > of columns in the a,b,c... group. So for example, you want to drop
> > columns 2 and 3 (b and c in the example), so you'd end up with
> > something like
> >
> > D1,T1,a1,d1,...a2,d2,...a3,d3,...R3
> > D1,T1,a2,d2,...a3,d3,...a4,d4,...R4
> > D1,T1,a3,d3,...a4,d4,...a5,d5,...R5
> >
> > Please confirm that my understanding is correct, so I can come up
> > with some code to do that.
>
> Perfectly correct for all the data rows.
>
> For the header I now see that we have a slightly harder job. What we'd
> need to do is read the first line of the file, duplicate it N times,
> and then drop the same columns as we drop in the rows. The problem is
> that we'd then have the same header value for N columns, which won't
> make sense to the tool that uses this data. It would be better if we
> could read the header and then automatically append the number N (or
> some string like _N) to each duplicated name.

So in the last example the header would be something like

D,T,a,b,c,d,...,R

in the original file, and would become 

D,T,a_1,b_1,c_1,d_1,...a_2,b_2,c_2,d_2,...a_3,b_3,c_3,d_3,...R

and

D,T,a_1,d_1,...a_2,d_2,...a_3,d_3,...R

respectively for the two sample outputs above.
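Taken in isolation, that header expansion is just a nested loop. A quick
standalone sketch (not the final program, and with no column dropping
yet):

```shell
# expand one header line for a window of n, suffixing the middle column
# names with _1.._n; the D/T and R columns are kept as-is
echo 'D,T,a,b,c,d,R' | awk -v n=3 -F, -v OFS=, '{
  out = $1 OFS $2
  for (i = 1; i <= n; i++)
    for (j = 3; j < NF; j++)
      out = out OFS $j "_" i
  print out OFS $NF
}'
# prints: D,T,a_1,b_1,c_1,d_1,a_2,b_2,c_2,d_2,a_3,b_3,c_3,d_3,R
```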

> Maybe better would be a separate small program to do the header part
> and then this program could read that header and make it the first
> line of the output file. My worry is that when this data file becomes
> very large - say 1GB or more of data - I probably cannot open the file
> with vi to edit the header. Better if I could put the header in its
> own file. That file would be one line long. I could check it for the
> name edits, make sure it's right, and then the program you are so
> kindly building would just read it, cut out columns, and put it at the
> start of the new large file.
>
> Does that make sense?

Maybe, but while we're at it, my personal preference would be to just 
add header handling to the existing program. Here's the revised code 
(also cleaned up a bit and restructured):

# returns a ",a1,b1,c1,..." fragment built from the current line,
# dropping unwanted columns; if num > 0, "_num" is appended to each
# value (used when expanding the header). The extra parameters after
# num make line and j local to the function.
function do_line(num,    line, j) {
  line=""
  for(j=3;j<NF;j++)
    if(!((j-2) in dc)){line=line OFS $j (num>0?"_"num:"")}
  return line
}

# dcols is a string like '2,3,4,7' with a comma separated values list of
# the columns to drop (first data column after date/time is column 1
# here)

BEGIN {FS=OFS=","
       t=split(dcols,tmp,/,/)
       for(i=1;i<=t;i++)dc[tmp[i]]
      }

# process the header
NR==1{
  l=$1 OFS $2
  for(i=1;i<=n;i++)l=l do_line(i)
  l=l OFS $NF
  print l
}

NR>=2{
  for(i=1;i<n;i++){
    s[i]=s[i+1]
    dt[i]=dt[i+1]
  }

  dt[n]=$1 OFS $2
  s[n]=do_line(-1)

  if(NR>=n+1){
    l=dt[1]
    for(i=1;i<=n;i++)l=l s[i]
    l=l OFS $NF
    print l
  }
}
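To make the NR>=2 part easier to follow on its own, here's a
stripped-down version of just the sliding window (no header handling,
no dropped columns); fed the five data rows from the first example, it
should reproduce the three output rows shown there. Note that it tests
NR >= n rather than NR >= n+1, since there is no header line to skip:

```shell
awk -v n=3 -F, -v OFS=, '
{
  # shift the window so slot 1 holds the oldest of the last n lines
  for (i = 1; i < n; i++) { mid[i] = mid[i+1]; dt[i] = dt[i+1] }
  dt[n] = $1 OFS $2                        # D,T of the newest line
  m = ""
  for (j = 3; j < NF; j++) m = m OFS $j    # the a,b,c,d,... columns
  mid[n] = m
  if (NR >= n) {                           # window full: emit one line
    out = dt[1]                            # D,T from the oldest line
    for (i = 1; i <= n; i++) out = out mid[i]
    print out OFS $NF                      # R from the newest line
  }
}' file.csv
```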

I think two examples are enough here to demonstrate how it works (blank 
lines between commands added for clarity; remember the first line is 
the header line):

$ cat file.csv
D,T,a,b,c,d,R
D1,T1,a1,b1,c1,d1,R1
D2,T2,a2,b2,c2,d2,R2
D3,T3,a3,b3,c3,d3,R3
D4,T4,a4,b4,c4,d4,R4
D5,T5,a5,b5,c5,d5,R5
D6,T6,a6,b6,c6,d6,R6

$ awk -v n=3 -f program.awk file.csv
D,T,a_1,b_1,c_1,d_1,a_2,b_2,c_2,d_2,a_3,b_3,c_3,d_3,R
D1,T1,a1,b1,c1,d1,a2,b2,c2,d2,a3,b3,c3,d3,R3
D2,T2,a2,b2,c2,d2,a3,b3,c3,d3,a4,b4,c4,d4,R4
D3,T3,a3,b3,c3,d3,a4,b4,c4,d4,a5,b5,c5,d5,R5
D4,T4,a4,b4,c4,d4,a5,b5,c5,d5,a6,b6,c6,d6,R6

$ awk -v n=3 -v dcols='1,3' -f program.awk file.csv
D,T,b_1,d_1,b_2,d_2,b_3,d_3,R
D1,T1,b1,d1,b2,d2,b3,d3,R3
D2,T2,b2,d2,b3,d3,b4,d4,R4
D3,T3,b3,d3,b4,d4,b5,d5,R5
D4,T4,b4,d4,b5,d5,b6,d6,R6
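One awk detail worth pointing out, since it looks odd at first sight: in 
the BEGIN block, the bare expression dc[tmp[i]] is enough to create the 
key, so "(j-2) in dc" later is a pure membership test. A tiny demo of 
the idiom:

```shell
awk 'BEGIN {
  t = split("2,3", tmp, /,/)      # a dcols-style list
  for (i = 1; i <= t; i++)
    dc[tmp[i]]                    # referencing the key creates it
  for (k = 1; k <= 4; k++)
    print k, (k in dc ? "dropped" : "kept")
}'
# prints:
# 1 kept
# 2 dropped
# 3 dropped
# 4 kept
```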

The program still assumes that the first two columns and the last 
column must be excluded from the general processing and just prepended 
and appended to the output lines. That can be changed, though.
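On the 1GB worry itself: even without a separate header file, standard 
tools can look at or swap the header of a huge file without ever 
opening it in vi. A rough sketch (the file names here are made up):

```shell
# look at just the header of a very large CSV; reads only the first line
head -n 1 big.csv

# swap in a hand-edited header kept in its own one-line file,
# without loading big.csv into an editor
{ cat new_header.csv; tail -n +2 big.csv; } > fixed.csv
```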
