Re: [gentoo-user] [OT] - command line read *.csv create new file

2009-02-25 Thread Etaoin Shrdlu
On Tuesday 24 February 2009, 23:51, Mark Knecht wrote:

 Looks like I'm running into one more problem and then I'm ready to
 give it a try for real. Unfortunately one vendor platform is putting
 quotes around the names in the header row so your _N increment looks
 like "High"_4 instead of High_4 or "High_4". I'd like to fix that as
 I'm pretty sure that the way we have it won't be acceptable, but I
 don't know whether it would be best to have the quotes or not have the
 quotes. My two target data mining platforms are R, which is in
 portage, and RapidMiner which is available as Open Source from the
 Rapid-i web site. I'll try it both ways with both header formats and
 see what happens.

Ok, in any case that is a minor fix and adjusting the program is no big 
deal.

 I had worried about checking the header on a really large file to see
 if I had cut the correct columns but it turns out that

 cat awkDataOut.csv | more

 in a terminal writes the first few lines very quickly. From there I
 can either just look at it or copy/paste into a new csv file, load it
 into something like Open Office Calc and make sure I got the right
 columns so I don't think there's any practical need to do anything
 more with the header other than whatever turns out to be the right
 answer with the quotes. My worry had been that when I request 5 data
 columns it's not obvious what order they are provided in so I'd have to
 look at the file and figure out where everything was. Turns out it's
 not such a big deal.

The currently implemented rule is as follows: if you request to not have, 
say, columns 2 and 5 (out of, say, a total of 5 in the original file - 
besides date/time and result), you get the columns in the same order 
they were in the original file, minus the ones you don't want, so in 
this example you will get columns 1, 3 and 4 in that order.
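
For a quick sanity check of that rule (a sketch, assuming the program.awk 
posted further down this archive and a file.csv whose header is 
D,T,a,b,c,d,e,R - i.e. five data columns between date/time and result):

$ awk -v n=1 -v dcols='2,5' -f program.awk file.csv
D,T,a_1,c_1,d_1,R
D1,T1,a1,c1,d1,R1
...

Columns 1, 3 and 4 survive, in their original order.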



Re: [gentoo-user] [OT] - command line read *.csv create new file

2009-02-24 Thread Etaoin Shrdlu
On Tuesday 24 February 2009, 03:26, Mark Knecht wrote:

 If I drop columns - and I do need to - then something like how cut
 works would be good, but it needs to repeat across all the rows being
 used. For instance, if I'm dropping columns 6  12 from a 20 column
 wide data set, then I'm dropping 6  12 from all N lines. This is
 where using cut after the line is built is difficult as I'm forced to
 figure out a list like 6,12,26,32,46,52, etc. Easy to make a mistake
 doing that. If I could say something like Drop 6  12 from all rows,
 and 1  2 from all rows higher than the first that make up this new
 line then that would be great. That's a lot to ask though.

 D1,T1,A1,B1,C1,D1,
 D2,T2,A2,B2,C2,D2,
 D3,T3,A3,B3,C3,D3,
 D4,T4,A4,B4,C4,D4,
 D5,T5,A5,B5,C5,D5,

 In the data above if I drop column A, then I drop it for all rows.
 (For instance, A contains 0 and isn't necessary, etc.)  Assuming 3
 wide I'd get

 D1,T1,B1,C1,D1,B2,C2,D2,B3,C3,D3
 D2,T2,B2,C2,D2,B3,C3,D3,B4,C4,D4
 D3,T3,B3,C3,D3,B4,C4,D4,B5,C5,D5

 Making that completely flexible - where I can drop 4 or 5 random
 columns - is probably a bit too much work. On the other hand maybe
 sending it to cut as part of the whole process, line by line or
 something, is more reasonable? I don't know.

The current dropcol variable drops fields from the beginning of the line. 
Doing that for arbitrary columns can be done, but requires an array 
in which to save the numbers of the columns to drop. 
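
In sketch form (not the final program), the membership test would look 
like this:

# mark the columns listed in dcols (e.g. -v dcols='2,5') in an array
BEGIN {
  FS=OFS=","
  t=split(dcols,tmp,",")          # split the list on commas
  for(i=1;i<=t;i++) dc[tmp[i]]    # referencing a key creates it
}
{
  out=""
  for(j=1;j<=NF;j++)
    if(!(j in dc))                # keep only the fields not marked
      out=out (out==""?"":OFS) $j
  print out
}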

So, in my understanding this is what we want to accomplish so far:

given an input of the form

D1,T1,a1,b1,c1,d1,...,R1
D2,T2,a2,b2,c2,d2,...,R2
D3,T3,a3,b3,c3,d3,...,R3
D4,T4,a4,b4,c4,d4,...,R4
D5,T5,a5,b5,c5,d5,...,R5

(the ... mean that an  arbitrary number of columns can follow)

You want to group lines by n at a time, keeping the D and T column from 
the first line of each group, and keeping the R column from the last 
line of the group, so for example with n=3 we would have:

D1,T1,a1,b1,c1,d1,...a2,b2,c2,d2,...a3,b3,c3,d3,...R3
D1,T1,a2,b2,c2,d2,...a3,b3,c3,d3,...a4,b4,c4,d4,...R4
D1,T1,a3,b3,c3,d3,...a4,b4,c4,d4,...a5,b5,c5,d5,...R5

(and you're right, that produces an output that is roughly n times the 
size of the original file)

Now, in addition to that, you also want to drop an arbitrary number of  
columns in the a,b,c... group. So for example, you want to drop columns 
2 and 3 (b and c in the example), so you'd end up with something like

D1,T1,a1,d1,...a2,d2,...a3,d3,...R3
D1,T1,a2,d2,...a3,d3,...a4,d4,...R4
D1,T1,a3,d3,...a4,d4,...a5,d5,...R5

Please confirm that my understanding is correct, so I can come up with 
some code to do that.

 I found a web site to study awk so I'm starting to see more or less
 how your example works when I have the code in front of me. Creating
 the code out of thin air might be a bit of a stretch for me at this
 point though.

I suggest you start from 

http://www.gnu.org/software/gawk/manual/gawk.html

It's really complete, but gradual, so you can have an easy start and move 
on to the complexities later.



Re: [gentoo-user] [OT] - command line read *.csv create new file

2009-02-24 Thread Mark Knecht
On Tue, Feb 24, 2009 at 2:56 AM, Etaoin Shrdlu shr...@unlimitedmail.org wrote:
SNIP

 So, in my understanding this is what we want to accomplish so far:

 given an input of the form

 D1,T1,a1,b1,c1,d1,...,R1
 D2,T2,a2,b2,c2,d2,...,R2
 D3,T3,a3,b3,c3,d3,...,R3
 D4,T4,a4,b4,c4,d4,...,R4
 D5,T5,a5,b5,c5,d5,...,R5

 (the ... mean that an  arbitrary number of columns can follow)

 You want to group lines by n at a time, keeping the D and T column from
 the first line of each group, and keeping the R column from the last
 line of the group, so for example with n=3 we would have:

 D1,T1,a1,b1,c1,d1,...a2,b2,c2,d2,...a3,b3,c3,d3,...R3
 D1,T1,a2,b2,c2,d2,...a3,b3,c3,d3,...a4,b4,c4,d4,...R4
 D1,T1,a3,b3,c3,d3,...a4,b4,c4,d4,...a5,b5,c5,d5,...R5

 (and you're right, that produces an output that is roughly n times the
 size of the original file)

 Now, in addition to that, you also want to drop an arbitrary number of
 columns in the a,b,c... group. So for example, you want to drop columns
 2 and 3 (b and c in the example), so you'd end up with something like

 D1,T1,a1,d1,...a2,d2,...a3,d3,...R3
 D1,T1,a2,d2,...a3,d3,...a4,d4,...R4
 D1,T1,a3,d3,...a4,d4,...a5,d5,...R5

 Please confirm that my understanding is correct, so I can come up with
 some code to do that.

Perfectly correct for all the data rows.

For the header I now see that we have a slightly harder job. What we'd
need to do is read the first line of the file, duplicate it N times,
and then drop the same columns as we drop in the rows. The problem is
that now I have the same header value for N columns which won't make
sense to the tool that uses this data. If we could read the header and
then automatically postpend the number N to each duplicated name. (or
some string like _N)

Maybe better would be a separate small program to do the header part
and then this program could read that header and make it the first
line of the output file. My worry is that when this data file becomes
very large - say 1GB or more of data - I probably cannot open the file
with vi to edit the header. Better if I could put the header in its
own file. That file would be 1 line long. I could check it for the
name edits, make sure it's right, and then the program you are so
kindly building would just read it, cut out columns, and put it at the
start of the new large file.

Does that make sense?
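
Something like this is what I have in mind (file names are only 
placeholders):

head -n 1 bigfile.csv > header.csv    # the one-line header, easy to check
tail -n +2 bigfile.csv > data.csv     # the data without the header
# fix up header.csv by hand, process data.csv, then:
cat header.csv data_processed.csv > final.csv

head, tail and cat all stream, so the big file never has to fit into an
editor.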


 I found a web site to study awk so I'm starting to see more or less
 how your example works when I have the code in front of me. Creating
 the code out of thin air might be a bit of a stretch for me at this
 point though.

 I suggest you start from

 http://www.gnu.org/software/gawk/manual/gawk.html

 It's really complete, but gradual, so you can have an easy start and move
 on to the complexities later.



Yes, very complete. A good reference. Thanks!

Cheers,
Mark



Re: [gentoo-user] [OT] - command line read *.csv create new file

2009-02-24 Thread Etaoin Shrdlu
On Tuesday 24 February 2009, 15:41, Mark Knecht wrote:
 On Tue, Feb 24, 2009 at 2:56 AM, Etaoin Shrdlu
 shr...@unlimitedmail.org wrote: SNIP

  So, in my understanding this is what we want to accomplish so far:
 
  given an input of the form
 
  D1,T1,a1,b1,c1,d1,...,R1
  D2,T2,a2,b2,c2,d2,...,R2
  D3,T3,a3,b3,c3,d3,...,R3
  D4,T4,a4,b4,c4,d4,...,R4
  D5,T5,a5,b5,c5,d5,...,R5
 
  (the ... mean that an  arbitrary number of columns can follow)
 
  You want to group lines by n at a time, keeping the D and T column
  from the first line of each group, and keeping the R column from the
  last line of the group, so for example with n=3 we would have:
 
  D1,T1,a1,b1,c1,d1,...a2,b2,c2,d2,...a3,b3,c3,d3,...R3
  D1,T1,a2,b2,c2,d2,...a3,b3,c3,d3,...a4,b4,c4,d4,...R4
  D1,T1,a3,b3,c3,d3,...a4,b4,c4,d4,...a5,b5,c5,d5,...R5
 
  (and you're right, that produces an output that is roughly n times
  the size of the original file)
 
  Now, in addition to that, you also want to drop an arbitrary number
  of columns in the a,b,c... group. So for example, you want to drop
  columns 2 and 3 (b and c in the example), so you'd end up with
  something like
 
  D1,T1,a1,d1,...a2,d2,...a3,d3,...R3
  D1,T1,a2,d2,...a3,d3,...a4,d4,...R4
  D1,T1,a3,d3,...a4,d4,...a5,d5,...R5
 
  Please confirm that my understanding is correct, so I can come up
  with some code to do that.

 Perfectly correct for all the data rows.

 For the header I now see that we have a slightly harder job. What we'd
 need to do is read the first line of the file, duplicate it N times,
 and then drop the same columns as we drop in the rows. The problem is
 that now I have the same header value for N columns which won't make
 sense to the tool that uses this data. If we could read the header and
 then automatically postpend the number N to each duplicated name. (or
 some string like _N)

So in the last example the header would be something like

D,T,a,b,c,d,...,R

in the original file, and would become 

D,T,a_1,b_1,c_1,d_1,...a_2,b_2,c_2,d_2,...a_3,b_3,c_3,d_3,...R

and

D,T,a_1,d_1,...a_2,d_2,...a_3,d_3,...R

respectively for the two sample outputs above.
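
(As a standalone sketch, the suffixing alone could be done with something 
like

awk -F, -v n=3 '{ printf "%s,%s",$1,$2
  for(i=1;i<=n;i++) for(j=3;j<NF;j++) printf ",%s_%d",$j,i
  printf ",%s\n",$NF }' header.csv

but it costs little to fold it into the main program, as done below.)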

 Maybe better would be a separate small program to do the header part
 and then this program could read that header and make it the first
 line of the output file. My worry is that when this data file becomes
 very large - say 1GB or more of data - I probably cannot open the file
 with vi to edit the header. Better if I could put the header in its
 own file. That file would be 1 line long. I could check it for the
 name edits, make sure it's right, and then the program you are so
 kindly building would just read it, cut out columns, and put it at the
 start of the new large file.

 Does that make sense?

Maybe, but while we're at it my personal preference would be to just add 
header handling to the existing program. Here's the revised code (also 
cleaned up a bit and more structured):

# returns a ",a1,b1,c1,.." line
# drops unwanted columns
function do_line(num) {
  line=""
  for(j=3;j<NF;j++)
    if(!((j-2) in dc)){line=line OFS $j (num>0?"_"num:"")}
  return line
}

# dcols is a string like '2,3,4,7' with a comma separated values list of
# the columns to drop (first data column after date/time is column 1
# here)

BEGIN {FS=OFS=","
   t=split(dcols,tmp,/,/)
   for(i=1;i<=t;i++)dc[tmp[i]]
  }

# process the header
NR==1{
  l=$1 OFS $2;
  for(i=1;i<=n;i++)l=l do_line(i)
  l=l OFS $NF
  print l
}

NR>=2{
  for(i=1;i<n;i++){
    s[i]=s[i+1]
    dt[i]=dt[i+1]
  }

  dt[n]=$1 OFS $2
  s[n]=do_line(-1)

  if(NR>=n+1){
    l=dt[1]
    for(i=1;i<=n;i++)l=l s[i]
    l=l OFS $NF
    print l
  }
}'

I think two examples are enough here to demonstrate how it works (blank 
lines between commands added for clarity - remember first line is header 
line):

$ cat file.csv
D,T,a,b,c,d,R
D1,T1,a1,b1,c1,d1,R1
D2,T2,a2,b2,c2,d2,R2
D3,T3,a3,b3,c3,d3,R3
D4,T4,a4,b4,c4,d4,R4
D5,T5,a5,b5,c5,d5,R5
D6,T6,a6,b6,c6,d6,R6

$ awk -v n=3 -f program.awk file.csv
D,T,a_1,b_1,c_1,d_1,a_2,b_2,c_2,d_2,a_3,b_3,c_3,d_3,R
D1,T1,a1,b1,c1,d1,a2,b2,c2,d2,a3,b3,c3,d3,R3
D2,T2,a2,b2,c2,d2,a3,b3,c3,d3,a4,b4,c4,d4,R4
D3,T3,a3,b3,c3,d3,a4,b4,c4,d4,a5,b5,c5,d5,R5
D4,T4,a4,b4,c4,d4,a5,b5,c5,d5,a6,b6,c6,d6,R6

$ awk -v n=3 -v dcols='1,3' -f program.awk file.csv
D,T,b_1,d_1,b_2,d_2,b_3,d_3,R
D1,T1,b1,d1,b2,d2,b3,d3,R3
D2,T2,b2,d2,b3,d3,b4,d4,R4
D3,T3,b3,d3,b4,d4,b5,d5,R5
D4,T4,b4,d4,b5,d5,b6,d6,R6

The program still assumes that first two columns and last column must be 
excluded from the general processing, and just prepended and appended to 
the output lines. That can be changed though.
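
For instance, the two leading key columns could become a variable, say k 
(an untested sketch of just the changed function):

function do_line(num) {
  line=""
  for(j=k+1;j<NF;j++)             # k leading columns instead of a fixed 2
    if(!((j-k) in dc)){line=line OFS $j (num>0?"_"num:"")}
  return line
}

invoked with -v k=2 to reproduce the current behaviour, and with the
$1 OFS $2 parts replaced by a small loop over the first k fields.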



Re: [gentoo-user] [OT] - command line read *.csv create new file

2009-02-24 Thread Mark Knecht
On Tue, Feb 24, 2009 at 9:48 AM, Etaoin Shrdlu shr...@unlimitedmail.org wrote:
 On Tuesday 24 February 2009, 15:41, Mark Knecht wrote:
 On Tue, Feb 24, 2009 at 2:56 AM, Etaoin Shrdlu
 shr...@unlimitedmail.org wrote: SNIP

  So, in my understanding this is what we want to accomplish so far:
 
  given an input of the form
 
  D1,T1,a1,b1,c1,d1,...,R1
  D2,T2,a2,b2,c2,d2,...,R2
  D3,T3,a3,b3,c3,d3,...,R3
  D4,T4,a4,b4,c4,d4,...,R4
  D5,T5,a5,b5,c5,d5,...,R5
 
  (the ... mean that an  arbitrary number of columns can follow)
 
  You want to group lines by n at a time, keeping the D and T column
  from the first line of each group, and keeping the R column from the
  last line of the group, so for example with n=3 we would have:
 
  D1,T1,a1,b1,c1,d1,...a2,b2,c2,d2,...a3,b3,c3,d3,...R3
  D1,T1,a2,b2,c2,d2,...a3,b3,c3,d3,...a4,b4,c4,d4,...R4
  D1,T1,a3,b3,c3,d3,...a4,b4,c4,d4,...a5,b5,c5,d5,...R5
 
  (and you're right, that produces an output that is roughly n times
  the size of the original file)
 
  Now, in addition to that, you also want to drop an arbitrary number
  of columns in the a,b,c... group. So for example, you want to drop
  columns 2 and 3 (b and c in the example), so you'd end up with
  something like
 
  D1,T1,a1,d1,...a2,d2,...a3,d3,...R3
  D1,T1,a2,d2,...a3,d3,...a4,d4,...R4
  D1,T1,a3,d3,...a4,d4,...a5,d5,...R5
 
  Please confirm that my understanding is correct, so I can come up
  with some code to do that.

 Perfectly correct for all the data rows.

 For the header I now see that we have a slightly harder job. What we'd
 need to do is read the first line of the file, duplicate it N times,
 and then drop the same columns as we drop in the rows. The problem is
 that now I have the same header value for N columns which won't make
 sense to the tool that uses this data. If we could read the header and
 then automatically postpend the number N to each duplicated name. (or
 some string like _N)

 So in the last example the header would be something like

 D,T,a,b,c,d,...,R

 in the original file, and would become

 D,T,a_1,b_1,c_1,d_1,...a_2,b_2,c_2,d_2,...a_3,b_3,c_3,d_3,...R

 and

 D,T,a_1,d_1,...a_2,d_2,...a_3,d_3,...R

 respectively for the two sample outputs above.

 Maybe better would be a separate small program to do the header part
 and then this program could read that header and make it the first
 line of the output file. My worry is that when this data file becomes
 very large - say 1GB or more of data - I probably cannot open the file
 with vi to edit the header. Better if I could put the header in its
 own file. That file would be 1 line long. I could check it for the
 name edits, make sure it's right, and then the program you are so
 kindly building would just read it, cut out columns, and put it at the
 start of the new large file.

 Does that make sense?

 Maybe, but while we're at it my personal preference would be to just add
 header handling to the existing program. Here's the revised code (also
 cleaned up a bit and more structured):

 # returns a ",a1,b1,c1,.." line
 # drops unwanted columns
 function do_line(num) {
  line=""
  for(j=3;j<NF;j++)
    if(!((j-2) in dc)){line=line OFS $j (num>0?"_"num:"")}
  return line
 }

 # dcols is a string like '2,3,4,7' with a comma separated values list of
 # the columns to drop (first data column after date/time is column 1
 # here)

 BEGIN {FS=OFS=","
       t=split(dcols,tmp,/,/)
       for(i=1;i<=t;i++)dc[tmp[i]]
      }

 # process the header
 NR==1{
  l=$1 OFS $2;
  for(i=1;i<=n;i++)l=l do_line(i)
  l=l OFS $NF
  print l
 }

 NR>=2{
  for(i=1;i<n;i++){
    s[i]=s[i+1]
    dt[i]=dt[i+1]
  }

  dt[n]=$1 OFS $2
  s[n]=do_line(-1)

  if(NR>=n+1){
    l=dt[1]
    for(i=1;i<=n;i++)l=l s[i]
    l=l OFS $NF
    print l
  }
 }'

 I think two examples are enough here to demonstrate how it works (blank
 lines between commands added for clarity - remember first line is header
 line):

 $ cat file.csv
 D,T,a,b,c,d,R
 D1,T1,a1,b1,c1,d1,R1
 D2,T2,a2,b2,c2,d2,R2
 D3,T3,a3,b3,c3,d3,R3
 D4,T4,a4,b4,c4,d4,R4
 D5,T5,a5,b5,c5,d5,R5
 D6,T6,a6,b6,c6,d6,R6

 $ awk -v n=3 -f program.awk file.csv
 D,T,a_1,b_1,c_1,d_1,a_2,b_2,c_2,d_2,a_3,b_3,c_3,d_3,R
 D1,T1,a1,b1,c1,d1,a2,b2,c2,d2,a3,b3,c3,d3,R3
 D2,T2,a2,b2,c2,d2,a3,b3,c3,d3,a4,b4,c4,d4,R4
 D3,T3,a3,b3,c3,d3,a4,b4,c4,d4,a5,b5,c5,d5,R5
 D4,T4,a4,b4,c4,d4,a5,b5,c5,d5,a6,b6,c6,d6,R6

 $ awk -v n=3 -v dcols='1,3' -f program.awk file.csv
 D,T,b_1,d_1,b_2,d_2,b_3,d_3,R
 D1,T1,b1,d1,b2,d2,b3,d3,R3
 D2,T2,b2,d2,b3,d3,b4,d4,R4
 D3,T3,b3,d3,b4,d4,b5,d5,R5
 D4,T4,b4,d4,b5,d5,b6,d6,R6

 The program still assumes that first two columns and last column must be
 excluded from the general processing, and just prepended and appended to
 the output lines. That can be changed though.


Excellent! Worked identically for me once I removed an extra single
quote after the last brace that somehow crept into your copy/paste in
this email. Really good stuff.

I tried it on a 40MB data file from TradeStation with 6 years of 5
minute Dow Futures 

Re: [gentoo-user] [OT] - command line read *.csv create new file

2009-02-23 Thread Etaoin Shrdlu
On Monday 23 February 2009, 00:31, Mark Knecht wrote:

 Yeah, that's probably almost usable as it is. I tried it with n=3 and
 n=10. Worked both times just fine. The initial issue might be (as with
 Willie's sed code) that the first line wasn't quite right and required
 some hand editing. I'd prefer not to have to hand edit anything as the
 files are large and that step will be slow. I can work on that.

But then could you paste an example of such line, so we can see it? The 
first line was not special in the sample you posted...

 As per the message to Willie it would be nice to be able to drop
 columns out but technically I suppose it's not really required. All of
 this is going into another program which must at some level understand
 what the columns are. If I have extra dates and don't use them that's
 probably workable.

Anyway, it's not difficult to add that feature:

BEGIN { FS=OFS="," }
{
  r=$NF;NF--
  for(i=1;i<n;i++){
    s[i]=s[i+1]
    dt[i]=dt[i+1]
    if((NR>=n)&&(i==1))printf "%s%s",dt[1],OFS
    if(NR>=n)printf "%s%s",s[i],OFS
  }
  sep=dt[n]="";for(i=1;i<=dropcol;i++){dt[n]=dt[n] sep $i;sep=OFS}
  sub("^([^,]*,){"dropcol"}","")
  s[n]=$0
  if(NR>=n)printf "%s,%s\n", s[n],r
}

There is a new variable dropcol which contains the number of columns to 
drop. Also, for the above to work, you must add the --re-interval 
command line switch to awk, eg 

awk --re-interval -v n=4 -v dropcol=2 -f program.awk datafile.csv
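
For example, on the three date/time lines posted earlier in the thread,

awk --re-interval -v n=3 -v dropcol=2 -f program.awk datafile.csv

should turn

Date1,Time1,A,B,C,D,0
Date2,Time2,E,F,G,H,1
Date3,Time3,I,J,K,L,2

into

Date1,Time1,A,B,C,D,E,F,G,H,I,J,K,L,2

(dt[] remembers the dropped date/time of the oldest line of each group, 
so it can be put back at the front).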

 The down side is the output file is 10x larger than the input file -
 roughly - and my current input files are 40-60MB so the output files
 will be 600MB. Not huge but if they grew too much more I might get
 beyond what a single file can be on ext3, right? Isn't that 2GB or so?

That is strange, the output file could be bigger but not by that 
factor...if you don't mind, again could you paste a sample input file 
(maybe just some lines, to get an idea...)?



Re: [gentoo-user] [OT] - command line read *.csv create new file

2009-02-23 Thread Mark Knecht
On Mon, Feb 23, 2009 at 1:57 AM, Etaoin Shrdlu shr...@unlimitedmail.org wrote:
 On Monday 23 February 2009, 00:31, Mark Knecht wrote:

 Yeah, that's probably almost usable as it is. I tried it with n=3 and
 n=10. Worked both times just fine. The initial issue might be (as with
 Willie's sed code) that the first line wasn't quite right and required
 some hand editing. I'd prefer not to have to hand edit anything as the
 files are large and that step will be slow. I can work on that.

 But then could you paste an example of such line, so we can see it? The
 first line was not special in the sample you posted...

 As per the message to Willie it would be nice to be able to drop
 columns out but technically I suppose it's not really required. All of
 this is going into another program which must at some level understand
 what the columns are. If I have extra dates and don't use them that's
 probably workable.

 Anyway, it's not difficult to add that feature:

 BEGIN { FS=OFS="," }
 {
  r=$NF;NF--
  for(i=1;i<n;i++){
    s[i]=s[i+1]
    dt[i]=dt[i+1]
    if((NR>=n)&&(i==1))printf "%s%s",dt[1],OFS
    if(NR>=n)printf "%s%s",s[i],OFS
  }
  sep=dt[n]="";for(i=1;i<=dropcol;i++){dt[n]=dt[n] sep $i;sep=OFS}
  sub("^([^,]*,){"dropcol"}","")
  s[n]=$0
  if(NR>=n)printf "%s,%s\n", s[n],r
 }

 There is a new variable dropcol which contains the number of columns to
 drop. Also, for the above to work, you must add the --re-interval
 command line switch to awk, eg

 awk --re-interval -v n=4 -v dropcol=2 -f program.awk datafile.csv

Thanks. I'll give that a try later today. I also like Willie's idea
about using cut. That seems pretty flexible without any programming.


 The down side is the output file is 10x larger than the input file -
 roughly - and my current input files are 40-60MB so the output files
 will be 600MB. Not huge but if they grew too much more I might get
 beyond what a single file can be on ext3, right? Isn't that 2GB or so?

 That is strange, the output file could be bigger but not by that
 factor...if you don't mind, again could you paste a sample input file
 (maybe just some lines, to get an idea...)?



I'm attaching a small (100 line) data file out of TradeStation. Zipped
it's about 2K. It should expand to about 10K. When I run the command
to get 10 lines put together it works correctly and gives me a file
with 91 lines and about 100K in size. (I.e. - 10x on my disk.)

awk -v n=10 -f awkScript1.awk awkDataIn.csv > awkDataOut.csv

No mangling of the first line - that must have been something earlier
I guess. Sorry for the confusion on that front.

One other item has come up as I start to play with this farther down
the tool chain. I want to use this data in either R or RapidMiner to
data mine for patterns. Both of those tools are easier to use if the
first line in the file has column titles. I had originally asked
TradeStation not to output the column titles but if I do then for the
first line of our new file I should actually copy the first line of
the input file N times. Something like

For i=1; read line, write N times, write \n

and then

for i=2 do what we're doing right now.

After I did that I could run it through cut and drop whatever columns
I need to drop, I think... ;-)

This is great help from you all. As someone who doesn't really program
or use the command line too much it's a big advantage. Thanks!

Cheers,
Mark


awkDataIn.csv.bz2
Description: BZip2 compressed data


awkScript1.awk
Description: Binary data


Re: [gentoo-user] [OT] - command line read *.csv create new file

2009-02-23 Thread Etaoin Shrdlu
On Monday 23 February 2009, 17:05, Mark Knecht wrote:

 I'm attaching a small (100 line) data file out of TradeStation. Zipped
 it's about 2K. It should expand to about 10K. When I run the command
 to get 10 lines put together it works correctly and gives me a file
 with 91 lines and about 100K in size. (I.e. - 10x on my disk.)

 awk -v n=10 -f awkScript1.awk awkDataIn.csv > awkDataOut.csv

 No mangling of the first line - that must have been something earlier
 I guess. Sorry for the confusion on that front.

 One other item has come up as I start to play with this farther down
 the tool chain. I want to use this data in either R or RapidMiner to
 data mine for patterns. Both of those tools are easier to use if the
 first line in the file has column titles. I had originally asked
 TradeStation not to output the column titles but if I do then for the
 first line of our new file I should actually copy the first line of
 the input file N times. Something like

 For i=1; read line, write N times, write \n

 and then

 for i=2 do what we're doing right now.

That is actually accomplished just by adding a bit of code:

BEGIN {FS=OFS=","}

NR==1{for(i=1;i<=n;i++){printf "%s%s", sep, $0;sep=OFS};print ""} # header
NR>=2{
  r=$NF;NF--
  for(i=1;i<n;i++){
    s[i]=s[i+1]
    dt[i]=dt[i+1]
    if((NR>=n+1)&&(i==1))printf "%s%s",dt[1],OFS
    if(NR>=n+1)printf "%s%s",s[i],OFS
  }
  sep=dt[n]="";for(i=1;i<=dropcol;i++){dt[n]=dt[n] sep $i;sep=OFS}
  sub("^([^,]*,){"dropcol"}","")
  s[n]=$0
  if(NR>=n+1)printf "%s,%s\n", s[n],r
}

Note that no column is dropped from the header. If you need to do that, 
just tell us how you want to do that.



Re: [gentoo-user] [OT] - command line read *.csv create new file

2009-02-23 Thread Mark Knecht
On Mon, Feb 23, 2009 at 2:18 PM, Etaoin Shrdlu shr...@unlimitedmail.org wrote:
 On Monday 23 February 2009, 17:05, Mark Knecht wrote:

 I'm attaching a small (100 line) data file out of TradeStation. Zipped
 it's about 2K. It should expand to about 10K. When I run the command
 to get 10 lines put together it works correctly and gives me a file
 with 91 lines and about 100K in size. (I.e. - 10x on my disk.)

 awk -v n=10 -f awkScript1.awk awkDataIn.csv > awkDataOut.csv

 No mangling of the first line - that must have been something earlier
 I guess. Sorry for the confusion on that front.

 One other item has come up as I start to play with this farther down
 the tool chain. I want to use this data in either R or RapidMiner to
 data mine for patterns. Both of those tools are easier to use if the
 first line in the file has column titles. I had originally asked
 TradeStation not to output the column titles but if I do then for the
 first line of our new file I should actually copy the first line of
 the input file N times. Something like

 For i=1; read line, write N times, write \n

 and then

 for i=2 do what we're doing right now.

 That is actually accomplished just by adding a bit of code:

 BEGIN {FS=OFS=","}

 NR==1{for(i=1;i<=n;i++){printf "%s%s", sep, $0;sep=OFS};print ""} # header
 NR>=2{
   r=$NF;NF--
   for(i=1;i<n;i++){
     s[i]=s[i+1]
     dt[i]=dt[i+1]
     if((NR>=n+1)&&(i==1))printf "%s%s",dt[1],OFS
     if(NR>=n+1)printf "%s%s",s[i],OFS
   }
   sep=dt[n]="";for(i=1;i<=dropcol;i++){dt[n]=dt[n] sep $i;sep=OFS}
   sub("^([^,]*,){"dropcol"}","")
   s[n]=$0
   if(NR>=n+1)printf "%s,%s\n", s[n],r
 }

 Note that no column is dropped from the header. If you need to do that,
 just tell us how you want to do that.


Thanks. That's a good add.

If I drop columns - and I do need to - then something like how cut
works would be good, but it needs to repeat across all the rows being
used. For instance, if I'm dropping columns 6 & 12 from a 20 column
wide data set, then I'm dropping 6 & 12 from all N lines. This is
where using cut after the line is built is difficult as I'm forced to
figure out a list like 6,12,26,32,46,52, etc. Easy to make a mistake
doing that. If I could say something like "Drop 6 & 12 from all rows,
and 1 & 2 from all rows higher than the first that make up this new
line" then that would be great. That's a lot to ask though.

D1,T1,A1,B1,C1,D1,
D2,T2,A2,B2,C2,D2,
D3,T3,A3,B3,C3,D3,
D4,T4,A4,B4,C4,D4,
D5,T5,A5,B5,C5,D5,

In the data above if I drop column A, then I drop it for all rows.
(For instance, A contains 0 and isn't necessary, etc.)  Assuming 3
wide I'd get

D1,T1,B1,C1,D1,B2,C2,D2,B3,C3,D3
D2,T2,B2,C2,D2,B3,C3,D3,B4,C4,D4
D3,T3,B3,C3,D3,B4,C4,D4,B5,C5,D5

Making that completely flexible - where I can drop 4 or 5 random
columns - is probably a bit too much work. On the other hand maybe
sending it to cut as part of the whole process, line by line or
something, is more reasonable? I don't know.

I found a web site to study awk so I'm starting to see more or less
how your example works when I have the code in front of me. Creating
the code out of thin air might be a bit of a stretch for me at this
point though.

thanks for your help!

Cheers,
Mark



Re: [gentoo-user] [OT] - command line read *.csv create new file

2009-02-22 Thread Etaoin Shrdlu
On Sunday 22 February 2009, 20:06, Mark Knecht wrote:
 Hi,
Very off topic other than I'd do this on my Gentoo box prior to
 using R on my Gentoo box. Please ignore if not of interest.

I've got a really big data file in essentially a *.csv format.
 (comma delimited) I need to scan this file and create a new output
 file. I'm wondering if there is a reasonably easy command line way of
 doing this using something like sed or awk which I know nothing about.
 Thanks in advance.

The basic idea goes something like this:

 1) The input file might look like the following where some of it is
 attributes (shown as letters) and other parts are results. (shown as
 numbers)

 A,B,C,D,1
 E,F,G,H,2
 I,J,K,L,3
 M,N,O,P,4
 Q,R,S,T,5
 U,V,W,X,6

Are the results always in the last field, and only a single field?
Is the total number of fields per line always fixed?

 2) From the above data input file I want to take the attributes from a
 few preceding lines (say 3 in this example) and write them to the
 output file along with the result on the last of the 3 lines. The
 output file might look like this:

 A,B,C,D,E,F,G,H,I,J,K,L,3
 E,F,G,H,I,J,K,L,M,N,O,P,4
 I,J,K,L,M,N,O,P,Q,R,S,T,5
 M,N,O,P,Q,R,S,T,U,V,W,X,6

Is the number of lines you pick for the operation always 3 or can it 
vary? And, once you choose a number n of lines, should the whole file be 
processed concatenating n lines at a time, and the resulting single line 
be ended with the result of the nth line? In other words, does the 
following hold for the output format:

concatenation of attributes of lines 1..n + result of line n
concatenation of attributes of lines 2..n+1 + result of line n+1
concatenation of attributes of lines 3..n+2 + result of line n+2
concatenation of attributes of lines 4..n+3 + result of line n+3
...

With answers to the above questions, it's probably possible to hack 
together a solution.



Re: [gentoo-user] [OT] - command line read *.csv create new file

2009-02-22 Thread Willie Wong
On Sun, Feb 22, 2009 at 11:06:31AM -0800, Penguin Lover Mark Knecht squawked:
I've got a really big data file in essentially a *.csv format.
 (comma delimited) I need to scan this file and create a new output
 file. I'm wondering if there is a reasonably easy command line way of
 doing this using something like sed or awk which I know nothing about.
 Thanks in advance.

Definitely more than doable in sed or awk. If you want a reference
book, try http://oreilly.com/catalog/9781565922259/

Unfortunately I haven't used awk in the longest time and can't
remember how it will go. The following sed recipe may work, modulo
some small modifications

The basic idea goes something like this:
 
 1) The input file might look like the following where some of it is
 attributes (shown as letters) and other parts are results. (shown as
 numbers)
 
 A,B,C,D,1
 E,F,G,H,2
 I,J,K,L,3
 M,N,O,P,4
 Q,R,S,T,5
 U,V,W,X,6
 
 2) From the above data input file I want to take the attributes from a
 few preceding lines (say 3 in this example) and write them to the
 output file along with the result on the last of the 3 lines. The
 output file might look like this:
 
 A,B,C,D,E,F,G,H,I,J,K,L,3
 E,F,G,H,I,J,K,L,M,N,O,P,4
 I,J,K,L,M,N,O,P,Q,R,S,T,5
 M,N,O,P,Q,R,S,T,U,V,W,X,6
 
 3) This must be done as a read/process/write operation of some sort
 because the input file may be far larger than system memory.
 (Currently it isn't, but it likely will eventually be.)
 
 4) In my example above I suggested that there is a single result but
 there may be more than one. (Don't know yet.) I showed 3 lines but
 might be doing 10. I don't know. It's important to me to pick a
 moderately flexible way of dealing with this as the order of columns
 and number of results will likely change over time and I'll certainly
 need to adjust.

First create the sedscript

sedscript1:
--
1 {
N
N
}
{
p
D
N
}
--

The first block only hits when the first line of input is read. It
forces it to read the next two lines. 

The second block hits for every pattern space, it prints the three
line blocks, deletes the first line, and reads the next line. 

Now create the sedscript

sedscript2:
--
{
N
N
s/,[^,]\n/,/gp
d
}
--

This reads a three-line block at a time, removes the last field (and
the new line character) from all but the last line, replacing it with
a comma. Then it prints. And then it clears the pattern space. 

So you can do

cat INPUT | sed -f sedscript1 | sed -f sedscript2

should give you what you want. Like I said, the whole thing can
probably be done a lot more eloquently in awk. But my awk-fu is not
what it used to be. 

For a quick reference for sed, try
http://www.grymoire.com/Unix/Sed.html

W
-- 
Ever stop to think, and forget to start again?
Sortir en Pantoufles: up 807 days, 18:51



Re: [gentoo-user] [OT] - command line read *.csv create new file

2009-02-22 Thread Mark Knecht
On Sun, Feb 22, 2009 at 12:15 PM, Etaoin Shrdlu
shr...@unlimitedmail.org wrote:
 On Sunday 22 February 2009, 20:06, Mark Knecht wrote:
 Hi,
Very off topic other than I'd do this on my Gentoo box prior to
 using R on my Gentoo box. Please ignore if not of interest.

I've got a really big data file in essentially a *.csv format.
 (comma delimited) I need to scan this file and create a new output
 file. I'm wondering if there is a reasonably easy command line way of
 doing this using something like sed or awk which I know nothing about.
 Thanks in advance.

The basic idea goes something like this:

 1) The input file might look like the following where some of it is
 attributes (shown as letters) and other parts are results. (shown as
 numbers)

 A,B,C,D,1
 E,F,G,H,2
 I,J,K,L,3
 M,N,O,P,4
 Q,R,S,T,5
 U,V,W,X,6

 Are the results always in the last field, and only a single field?
 Is the total number of fields per line always fixed?

I don't know that for certain yet but I think the results will not
always be in the last field.

The total number of fields per line is always fixed in a given file
but might change from file to file. If it does I'm willing to do minor
edits (heck - I'll do major edits if I have to!!) to get it working.


 2) From the above data input file I want to take the attributes from a
 few preceding lines (say 3 in this example) and write them to the
 output file along with the result on the last of the 3 lines. The
 output file might look like this:

 A,B,C,D,E,F,G,H,I,J,K,L,3
 E,F,G,H,I,J,K,L,M,N,O,P,4
 I,J,K,L,M,N,O,P,Q,R,S,T,5
 M,N,O,P,Q,R,S,T,U,V,W,X,6

 Is the number of lines you pick for the operation always 3 or can it
 vary? And, once you choose a number n of lines, should the whole file be
 processed concatenating n lines at a time, and the resulting single line
 be ended with the result of the nth line? In other words, does the
 following hold for the output format:

 concatenation of attributes of lines 1..n + result of line n
 concatenation of attributes of lines 2..n+1 + result of line n+1
 concatenation of attributes of lines 3..n+2 + result of line n+2
 concatenation of attributes of lines 4..n+3 + result of line n+3

The above diagram is correct when the number of lines chosen is 3. I suspect
that I might choose 10 or 15 lines once I get real data and do some
testing but that was harder to show in this email. A good design for
me would be a single variable I could set. Once a value is chosen I
want to process every line in the input file the same way. I don't use
5 lines sometimes and 10 lines other times. In a given file it's
always the same number of lines.

 ...

 With answers to the above questions, it's probably possible to hack
 together a solution.

Thanks!

- Mark



Re: [gentoo-user] [OT] - command line read *.csv create new file

2009-02-22 Thread Etaoin Shrdlu
On Sunday 22 February 2009, 23:28, Mark Knecht wrote:

  concatenation of attributes of lines 1..n + result of line n
  concatenation of attributes of lines 2..n+1 + result of line n+1
  concatenation of attributes of lines 3..n+2 + result of line n+2
  concatenation of attributes of lines 4..n+3 + result of line n+3

 The above diagram is correct when the number of lines chosen is 3. I suspect
 that I might choose 10 or 15 lines once I get real data and do some
 testing but that was harder to show in this email. A good design for
 me would be a single variable I could set. Once a value is chosen I
 want to process every line in the input file the same way. I don't use
 5 lines sometimes and 10 lines other times. In a given file it's
 always the same number of lines.

Ok, try this for a start:

BEGIN { FS=OFS="," }

{
  r=$NF;NF--
  for(i=1;i<n;i++){
    s[i]=s[i+1]
    if(NR>=n)printf "%s%s",s[i],OFS
  }
  s[n]=$0;if(NR>=n)printf "%s,%s\n", s[n],r
}

Save the above code in a file (eg, program.awk) and run it with

awk -v n=3 -f program.awk datafile.csv

where the n=3 part is to  be replaced with the actual number of lines 
you want to group (eg, n=5, n=4, etc.)

With your sample input and n=3, the above awk program produces the output 
you show.
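
That is, with your six sample lines saved as datafile.csv:

$ awk -v n=3 -f program.awk datafile.csv
A,B,C,D,E,F,G,H,I,J,K,L,3
E,F,G,H,I,J,K,L,M,N,O,P,4
I,J,K,L,M,N,O,P,Q,R,S,T,5
M,N,O,P,Q,R,S,T,U,V,W,X,6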



Re: [gentoo-user] [OT] - command line read *.csv create new file

2009-02-22 Thread Mark Knecht
On Sun, Feb 22, 2009 at 12:59 PM, Willie Wong ww...@princeton.edu wrote:
 On Sun, Feb 22, 2009 at 11:06:31AM -0800, Penguin Lover Mark Knecht squawked:
I've got a really big data file in essentially a *.csv format.
 (comma delimited) I need to scan this file and create a new output
 file. I'm wondering if there is a reasonably easy command line way of
 doing this using something like sed or awk which I know nothing about.
 Thanks in advance.

 Definitely more than doable in sed or awk. If you want a reference
 book, try http://oreilly.com/catalog/9781565922259/

 Unfortunately I haven't used awk in the longest time and can't
 remember how it will go. The following sed recipe may work, modulo
 some small modifications

The basic idea goes something like this:

 1) The input file might look like the following where some of it is
 attributes (shown as letters) and other parts are results. (shown as
 numbers)

 A,B,C,D,1
 E,F,G,H,2
 I,J,K,L,3
 M,N,O,P,4
 Q,R,S,T,5
 U,V,W,X,6

 2) From the above data input file I want to take the attributes from a
 few preceding lines (say 3 in this example) and write them to the
 output file along with the result on the last of the 3 lines. The
 output file might look like this:

 A,B,C,D,E,F,G,H,I,J,K,L,3
 E,F,G,H,I,J,K,L,M,N,O,P,4
 I,J,K,L,M,N,O,P,Q,R,S,T,5
 M,N,O,P,Q,R,S,T,U,V,W,X,6

 3) This must be done as a read/process/write operation of some sort
 because the input file may be far larger than system memory.
 (Currently it isn't, but it likely will eventually be.)

 4) In my example above I suggested that there is a single result but
 there may be more than one. (Don't know yet.) I showed 3 lines but
 might be doing 10. I don't know. It's important to me to pick a
 moderately flexible way of dealing with this as the order of columns
 and number of results will likely change over time and I'll certainly
 need to adjust.

 First create the sedscript

 sedscript1:
 --
 1 {
N
N
 }
 {
p
D
N
 }
 --

 The first block only hits when the first line of input is read. It
 forces it to read the next two lines.

 The second block hits for every pattern space, it prints the three
 line blocks, deletes the first line, and reads the next line.

 Now create the sedscript

 sedscript2:
 --
 {
N
N
s/,[^,]\n/,/gp
d
 }
 --

 This reads a three-line block at a time, removes the last field (and
 the new line character) from all but the last line, replacing it with
 a comma. Then it prints. And then it clears the pattern space.

 So you can do

 cat INPUT | sed -f sedscript1 | sed -f sedscript2

 should give you what you want. Like I said, the whole thing can
 probably be done a lot more eloquently in awk. But my awk-fu is not
 what it used to be.

 For a quick reference for sed, try
 http://www.grymoire.com/Unix/Sed.html

 W
 --
 Ever stop to think, and forget to start again?
 Sortir en Pantoufles: up 807 days, 18:51

Thanks Willie. That's a good start. The first two lines out were
mangled but that's totally cool. I can deal with that by hand.

There are two places where I'd like to improve things which probably
apply to the awk code Etaoin just sent me also. Both have to do with
excluding columns but in different ways.

1) My actual input data starts with two fields which are date & time. For
lines 2 & 3 I need to exclude the 2nd & 3rd date & time from the output
corresponding to line 1, so these 3 lines:

Date1,Time1,A,B,C,D,0
Date2,Time2,E,F,G,H,1
Date3,Time3,I,J,K,L,2

should generate

Date1,Time1,A,B,C,D,E,F,G,H,I,J,K,L,2

Essentially Date & Time from line 1, results from line 3.

2) The second is that possibly I don't need attribute G in my output
file. I'm thinking that possibly a 3rd sed script that counts a
certain number of commas and then doesn't copy up through the next
comma? That's messy in the sense that I probably need to drop 10-15
columns out as my real data is maybe 100 fields wide so I'd have 10-15
additional scripts, which is too much of a hack to be maintainable.
Anyway, I appreciate the ideas. What you sent worked great.

I suspect this is somehow similar to what you did in the second
script? I'll go play around and see if I can figure that out.

In reality I'm not sure yet whether the results can be guaranteed to
be at the end in the real file, and probably there will be more than
one result column although if I have to I might be able to take care
of combining two results into a single value at the data source if
necessary.

Great help! Thanks!

Cheers,
Mark



Re: [gentoo-user] [OT] - command line read *.csv create new file

2009-02-22 Thread Mark Knecht
On Sun, Feb 22, 2009 at 2:57 PM, Etaoin Shrdlu shr...@unlimitedmail.org wrote:
 On Sunday 22 February 2009, 23:28, Mark Knecht wrote:

  concatenation of attributes of lines 1..n + result of line n
  concatenation of attributes of lines 2..n+1 + result of line n+1
  concatenation of attributes of lines 3..n+2 + result of line n+2
  concatenation of attributes of lines 4..n+3 + result of line n+3

 The above diagram is correct when the number of lines chosen is 3. I suspect
 that I might choose 10 or 15 lines once I get real data and do some
 testing but that was harder to show in this email. A good design for
 me would be a single variable I could set. Once a value is chosen I
 want to process every line in the input file the same way. I don't use
 5 lines sometimes and 10 lines other times. In a given file it's
 always the same number of lines.

 Ok, try this for a start:

 BEGIN { FS=OFS="," }

 {
  r=$NF;NF--
  for(i=1;i<n;i++){
    s[i]=s[i+1]
    if(NR>=n)printf "%s%s",s[i],OFS
  }
  s[n]=$0;if(NR>=n)printf "%s,%s\n", s[n],r
 }

 Save the above code in a file (eg, program.awk) and run it with

 awk -v n=3 -f program.awk datafile.csv

 where the n=3 part is to  be replaced with the actual number of lines
 you want to group (eg, n=5, n=4, etc.)

 With your sample input and n=3, the above awk program produces the output
 you show.



Yeah, that's probably almost usable as it is. I tried it with n=3 and
n=10. Worked both times just fine. The initial issue might be (as with
Willie's sed code) that the first line wasn't quite right and required
some hand editing. I'd prefer not to have to hand edit anything as the
files are large and that step will be slow. I can work on that.

As per the message to Willie it would be nice to be able to drop
columns out but technically I suppose it's not really required. All of
this is going into another program which must at some level understand
what the columns are. If I have extra dates and don't use them that's
probably workable.

The down side is the output file is 10x larger than the input file -
roughly - and my current input files are 40-60MB so the output files
will be 600MB. Not huge but if they grew too much more I might get
beyond what a single file can be on ext3, right? Isn't that 2GB or so?

Thanks very much,
Mark



Re: [gentoo-user] [OT] - command line read *.csv create new file

2009-02-22 Thread Willie Wong
On Sun, Feb 22, 2009 at 03:15:09PM -0800, Penguin Lover Mark Knecht squawked:
 1) My actual input data starts with two fields which are date & time. For
 lines 2 & 3 I need to exclude the 2nd & 3rd date & time from the output
 corresponding to line 1, so these 3 lines:
 
 Date1,Time1,A,B,C,D,0
 Date2,Time2,E,F,G,H,1
 Date3,Time3,I,J,K,L,2
 
 should generate
 
 Date1,Time1,A,B,C,D,E,F,G,H,I,J,K,L,2
 
 Essentially Date & Time from line 1, results from line 3.
 
 2) The second is that possibly I don't need attribute G in my output
 file. I'm thinking that possibly a 3rd sed script that counts a
 certain number of commas and then doesn't copy up through the next
 comma? That's messy in the sense that I probably need to drop 10-15
 columns out as my real data is maybe 100 fields wide so I'd have 10-15
 additional scripts, which is too much of a hack to be maintainable.
 Anyway, I appreciate the ideas. What you sent worked great.
 

For both of these cases, since you are dropping columns and not
re-organizing, you'd have a much easier time just piping the command
through cut. Try 'man cut' (it is only a few hundred words) for
usage. But with the sample you gave me, you just need to post process
with

 | cut -d , -f 1-6,9,10,12,15-

and the Date2, Time2, G, Date3, Time3 columns will be dropped. 
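
A quick way to convince yourself of the field numbering is to feed cut one
assembled line (Date2/Time2 are fields 7-8, G is field 11, Date3/Time3 are
13-14):

$ echo 'Date1,Time1,A,B,C,D,Date2,Time2,E,F,G,H,Date3,Time3,I,J,K,L,2' \
    | cut -d , -f 1-6,9,10,12,15-
Date1,Time1,A,B,C,D,E,F,H,I,J,K,L,2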

As to your problem with the first two lines being mangled: I suspect
that the first two lines were formatted differently? Maybe stray
control characters got into your file or maybe there are leading
spaces? It's bizarre for both Etaoin's and my scripts to
coincidentally mess up the same lines. 

(Incidentally, where did you get the csv files from? When I worked in
a physics lab and collected data, I found that a lot of times the
processing of data using basic command-line tools like sed, bash,
perl, and bc can be done a lot more quickly if the initial datasets were
formatted in a sensible fashion. Of course there are times when such
luxury cannot be afforded.)

Best, 

W
-- 
What's the Lagrangian for a suction dart?
~DeathMech, Some Student. P-town PHY 205
Sortir en Pantoufles: up 807 days, 23:29



Re: [gentoo-user] [OT] - command line read *.csv create new file

2009-02-22 Thread Mark Knecht
On Sun, Feb 22, 2009 at 4:57 PM, Willie Wong ww...@princeton.edu wrote:
 On Sun, Feb 22, 2009 at 03:15:09PM -0800, Penguin Lover Mark Knecht squawked:
 1) My actual input data starts with two fields which are date & time. For
 lines 2 & 3 I need to exclude the 2nd & 3rd date & time from the output
 corresponding to line 1, so these 3 lines:

 Date1,Time1,A,B,C,D,0
 Date2,Time2,E,F,G,H,1
 Date3,Time3,I,J,K,L,2

 should generate

 Date1,Time1,A,B,C,D,E,F,G,H,I,J,K,L,2

 Essentially Date & Time from line 1, results from line 3.

 2) The second is that possibly I don't need attribute G in my output
 file. I'm thinking that possibly a 3rd sed script that counts a
 certain number of commas and then doesn't copy up through the next
 comma? That's messy in the sense that I probably need to drop 10-15
 columns out as my real data is maybe 100 fields wide so I'd have 10-15
 additional scripts, which is too much of a hack to be maintainable.
 Anyway, I appreciate the ideas. What you sent worked great.


 For both of these cases, since you are dropping columns and not
 re-organizing, you'd have a much easier time just piping the command
 through cut. Try 'man cut' (it is only a few hundred words) for
 usage. But with the sample you gave me, you just need to post process
 with

  | cut -d , -f 1-6,9,10,12,15-

 and the Date2, Time2, G, Date3, Time3 columns will be dropped.

Thanks. I'll investigate that tomorrow.


 As to your problem with the first two lines being mangled: I suspect
 that the first two lines were formatted differently? Maybe stray
 control characters got into your file or maybe there are leading
 spaces? It's bizarre for both Etaoin's and my scripts to
 coincidentally mess up the same lines.

 (Incidentally, where did you get the csv files from? When I worked in
 a physics lab and collected data, I found that a lot of times the
 processing of data using basic command-line tools like sed, bash,
 perl, and bc can be done a lot more quickly if the initial datasets were
 formatted in a sensible fashion. Of course there are times when such
 luxury cannot be afforded.)

They are primarily coming from TradeStation. The data that I'm
working with is stock pricing data along with technical indicators
coming off of charts. Unfortunately I don't seem to have any control
at all as to the order that the columns show up. It doesn't seem to be
based on how I build the chart and certain things on the chart I don't
need are still output to the file. It's pretty much take 100% of
what's on the chart or take nothing.

Fortunately the csv files are very good in terms of not dropping out
data. At least every row has all the data.

Cheers,
Mark


 Best,

 W
 --
 What's the Lagrangian for a suction dart?
 ~DeathMech, Some Student. P-town PHY 205
 Sortir en Pantoufles: up 807 days, 23:29





Re: [gentoo-user] [OT] - command line read *.csv create new file

2009-02-22 Thread Paul Hartman
On Sun, Feb 22, 2009 at 5:31 PM, Mark Knecht markkne...@gmail.com wrote:
 The down side is the output file is 10x larger than the input file -
 roughly - and my current input files are 40-60MB so the output files
 will be 600MB. Not huge but if they grew too much more I might get
 beyond what a single file can be on ext3, right? Isn't that 2GB or so?

The maximum file size for ext3 depends on the block size you're using,
but I believe with the default settings it is 2TB max, not 2GB. I have
personally had files as large as 26GB on ext3 without issues.

You could always pipe the whole operation through gzip/bzip2/lzma
or similar if you want the file on disk to be much smaller.
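
For example (file names here are just placeholders):

awk -v n=10 -f program.awk awkDataIn.csv | gzip > awkDataOut.csv.gz
zcat awkDataOut.csv.gz | head    # peek at the result without unpacking

and R can read such a file directly through gzfile(), so the uncompressed
copy may never need to exist on disk at all.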