Dirk Eddelbuettel <edd <at> debian.org> writes:
> Ole Tange <ole <at> tange.dk> writes:
> > If 2 awk scripts both open A, B and C then the last one wins and all
> > data written by the first one is lost.
>
> Plonk. I think that may indeed be the case. I had not thought that through.
> I have to find a tool that does this in append mode.
Well, a little "apt-get install gawk-doc" and two seconds of searching led to
the '>>' operator, which appends to files ... and tada, it now works:
edd@max:/tmp/parallel$ rm dataSerial/* dataParallel/*
edd@max:/tmp/parallel$
edd@max:/tmp/parallel$ cat data.txt | \
    awk -v path=dataSerial '{print $0 > (path "/" $1 ".txt")}'
edd@max:/tmp/parallel$ cat data.txt | \
    parallel --pipe -- awk -v path=dataParallel -f script.awk
edd@max:/tmp/parallel$ wc -l dataSerial/*
199762 dataSerial/A.txt
200031 dataSerial/B.txt
200283 dataSerial/C.txt
199845 dataSerial/D.txt
200079 dataSerial/E.txt
1000000 total
edd@max:/tmp/parallel$ wc -l dataParallel/*
199762 dataParallel/A.txt
200031 dataParallel/B.txt
200283 dataParallel/C.txt
199845 dataParallel/D.txt
200079 dataParallel/E.txt
1000000 total
edd@max:/tmp/parallel$
with
edd@max:/tmp/parallel$ cat script.awk
{
    print $0 >> (path "/" $1 ".txt")
}
edd@max:/tmp/parallel$
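To see the two operators side by side, here is a minimal sketch. It assumes a
POSIX shell, any awk, and a mktemp that supports -d. The temporary directory
and the two one-line inputs are made up for illustration and are not part of
the run above. The two awk processes run one after the other rather than in
parallel, which is enough to show the effect: each process that opens a file
with '>' truncates it on first open, so the second one discards what the first
wrote, while '>>' opens the file in append mode and keeps it.

```shell
#!/bin/sh
# Illustrative only: a scratch directory and two tiny inputs, both hypothetical.
tmp=$(mktemp -d)

# '>' : every awk process truncates the target file the first time it opens it,
# so the second process wipes the line written by the first.
printf 'A 1\n' | awk -v path="$tmp" '{ print $0 > (path "/" $1 ".txt") }'
printf 'A 2\n' | awk -v path="$tmp" '{ print $0 > (path "/" $1 ".txt") }'
truncated=$(awk 'END { print NR }' "$tmp/A.txt")

rm -f "$tmp/A.txt"

# '>>' : every awk process opens the target file in append mode,
# so the second process adds to what the first one wrote.
printf 'A 1\n' | awk -v path="$tmp" '{ print $0 >> (path "/" $1 ".txt") }'
printf 'A 2\n' | awk -v path="$tmp" '{ print $0 >> (path "/" $1 ".txt") }'
appended=$(awk 'END { print NR }' "$tmp/A.txt")

echo "lines with '>':  $truncated"
echo "lines with '>>': $appended"

rm -rf "$tmp"
```

Within a single awk process '>' truncates only once and later prints to the
same file are added to it, which is why the serial run above is fine with '>'.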
For reference and completeness, the data generator was the R script below:
edd@max:/tmp/parallel$ cat createData.r
#!/usr/bin/Rscript
N <- 1e6
set.seed(42)
df <- data.frame(key=sample(LETTERS[1:5], N, replace=TRUE),
                 value=rnorm(N))
write.table(df, file="/tmp/parallel/data.txt",
            row.names=FALSE, col.names=FALSE, quote=FALSE)
Thanks, Dirk