On 6/13/07, Robert Wilkins <[EMAIL PROTECTED]> wrote:
>
> The point is: there are lots of data preparation scenarios where
> large numbers of merges need to be done. This is an example where
> Vilno and SAS are easier to use than the competition. I'm sure an Awk
> programmer can come up with something, but the result would be
> awkward.


Agreed.
In the awk+R scenario, it is clear that merges are often better done
with R.
My strategy is to use awk only to clean and reformat data into a tabular
format, and to do most of the "consolidation" (computations, filtering,
merges) in R. I suggested using awk only for manipulations that would be
more complex to do within R (especially multiline records or records with
optional fields). I try to keep the scripts as simple as possible on both
sides.
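
As a rough sketch of that division of labour, here is how awk could flatten
multiline records with an optional field into a tab-separated table that R
can read directly. The record layout and the field names (id, name, age) are
made up for illustration; they do not come from any real data set.

```shell
# Hypothetical input: blank-line-separated records, "age" is optional.
records='id: 1
name: Ann
age: 34

id: 2
name: Bob

id: 3
name: Cy
age: 51'

# RS = "" switches awk to paragraph mode (one record per block of lines);
# FS = "\n" makes each line of the block one field.
table=$(printf '%s\n' "$records" | awk '
BEGIN { RS = ""; FS = "\n"; OFS = "\t"; print "id", "name", "age" }
{
    id = name = age = "NA"            # R reads the string NA as missing
    for (i = 1; i <= NF; i++) {
        split($i, kv, ": ")           # kv[1] is the key, kv[2] the value
        if (kv[1] == "id") id = kv[2]
        else if (kv[1] == "name") name = kv[2]
        else if (kv[1] == "age") age = kv[2]
    }
    print id, name, age
}')
printf '%s\n' "$table"
```

The resulting table has one row per record and a header line, so the merges
and computations can then be done on the R side.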



> Certain aspects of Vilno and SAS are a bit more user-friendly:
> > Each column has a variable name, such as "PatientID".
> > Awk uses $1, $2, $3, as variable names for columns. Not user-friendly.
>
>


In the first lines of awk scripts, I usually assign column numbers to
variables (e.g. "Code=1; Time=3") and then access the fields with "$Code",
"$Time"...
Yet, it is true that it is cumbersome, in awk, to use the labels on the
first line of a file as variable names (my main complaint about awk).
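
A minimal sketch of both approaches, on invented sample data: the first
command names the columns by hand in a BEGIN block, the second builds a
label-to-column lookup from the header line. The array name "col" and the
sample values are only illustrative.

```shell
# Hypothetical sample without a header: code, subject, time.
data='A01 s1 512
B02 s2 498
A01 s3 530'

# Name the columns once, then write $Code and $Time instead of $1 and $3.
# awk is case-sensitive, so the names must match exactly.
named=$(printf '%s\n' "$data" |
    awk 'BEGIN { Code = 1; Time = 3 } { print $Code, $Time }')
printf '%s\n' "$named"

# Hypothetical sample with a header line giving the column labels.
hdata='PatientID Visit Weight
p1 1 70
p2 1 82'

# On the first line, record the position of each label in the array col;
# afterwards, fields are reached through their label.
byname=$(printf '%s\n' "$hdata" | awk '
NR == 1 { for (i = 1; i <= NF; i++) col[$i] = i; next }
{ print $(col["PatientID"]), $(col["Weight"]) }')
printf '%s\n' "$byname"
```

The second form works, but the extra array lookup on every field access is
what makes it feel heavier than true named columns.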

I looked at a few examples of SAS Data step scripts on the Net, and found
that the equivalent awk scripts would be very similar (except for merges),
but there may be manipulations that I missed.


> For scanning inconsistently structured ASCII data files, where
> different rows have different column specifications, Awk is a better
> tool.
>
> For data problems that lend themselves to UNIX-style regular
> expressions, Awk, again, is a great tool.



The examples of messy data formats that were described earlier on the list
are good examples where regular expressions will help a lot. In the very
first stage of data inspection, to detect coding "mistakes", awk (sometimes
with the help of other GNU tools such as 'uniq' and 'sort') can be very
efficient.
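
For instance, a first inspection pass might look like the sketch below. The
data and the column meanings (subject, sex, age) are invented for the
example: the first pipeline tabulates the distinct values of a column, the
second uses a regular expression to list rows whose third field is not a
plain integer.

```shell
# Hypothetical sample with typical coding mistakes: "Male", "femal",
# and an age typed with the letter O instead of a zero.
data='s1 male 23
s2 female 31
s3 Male 27
s4 female 4O
s5 femal 38'

# Frequency table of column 2; odd spellings stand out immediately.
# LC_ALL=C gives a locale-independent (byte-wise) sort order.
# The last awk only tidies the "count value" output of uniq -c.
counts=$(printf '%s\n' "$data" | awk '{ print $2 }' |
    LC_ALL=C sort | uniq -c | awk '{ print $2, $1 }')
printf '%s\n' "$counts"

# Rows whose third field is not made of digits only, with line numbers.
bad=$(printf '%s\n' "$data" | awk '$3 !~ /^[0-9]+$/ { print NR ": " $0 }')
printf '%s\n' "$bad"
```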

> The upshot:

> Awk is a hammer.
> Vilno is a screwdriver.

Nice analogy. Using the right tool for the right task is very important.
So awk and Vilno seem complementary.
Yet, when R enters into the equation, do you still "need" the three tools?

What we should really compare are the four situations:

R alone
R + awk
R + vilno
R + awk + vilno

and maybe "R + SAS Data step"

and see which scripts are more elegant (read 'short and understandable').


Best,

Christophe



-- 
Christophe Pallier (http://www.pallier.org)

