Re: [R] Data file verification protocol

Barry Rowlingson Wed, 19 Mar 2014 00:34:26 -0700

On Wed, Mar 19, 2014 at 4:03 AM, Wolf, Steven <wolfs...@msu.edu> wrote:


> Hi R users,
>
> This isn't a R-specific issue, per-se, but I thought that this list would
> have some helpful input on this topic.  First, a bit of background.  I am
> working on a project which is interested in following approx 1000 students
> each semester, and collects about 15 different measurements about each
> student.  These are both numeric and text, for example grades in a course,
> race, gender, etc.
>
> I am looking for a verification protocol which can look at a data file and
> see if it has been modified.  Ideally, this should be something that I can
> check the file with to see if the file has been changed or corrupted and
> incorporate into my analysis workflow.  (i.e., every time I look at my
> data, I can run this protocol to ensure the file hasn't changed.)
>

 Operating systems keep the last modification time of a file, and you
can use the file.info function in R to check it. However, if someone just
opens and re-saves the file without changing it, that will usually update
the modification time anyway.
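
 As a rough sketch of the modification-time check, assuming you have
recorded a reference time somewhere yourself: the helper name
modified_since and the temporary file below are made up for illustration,
while file.info and Sys.setFileTime are base R functions.

```r
# Hypothetical helper: TRUE if `path` was modified after `since` (a POSIXct time).
modified_since <- function(path, since) {
  mtime <- file.info(path)$mtime
  if (is.na(mtime)) stop("cannot read modification time of: ", path)
  mtime > since
}

# A temporary file stands in for the real data file.
data_file <- tempfile(fileext = ".csv")
writeLines(c("id,grade", "1,3.5"), data_file)

# Pretend the last analysis ran at this reference time.
reference <- as.POSIXct("2014-03-01 00:00:00", tz = "UTC")

# Set the file's mtime explicitly so the demonstration is deterministic.
Sys.setFileTime(data_file, reference - 3600)
older <- modified_since(data_file, reference)

Sys.setFileTime(data_file, reference + 3600)
newer <- modified_since(data_file, reference)

print(c(older = older, newer = newer))
```

 Keep in mind that this only tells you the timestamp moved, not that the
contents changed.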

 The big question you haven't answered is "has the file been changed since
when?". Since you last ran your analysis? Then this looks like a job for
the 'make' utility. In a 'Makefile' you write rules that say how to
create "targets" from their "dependencies". For example:

results.txt: data.dat process.R
	Rscript process.R

- says that "results.txt" depends on "data.dat" (your input data) and
"process.R" (your R code that creates results.txt from data.dat), and runs
"Rscript process.R" if data.dat or process.R has a newer modification
time than results.txt. (In a real Makefile the command line must start with
a tab character.) Run make twice in rapid succession and it won't run R the
second time, because results.txt was just created and so is newer than its
dependencies.
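
 If you'd rather stay inside R, the same timestamp comparison that make
performs can be sketched in a few lines. This is only an illustration of
the rule, not a replacement for make: the function name needs_rebuild and
the temporary directory are invented for the example.

```r
# Hypothetical helper mimicking make's rule: rebuild when the target is
# missing or any dependency has a newer modification time than the target.
needs_rebuild <- function(target, deps) {
  absent <- deps[!file.exists(deps)]
  if (length(absent) > 0) stop("missing dependency: ", paste(absent, collapse = ", "))
  if (!file.exists(target)) return(TRUE)
  any(file.info(deps)$mtime > file.info(target)$mtime)
}

# Set up stand-ins for data.dat, process.R and results.txt in a temp directory.
demo_dir <- tempfile("make_demo")
dir.create(demo_dir)
data_file <- file.path(demo_dir, "data.dat")
script    <- file.path(demo_dir, "process.R")
result    <- file.path(demo_dir, "results.txt")
writeLines("1 2 3", data_file)
writeLines("# analysis code would go here", script)

# No results.txt yet, so a build is needed.
before_build <- needs_rebuild(result, c(data_file, script))

# "Build" the result, then fix the timestamps so the outcome is deterministic.
writeLines("output", result)
t0 <- as.POSIXct("2014-03-19 00:00:00", tz = "UTC")
Sys.setFileTime(data_file, t0)
Sys.setFileTime(script, t0)
Sys.setFileTime(result, t0 + 60)
up_to_date <- needs_rebuild(result, c(data_file, script))

# Touching the data file after the build makes the result stale again.
Sys.setFileTime(data_file, t0 + 120)
after_edit <- needs_rebuild(result, c(data_file, script))

print(c(before_build = before_build, up_to_date = up_to_date, after_edit = after_edit))
```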

 Objects within R don't have timestamps, so it's not possible to
conditionally run an R function if its parameter objects are newer than the
result object. But if you save R objects as .RData files, you can use
"make" based on the timestamps of the .RData files.

 Alternatively you can just keep recent versions of the file around
(1000x15 is pretty small; even multiplied by another 1000 it is still not
exactly Big Data) and compare them. In a Unix environment the "cmp" command
will quickly test two files for equality. Or, if you don't want to store
copies of your file, you can compute a checksum or digest and compare
digests. In a Unix environment you'd typically use the "md5sum" command,
which prints a 128-bit checksum (32 hexadecimal characters) for each file
given as an argument. If the checksum is different, then the file is
different.
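
 The digest comparison can also be done from R itself. Here is a minimal
sketch using md5sum from the 'tools' package that ships with R. The
wrapper name file_digest and the temporary file are assumptions for the
example, and in practice you would store the reference digest somewhere
safe (a text file, your lab notebook) rather than in a variable.

```r
# Hypothetical wrapper: MD5 digest of a file as a plain character string.
file_digest <- function(path) {
  digest <- unname(tools::md5sum(path))
  if (is.na(digest)) stop("cannot read file: ", path)
  digest
}

# A temporary file stands in for the real data file.
data_file <- tempfile(fileext = ".csv")
writeLines(c("id,grade", "1,3.5"), data_file)

# Record the digest once, when the file is known to be good.
stored <- file_digest(data_file)

# Later: recompute and compare before running the analysis.
unchanged <- identical(file_digest(data_file), stored)

# Simulate a modification by appending a row, then check again.
cat("2,4.0\n", file = data_file, append = TRUE)
changed <- !identical(file_digest(data_file), stored)

print(c(unchanged = unchanged, changed = changed))
```

 A check like this at the top of an analysis script (stopping if the
digest doesn't match) would catch accidental edits or corruption.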

 Your use case is still a bit vague - for example you haven't said what the
file format is, or how it's being updated.

Barry

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
