Re: Fast diff command for large files?

2005-11-07 Thread Kirk Strauser
On Sunday 06 November 2005 07:39, Andrew P. wrote:

 Note, that the difference must be kept in RAM, so it won't work if there 
 are multi-gig diffs, but it will work very fast if the diffs are only 
 10-100Mb, it will work at close to I/O speed if the diff is under 10Mb.  

Thanks, Andrew!  My Python script runs that algorithm in 17 seconds on a 
400MB file with 10% CPU.

For anyone interested, here's my implementation.  Note that the readline() 
method in Python always returns something, even at EOF (at which point you 
get an empty string).  Also, empty strings evaluate as false, which is 
why the if not (oldline or newline): break code exits at the end.

old_records = []
new_records = []

while 1:
oldline, newline = oldfile.readline(), newfile.readline()
if not (oldline or newline):
break
if oldline == newline:
continue

try:
new_records.remove(oldline)
except ValueError:
if oldline:
old_records.append(oldline)

try:
old_records.remove(newline)
except ValueError:
if newline:
new_records.append(newline)

 Hope this gives you some idea.

It did.  It must've been a long work week, because that all seems so obvious 
in retrospect but was completely opaque at the time.  Thanks again!
-- 
Kirk Strauser


pgpd16wKmA9g0.pgp
Description: PGP signature


Re: Fast diff command for large files?

2005-11-07 Thread Kirk Strauser
On Sunday 06 November 2005 22:19, Olivier Nicole wrote:

 if you have access to the legacy/FoxPro application, it should be
 modifed to add a timestamp to each reccord modification.

Don't underestimate the strength of the word legacy.  To be honest, if we 
had the manhours to rewrite it, we'd take the opportunity to run it 
directly against the PostgreSQL server.

What we're gaining out of this system is the ability to migrate our old 
applications at our leisure, and the opportunity to write new applications 
against a real RDBMS that contains the same data.  It's certainly not the 
ideal, but it's working out very well for us.
-- 
Kirk Strauser


pgplHnuNQg71N.pgp
Description: PGP signature


Re: Fast diff command for large files?

2005-11-07 Thread [EMAIL PROTECTED]

Kirk Strauser writes:

Our legacy application runs on FoxPro.  Our web application runs on a 
PostgreSQL database that's a mirror of the FoxPro tables.


I had the same setup a while back.
A few suggestions.
* Add a date/changed field in Foxpro and update.
* If only recent records are updated, only copy those over
* Add a changed flag in Foxpro tables
* Load entire foxpro tables every time and do a delete/reload withing a 
single transaction.


The idea situation is if you can somehow segreate your older, non changeable 
data, and copy every day only records that are recent.. even if they have 
not changed.


What type of system is this? In particular do any record can be modified or 
are only recent records changed?

___
freebsd-questions@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-questions
To unsubscribe, send any mail to [EMAIL PROTECTED]


Re: Fast diff command for large files?

2005-11-07 Thread Kirk Strauser
On Monday 07 November 2005 10:40, [EMAIL PROTECTED] wrote:

 I had the same setup a while back.
 A few suggestions.

Thanks for the tips; unfortunately, any fix that involves touching the 
FoxPro code is basically impossible.  It's not that we *can't*, but that 
the sole FoxPro programmer at our company is completely occupied with other 
projects.

 What type of system is this? In particular do any record can be modified
 or are only recent records changed?

Nope - every line in each table is subject to change.

Here's how our current system works:

1) Copy each FoxPro table file (and associated memo file if one exists) to a 
Unix server via Samba.
2) Run my modified version of the xbase program to convert each table to a 
tab-delimited file that can be loaded into PostgreSQL using the copy 
table command.  These files are named foo.dump, bar.dump, etc.
3) If foo.dump-old exists:
a) Using Andrew's algorithm, get the difference between foo.dump-old and
   foo.dump.  Write these out as a set of delete from ... commands and
   a copy table command.  Pipe this relatively tiny file into the
   psql command to upload the modifications.
  Otherwise:
b) Use the psql command to upload foo.dump
4) mv foo.dump foo.dump-old
5) Profit!

I've already cut the runtime in half.  The next big step is going to be 
getting our Windows admin to install rsync on the fileserver so that we can 
minimize the time spent in step one.  With the exception of the space 
required by keeping the old version of the dump files (step 4), this is 
exceeding all of our performance expectations by a wide margin.

Even better, step 3a cuts the time that the PostgreSQL server has to spend 
committing the new data by several orders of magnitude.  The net effect is 
that our web visitors don't see a noticeable slowdown during the import 
stage.   
-- 
Kirk Strauser


pgpqY5JpRZNQO.pgp
Description: PGP signature


Re: Fast diff command for large files?

2005-11-07 Thread Olivier Nicole
 Don't underestimate the strength of the word legacy.  To be honest, if we=
 had the manhours to rewrite it, we'd take the opportunity to run it=20
 directly against the PostgreSQL server.

 What we're gaining out of this system is the ability to migrate our old=20
 applications at our leisure, and the opportunity to write new applications=

OK, I didn't get that it was a temporary solution (during the
migration period).

My comment was more in the spirit of a permanent solution.

Goo dluck,

Olivier
___
freebsd-questions@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-questions
To unsubscribe, send any mail to [EMAIL PROTECTED]


Re: Fast diff command for large files?

2005-11-06 Thread Kirk Strauser
On Friday 04 November 2005 02:04 pm, you wrote:

 Does the overall order of lines change every time you dump the tables?

No, although an arbitrary number of lines might get deleted.

 If it does/can, then there's a trivial solution (a few lines in perl, or a
 hundred lines in C) that'll make the speed roughly similar to that of I/O.
  

Could you elaborate?  That's been bugging me all weekend.  I know I should 
know this, but I can't quite put my finger on it.
-- 
Kirk Strauser


pgpN5vT0gOSPn.pgp
Description: PGP signature


Re: Fast diff command for large files?

2005-11-06 Thread Andrew P.
On 11/6/05, Kirk Strauser [EMAIL PROTECTED] wrote:
 On Friday 04 November 2005 02:04 pm, you wrote:

  Does the overall order of lines change every time you dump the tables?

 No, although an arbitrary number of lines might get deleted.

  If it does/can, then there's a trivial solution (a few lines in perl, or a
  hundred lines in C) that'll make the speed roughly similar to that of I/O.

 Could you elaborate?  That's been bugging me all weekend.  I know I should
 know this, but I can't quite put my finger on it.
 --
 Kirk Strauser




while (there are more records) {
 a = read (line from old file)
 b = read (line from new file)
 if (a == b) then next
 if (a  b) {
  if (a in new_records) {
   get a out of new_records
   next
  }
  if (b in old_records) {
   get b out of old_records
   next
  }
  put a in old_records
  put b in new_records
}

after that old_records will contain records present in old
file, but not in new file, and new_records will contain
records present in new file, but not old one.

Note, that the difference must be kept in RAM, so it
won't work if there are multi-gig diffs, but it will work
very fast if the diffs are only 10-100Mb, it will work at
close to I/O speed if the diff is under 10Mb.

If the records can be ordered in a known order (e.g.
alphabetically), we don't need to keep anything in
RAM then and make any checks at all. Let's
assume an ascending order (1-2-5-7-31-...):

while (there are more records) {
 a = read (line from old file)
 b = read (line from new file)
 while (a  b) {
  if (a  b) then {
   write a to old_records
   read next a
  }
  if (a  b) then {
   write b to new_records
   read next b
  }
 }
}

If course, you've got to add some checks to
deal with EOF correctly.

Hope this gives you some idea.
___
freebsd-questions@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-questions
To unsubscribe, send any mail to [EMAIL PROTECTED]


Re: Fast diff command for large files?

2005-11-06 Thread Olivier Nicole
 We do the mirroring by running a program that dumps the FoxPro
 tables out as tab-delimited files.  Thus far, we'd been using
 PostgreSQL's copy from command to read those files into the
 database.  In reality, though, a very, very small percentage of
 rows in those tables actually change.  So, I wrote a program
 that takes the output of diff and converts it into a series of

I think the problem could be considered another way.

if you have access to the legacy/FoxPro application, it should be
modifed to add a timestamp to each reccord modification.

Then you could only dump those reccords that were modified since the
last change.

That seems to me the only long term viable solution that could sizeup
nicely with your set of data.

Not to mention that once implemented it would be much much faster.

Olivier
___
freebsd-questions@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-questions
To unsubscribe, send any mail to [EMAIL PROTECTED]


Re: Fast diff command for large files?

2005-11-05 Thread Jan Grant
On Fri, 4 Nov 2005, Kirk Strauser wrote:

 thinking out loud
 I wonder if rsync could be modified to output its patches rather than 
 silently applying them to a target file.  It seems to be pretty good at 
 comparing large files quickly...
 /thinking

More thinking out loud: since these are database dumps, they're 
order-independent. So, sort the files into a predictable lexical order, 
then a diff is a linear operation over tonight's file versus last 
night's.


-- 
jan grant, ILRT, University of Bristol. http://www.ilrt.bris.ac.uk/
Tel +44 (0)117 3317661   http://ioctl.org/jan/
Rereleasing dolphins into the wild since 1998.
___
freebsd-questions@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-questions
To unsubscribe, send any mail to [EMAIL PROTECTED]


Fast diff command for large files?

2005-11-04 Thread Kirk Strauser
I need to routinely find the diffs between two multigigabyte text files 
(exporting a set of FoxPro tables to a PostgreSQL database without doing a 
complete dump/reload each time, in case you were wondering).  GNU diff from 
the base system and from ports chokes.  The textproc/2bsd-diff works OK, 
but is glacially slow.

I'm basically looking for something that generates easily-parseable text.  
Since this is a young project, I don't particularly care if the output 
format is different from diff's.  Any suggestions?
-- 
Kirk Strauser
___
freebsd-questions@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-questions
To unsubscribe, send any mail to [EMAIL PROTECTED]


Re: Fast diff command for large files?

2005-11-04 Thread Chuck Swiger

Kirk Strauser wrote:
I need to routinely find the diffs between two multigigabyte text files 
(exporting a set of FoxPro tables to a PostgreSQL database without doing a 
complete dump/reload each time, in case you were wondering).  GNU diff from 
the base system and from ports chokes.  The textproc/2bsd-diff works OK, 
but is glacially slow.


Multigigabyte?  Find another approach to solving the problem, a text-base diff 
is going to require excessive resources and time.  A 64-bit platform with 2 GB 
of RAM  3GB of swap requires ~1000 seconds to diff ~400MB.


--
-Chuck
___
freebsd-questions@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-questions
To unsubscribe, send any mail to [EMAIL PROTECTED]


Re: Fast diff command for large files?

2005-11-04 Thread Kirk Strauser
On Friday 04 November 2005 10:22, Chuck Swiger wrote:

 Multigigabyte?  Find another approach to solving the problem, a text-base
 diff is going to require excessive resources and time.  A 64-bit platform
 with 2 GB of RAM  3GB of swap requires ~1000 seconds to diff ~400MB.

There really aren't many options.  For the patient, here's what's happening:

Our legacy application runs on FoxPro.  Our web application runs on a 
PostgreSQL database that's a mirror of the FoxPro tables.

We do the mirroring by running a program that dumps the FoxPro tables out as 
tab-delimited files.  Thus far, we'd been using PostgreSQL's copy from 
command to read those files into the database.  In reality, though, a very, 
very small percentage of rows in those tables actually change.  So, I wrote 
a program that takes the output of diff and converts it into a series of 
delete and insert commands; benchmarking shows that this is roughly 300 
times faster in our use.

And that's why I need a fast diff.  Even if it takes as long as the database 
bulk loads, we can run it on another server and use 20 seconds of CPU for 
PostgreSQL instead of 45 minutes.  The practical upshot is that the 
database will never get sluggish, even if the other diff server is loaded 
to the gills.
-- 
Kirk Strauser


pgp8crJHkPVTm.pgp
Description: PGP signature


Re: Fast diff command for large files?

2005-11-04 Thread Charles Swiger

On Nov 4, 2005, at 12:29 PM, Kirk Strauser wrote:
Multigigabyte?  Find another approach to solving the problem, a  
text-base
diff is going to require excessive resources and time.  A 64-bit  
platform

with 2 GB of RAM  3GB of swap requires ~1000 seconds to diff ~400MB.


There really aren't many options.  For the patient, here's what's  
happening:

[ ... ]
And that's why I need a fast diff.  Even if it takes as long as the  
database
bulk loads, we can run it on another server and use 20 seconds of  
CPU for

PostgreSQL instead of 45 minutes.  The practical upshot is that the
database will never get sluggish, even if the other diff server  
is loaded

to the gills.


OK, but even if only one line out of 1000 changes, you still can't  
make either diff or Colin Percival's bsdiff run on gigabyte sized  
files and have it fit into MAXDSIZE on 32-bit address space.  From  
the latter's website:


bsdiff is quite memory-hungry. It requires max(17*n,9*n+m)+O(1)  
bytes of memory, where n is the size of the old file and m is the  
size of the new file.  bspatch requires n+m+O(1) bytes.  bsdiff runs  
in O((n+m) log n) time; on a 200MHz Pentium Pro, building a binary  
patch for a 4MB file takes about 90 seconds. bspatch runs in O(n+m)  
time; on the same machine, applying that patch takes about two seconds.


Some time ago, I wrote a quick test harness for diff here:

   http://www.pkix.net/~chuck/difftest.py

On a 5.4 machine with kern.dfldsiz=1G set in /boot/loader.conf, you  
can only manage to run diff on files up to about 120 MB in size:


31-pi% ./difftest.py -v
INFO: beginning diff trial run with ratio = 100
filea_size=10485760 (aka 10.000 MB)
time=1.370 filea_size=10MB diff_size=818KB
filea_size=15728640 (aka 15.000 MB)
time=2.305 filea_size=15MB diff_size=1229KB
filea_size=23592960 (aka 22.500 MB)
time=5.443 filea_size=22MB diff_size=1844KB
filea_size=35389440 (aka 33.750 MB)
time=7.195 filea_size=33MB diff_size=2768KB
filea_size=53084160 (aka 50.625 MB)
time=16.771 filea_size=50MB diff_size=4163KB
filea_size=79626240 (aka 75.938 MB)
time=43.525 filea_size=75MB diff_size=6257KB
filea_size=119439360 (aka 113.906 MB)
time=78.346 filea_size=113MB diff_size=9MB
filea_size=179159040 (aka 170.859 MB)
diff: memory exhausted
NOTICE: diff exitted with errno 2
time=36.896 filea_size=170MB diff_size=0KB
272.58s real  154.73s user  13.23s system  61%

On a 64-bit SPARC box mentioned above, you can get sizes up to ~400 MB:

[ ... ]
filea_size=119439360 (aka 113.906 MB)
time=140.650 filea_size=115MB diff_size=9MB
filea_size=179159040 (aka 170.859 MB)
time=424.586 filea_size=172MB diff_size=15MB
filea_size=268738560 (aka 256.289 MB)
time=546.334 filea_size=258MB diff_size=22MB
filea_size=403107840 (aka 384.434 MB)
time=957.059 filea_size=388MB diff_size=33MB
filea_size=604661760 (aka 576.650 MB)
diff: memory exhausted
NOTICE: diff exitted with errno 2
time=105.728 filea_size=582MB diff_size=0KB
5610.90s real  3268.63s user  1761.90s system  89%

Roughly, you need about an order of magnitude more RAM or virtual  
memory available then the size of the files you are trying to diff,  
even if the files are very similar.


--
-Chuck

___
freebsd-questions@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-questions
To unsubscribe, send any mail to [EMAIL PROTECTED]


Re: Fast diff command for large files?

2005-11-04 Thread Kirk Strauser
On Friday 04 November 2005 13:39, Charles Swiger wrote:

 OK, but even if only one line out of 1000 changes, you still can't
 make either diff or Colin Percival's bsdiff run on gigabyte sized
 files and have it fit into MAXDSIZE on 32-bit address space.

For the record, textproc/2bsd-diff works fine - it's just slow.

thinking out loud
I wonder if rsync could be modified to output its patches rather than 
silently applying them to a target file.  It seems to be pretty good at 
comparing large files quickly...
/thinking
-- 
Kirk Strauser


pgpdXELVcLCLb.pgp
Description: PGP signature


Re: Fast diff command for large files?

2005-11-04 Thread Andrew P.
On 11/4/05, Kirk Strauser [EMAIL PROTECTED] wrote:
 On Friday 04 November 2005 10:22, Chuck Swiger wrote:

  Multigigabyte?  Find another approach to solving the problem, a text-base
  diff is going to require excessive resources and time.  A 64-bit platform
  with 2 GB of RAM  3GB of swap requires ~1000 seconds to diff ~400MB.

 There really aren't many options.  For the patient, here's what's happening:

 Our legacy application runs on FoxPro.  Our web application runs on a
 PostgreSQL database that's a mirror of the FoxPro tables.

 We do the mirroring by running a program that dumps the FoxPro tables out as
 tab-delimited files.  Thus far, we'd been using PostgreSQL's copy from
 command to read those files into the database.  In reality, though, a very,
 very small percentage of rows in those tables actually change.  So, I wrote
 a program that takes the output of diff and converts it into a series of
 delete and insert commands; benchmarking shows that this is roughly 300
 times faster in our use.

 And that's why I need a fast diff.  Even if it takes as long as the database
 bulk loads, we can run it on another server and use 20 seconds of CPU for
 PostgreSQL instead of 45 minutes.  The practical upshot is that the
 database will never get sluggish, even if the other diff server is loaded
 to the gills.
 --
 Kirk Strauser




Does the overall order of lines change every time
you dump the tables? If not, is there any inexpensive
way to sort them (not alphabetically, but just that
the order stays the same)? If it does/can, then there's
a trivial solution (a few lines in perl, or a hundred
lines in C) that'll make the speed roughly similar
to that of I/O.
___
freebsd-questions@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-questions
To unsubscribe, send any mail to [EMAIL PROTECTED]