Re: Fast diff command for large files?
On Sunday 06 November 2005 07:39, Andrew P. wrote: Note, that the difference must be kept in RAM, so it won't work if there are multi-gig diffs, but it will work very fast if the diffs are only 10-100Mb, it will work at close to I/O speed if the diff is under 10Mb. Thanks, Andrew! My Python script runs that algorithm in 17 seconds on a 400MB file with 10% CPU. For anyone interested, here's my implementation. Note that the readline() method in Python always returns something, even at EOF (at which point you get an empty string). Also, empty strings evaluate as false, which is why the if not (oldline or newline): break code exits at the end. old_records = [] new_records = [] while 1: oldline, newline = oldfile.readline(), newfile.readline() if not (oldline or newline): break if oldline == newline: continue try: new_records.remove(oldline) except ValueError: if oldline: old_records.append(oldline) try: old_records.remove(newline) except ValueError: if newline: new_records.append(newline) Hope this gives you some idea. It did. It must've been a long work week, because that all seems so obvious in retrospect but was completely opaque at the time. Thanks again! -- Kirk Strauser pgpd16wKmA9g0.pgp Description: PGP signature
Re: Fast diff command for large files?
On Sunday 06 November 2005 22:19, Olivier Nicole wrote: if you have access to the legacy/FoxPro application, it should be modifed to add a timestamp to each reccord modification. Don't underestimate the strength of the word legacy. To be honest, if we had the manhours to rewrite it, we'd take the opportunity to run it directly against the PostgreSQL server. What we're gaining out of this system is the ability to migrate our old applications at our leisure, and the opportunity to write new applications against a real RDBMS that contains the same data. It's certainly not the ideal, but it's working out very well for us. -- Kirk Strauser pgplHnuNQg71N.pgp Description: PGP signature
Re: Fast diff command for large files?
Kirk Strauser writes: Our legacy application runs on FoxPro. Our web application runs on a PostgreSQL database that's a mirror of the FoxPro tables. I had the same setup a while back. A few suggestions. * Add a date/changed field in Foxpro and update. * If only recent records are updated, only copy those over * Add a changed flag in Foxpro tables * Load entire foxpro tables every time and do a delete/reload withing a single transaction. The idea situation is if you can somehow segreate your older, non changeable data, and copy every day only records that are recent.. even if they have not changed. What type of system is this? In particular do any record can be modified or are only recent records changed? ___ freebsd-questions@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-questions To unsubscribe, send any mail to [EMAIL PROTECTED]
Re: Fast diff command for large files?
On Monday 07 November 2005 10:40, [EMAIL PROTECTED] wrote: I had the same setup a while back. A few suggestions. Thanks for the tips; unfortunately, any fix that involves touching the FoxPro code is basically impossible. It's not that we *can't*, but that the sole FoxPro programmer at our company is completely occupied with other projects. What type of system is this? In particular do any record can be modified or are only recent records changed? Nope - every line in each table is subject to change. Here's how our current system works: 1) Copy each FoxPro table file (and associated memo file if one exists) to a Unix server via Samba. 2) Run my modified version of the xbase program to convert each table to a tab-delimited file that can be loaded into PostgreSQL using the copy table command. These files are named foo.dump, bar.dump, etc. 3) If foo.dump-old exists: a) Using Andrew's algorithm, get the difference between foo.dump-old and foo.dump. Write these out as a set of delete from ... commands and a copy table command. Pipe this relatively tiny file into the psql command to upload the modifications. Otherwise: b) Use the psql command to upload foo.dump 4) mv foo.dump foo.dump-old 5) Profit! I've already cut the runtime in half. The next big step is going to be getting our Windows admin to install rsync on the fileserver so that we can minimize the time spent in step one. With the exception of the space required by keeping the old version of the dump files (step 4), this is exceeding all of our performance expectations by a wide margin. Even better, step 3a cuts the time that the PostgreSQL server has to spend committing the new data by several orders of magnitude. The net effect is that our web visitors don't see a noticeable slowdown during the import stage. -- Kirk Strauser pgpqY5JpRZNQO.pgp Description: PGP signature
Re: Fast diff command for large files?
Don't underestimate the strength of the word legacy. To be honest, if we= had the manhours to rewrite it, we'd take the opportunity to run it=20 directly against the PostgreSQL server. What we're gaining out of this system is the ability to migrate our old=20 applications at our leisure, and the opportunity to write new applications= OK, I didn't get that it was a temporary solution (during the migration period). My comment was more in the spirit of a permanent solution. Goo dluck, Olivier ___ freebsd-questions@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-questions To unsubscribe, send any mail to [EMAIL PROTECTED]
Re: Fast diff command for large files?
On Friday 04 November 2005 02:04 pm, you wrote: Does the overall order of lines change every time you dump the tables? No, although an arbitrary number of lines might get deleted. If it does/can, then there's a trivial solution (a few lines in perl, or a hundred lines in C) that'll make the speed roughly similar to that of I/O. Could you elaborate? That's been bugging me all weekend. I know I should know this, but I can't quite put my finger on it. -- Kirk Strauser pgpN5vT0gOSPn.pgp Description: PGP signature
Re: Fast diff command for large files?
On 11/6/05, Kirk Strauser [EMAIL PROTECTED] wrote: On Friday 04 November 2005 02:04 pm, you wrote: Does the overall order of lines change every time you dump the tables? No, although an arbitrary number of lines might get deleted. If it does/can, then there's a trivial solution (a few lines in perl, or a hundred lines in C) that'll make the speed roughly similar to that of I/O. Could you elaborate? That's been bugging me all weekend. I know I should know this, but I can't quite put my finger on it. -- Kirk Strauser while (there are more records) { a = read (line from old file) b = read (line from new file) if (a == b) then next if (a b) { if (a in new_records) { get a out of new_records next } if (b in old_records) { get b out of old_records next } put a in old_records put b in new_records } after that old_records will contain records present in old file, but not in new file, and new_records will contain records present in new file, but not old one. Note, that the difference must be kept in RAM, so it won't work if there are multi-gig diffs, but it will work very fast if the diffs are only 10-100Mb, it will work at close to I/O speed if the diff is under 10Mb. If the records can be ordered in a known order (e.g. alphabetically), we don't need to keep anything in RAM then and make any checks at all. Let's assume an ascending order (1-2-5-7-31-...): while (there are more records) { a = read (line from old file) b = read (line from new file) while (a b) { if (a b) then { write a to old_records read next a } if (a b) then { write b to new_records read next b } } } If course, you've got to add some checks to deal with EOF correctly. Hope this gives you some idea. ___ freebsd-questions@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-questions To unsubscribe, send any mail to [EMAIL PROTECTED]
Re: Fast diff command for large files?
We do the mirroring by running a program that dumps the FoxPro tables out as tab-delimited files. Thus far, we'd been using PostgreSQL's copy from command to read those files into the database. In reality, though, a very, very small percentage of rows in those tables actually change. So, I wrote a program that takes the output of diff and converts it into a series of I think the problem could be considered another way. if you have access to the legacy/FoxPro application, it should be modifed to add a timestamp to each reccord modification. Then you could only dump those reccords that were modified since the last change. That seems to me the only long term viable solution that could sizeup nicely with your set of data. Not to mention that once implemented it would be much much faster. Olivier ___ freebsd-questions@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-questions To unsubscribe, send any mail to [EMAIL PROTECTED]
Re: Fast diff command for large files?
On Fri, 4 Nov 2005, Kirk Strauser wrote: thinking out loud I wonder if rsync could be modified to output its patches rather than silently applying them to a target file. It seems to be pretty good at comparing large files quickly... /thinking More thinking out loud: since these are database dumps, they're order-independent. So, sort the files into a predictable lexical order, then a diff is a linear operation over tonight's file versus last night's. -- jan grant, ILRT, University of Bristol. http://www.ilrt.bris.ac.uk/ Tel +44 (0)117 3317661 http://ioctl.org/jan/ Rereleasing dolphins into the wild since 1998. ___ freebsd-questions@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-questions To unsubscribe, send any mail to [EMAIL PROTECTED]
Fast diff command for large files?
I need to routinely find the diffs between two multigigabyte text files (exporting a set of FoxPro tables to a PostgreSQL database without doing a complete dump/reload each time, in case you were wondering). GNU diff from the base system and from ports chokes. The textproc/2bsd-diff works OK, but is glacially slow. I'm basically looking for something that generates easily-parseable text. Since this is a young project, I don't particularly care if the output format is different from diff's. Any suggestions? -- Kirk Strauser ___ freebsd-questions@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-questions To unsubscribe, send any mail to [EMAIL PROTECTED]
Re: Fast diff command for large files?
Kirk Strauser wrote: I need to routinely find the diffs between two multigigabyte text files (exporting a set of FoxPro tables to a PostgreSQL database without doing a complete dump/reload each time, in case you were wondering). GNU diff from the base system and from ports chokes. The textproc/2bsd-diff works OK, but is glacially slow. Multigigabyte? Find another approach to solving the problem, a text-base diff is going to require excessive resources and time. A 64-bit platform with 2 GB of RAM 3GB of swap requires ~1000 seconds to diff ~400MB. -- -Chuck ___ freebsd-questions@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-questions To unsubscribe, send any mail to [EMAIL PROTECTED]
Re: Fast diff command for large files?
On Friday 04 November 2005 10:22, Chuck Swiger wrote: Multigigabyte? Find another approach to solving the problem, a text-base diff is going to require excessive resources and time. A 64-bit platform with 2 GB of RAM 3GB of swap requires ~1000 seconds to diff ~400MB. There really aren't many options. For the patient, here's what's happening: Our legacy application runs on FoxPro. Our web application runs on a PostgreSQL database that's a mirror of the FoxPro tables. We do the mirroring by running a program that dumps the FoxPro tables out as tab-delimited files. Thus far, we'd been using PostgreSQL's copy from command to read those files into the database. In reality, though, a very, very small percentage of rows in those tables actually change. So, I wrote a program that takes the output of diff and converts it into a series of delete and insert commands; benchmarking shows that this is roughly 300 times faster in our use. And that's why I need a fast diff. Even if it takes as long as the database bulk loads, we can run it on another server and use 20 seconds of CPU for PostgreSQL instead of 45 minutes. The practical upshot is that the database will never get sluggish, even if the other diff server is loaded to the gills. -- Kirk Strauser pgp8crJHkPVTm.pgp Description: PGP signature
Re: Fast diff command for large files?
On Nov 4, 2005, at 12:29 PM, Kirk Strauser wrote: Multigigabyte? Find another approach to solving the problem, a text-base diff is going to require excessive resources and time. A 64-bit platform with 2 GB of RAM 3GB of swap requires ~1000 seconds to diff ~400MB. There really aren't many options. For the patient, here's what's happening: [ ... ] And that's why I need a fast diff. Even if it takes as long as the database bulk loads, we can run it on another server and use 20 seconds of CPU for PostgreSQL instead of 45 minutes. The practical upshot is that the database will never get sluggish, even if the other diff server is loaded to the gills. OK, but even if only one line out of 1000 changes, you still can't make either diff or Colin Percival's bsdiff run on gigabyte sized files and have it fit into MAXDSIZE on 32-bit address space. From the latter's website: bsdiff is quite memory-hungry. It requires max(17*n,9*n+m)+O(1) bytes of memory, where n is the size of the old file and m is the size of the new file. bspatch requires n+m+O(1) bytes. bsdiff runs in O((n+m) log n) time; on a 200MHz Pentium Pro, building a binary patch for a 4MB file takes about 90 seconds. bspatch runs in O(n+m) time; on the same machine, applying that patch takes about two seconds. Some time ago, I wrote a quick test harness for diff here: http://www.pkix.net/~chuck/difftest.py On a 5.4 machine with kern.dfldsiz=1G set in /boot/loader.conf, you can only manage to run diff on files up to about 120 MB in size: 31-pi% ./difftest.py -v INFO: beginning diff trial run with ratio = 100 filea_size=10485760 (aka 10.000 MB) time=1.370 filea_size=10MB diff_size=818KB filea_size=15728640 (aka 15.000 MB) time=2.305 filea_size=15MB diff_size=1229KB filea_size=23592960 (aka 22.500 MB) time=5.443 filea_size=22MB diff_size=1844KB filea_size=35389440 (aka 33.750 MB) time=7.195 filea_size=33MB diff_size=2768KB filea_size=53084160 (aka 50.625 MB) time=16.771 filea_size=50MB diff_size=4163KB filea_size=79626240 (aka 75.938 MB) time=43.525 filea_size=75MB diff_size=6257KB filea_size=119439360 (aka 113.906 MB) time=78.346 filea_size=113MB diff_size=9MB filea_size=179159040 (aka 170.859 MB) diff: memory exhausted NOTICE: diff exitted with errno 2 time=36.896 filea_size=170MB diff_size=0KB 272.58s real 154.73s user 13.23s system 61% On a 64-bit SPARC box mentioned above, you can get sizes up to ~400 MB: [ ... ] filea_size=119439360 (aka 113.906 MB) time=140.650 filea_size=115MB diff_size=9MB filea_size=179159040 (aka 170.859 MB) time=424.586 filea_size=172MB diff_size=15MB filea_size=268738560 (aka 256.289 MB) time=546.334 filea_size=258MB diff_size=22MB filea_size=403107840 (aka 384.434 MB) time=957.059 filea_size=388MB diff_size=33MB filea_size=604661760 (aka 576.650 MB) diff: memory exhausted NOTICE: diff exitted with errno 2 time=105.728 filea_size=582MB diff_size=0KB 5610.90s real 3268.63s user 1761.90s system 89% Roughly, you need about an order of magnitude more RAM or virtual memory available then the size of the files you are trying to diff, even if the files are very similar. -- -Chuck ___ freebsd-questions@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-questions To unsubscribe, send any mail to [EMAIL PROTECTED]
Re: Fast diff command for large files?
On Friday 04 November 2005 13:39, Charles Swiger wrote: OK, but even if only one line out of 1000 changes, you still can't make either diff or Colin Percival's bsdiff run on gigabyte sized files and have it fit into MAXDSIZE on 32-bit address space. For the record, textproc/2bsd-diff works fine - it's just slow. thinking out loud I wonder if rsync could be modified to output its patches rather than silently applying them to a target file. It seems to be pretty good at comparing large files quickly... /thinking -- Kirk Strauser pgpdXELVcLCLb.pgp Description: PGP signature
Re: Fast diff command for large files?
On 11/4/05, Kirk Strauser [EMAIL PROTECTED] wrote: On Friday 04 November 2005 10:22, Chuck Swiger wrote: Multigigabyte? Find another approach to solving the problem, a text-base diff is going to require excessive resources and time. A 64-bit platform with 2 GB of RAM 3GB of swap requires ~1000 seconds to diff ~400MB. There really aren't many options. For the patient, here's what's happening: Our legacy application runs on FoxPro. Our web application runs on a PostgreSQL database that's a mirror of the FoxPro tables. We do the mirroring by running a program that dumps the FoxPro tables out as tab-delimited files. Thus far, we'd been using PostgreSQL's copy from command to read those files into the database. In reality, though, a very, very small percentage of rows in those tables actually change. So, I wrote a program that takes the output of diff and converts it into a series of delete and insert commands; benchmarking shows that this is roughly 300 times faster in our use. And that's why I need a fast diff. Even if it takes as long as the database bulk loads, we can run it on another server and use 20 seconds of CPU for PostgreSQL instead of 45 minutes. The practical upshot is that the database will never get sluggish, even if the other diff server is loaded to the gills. -- Kirk Strauser Does the overall order of lines change every time you dump the tables? If not, is there any inexpensive way to sort them (not alphabetically, but just that the order stays the same)? If it does/can, then there's a trivial solution (a few lines in perl, or a hundred lines in C) that'll make the speed roughly similar to that of I/O. ___ freebsd-questions@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-questions To unsubscribe, send any mail to [EMAIL PROTECTED]