On Fri, 24 Aug 2018 10:05:14 +0200, Boris FELD wrote:
> On 23/08/2018 14:48, Yuya Nishihara wrote:
> > On Wed, 22 Aug 2018 21:35:31 +0900, Yuya Nishihara wrote:
> >> On Tue, 21 Aug 2018 17:11:51 +0200, Denis Laxalde wrote:
> >>> Yuya Nishihara wrote:
> >>>> On Tue, 21 Aug 2018 14:10:33 +0200, Denis Laxalde wrote:
> >>>>> # HG changeset patch
> >>>>> # User Denis Laxalde <denis.laxa...@logilab.fr>
> >>>>> # Date 1534853203 -7200
> >>>>> #      Tue Aug 21 14:06:43 2018 +0200
> >>>>> # Node ID c43df6ff42d26163d19e99e15a3cf3094020d822
> >>>>> # Parent  c62184c6299c09d2e8e7be340f9aee138229cb86
> >>>>> # Available At http://hg.logilab.org/users/dlaxalde/hg
> >>>>> #              hg pull http://hg.logilab.org/users/dlaxalde/hg -r c43df6ff42d2
> >>>>> # EXP-Topic issue5965
> >>>>> diff: use a threshold on similarity index before using word-diff (issue5965)
> >>>>>
> >>>>> The threshold is chosen quite arbitrarily with a value of 0.5. It does
> >>>>> not change the results of test-diff-color.t whereas higher values (e.g.
> >>>>> 0.6) would. Looking at what this produces on some changesets in recent
> >>>>> history (e.g. 037debbf869c or 7acec9408e1c), this significantly improves
> >>>>> diff readability.
> >>>>>
> >>>>> Similarity index is computed using difflib.SequenceMatcher's ratio()
> >>>>> method; this is documented as being "expensive", but other faster
> >>>>> methods (that compute an upper bound value) do not give good results.
> >>>>> Nevertheless, since we compute this ratio on each hunk which are usually
> >>>>> small, this might not be problematic in most cases. Also, as we'd
> >>>>> short-circuit computation of inline colors for those hunks that are not
> >>>>> similar enough, this "expensive" ratio computation might also be
> >>>>> compensated.
> >>>> Can you test this against a large BLOB-ish diff (such as
> >>>> machine-generated 10k-line JSON, a binary in Intel HEX format, etc.)?
> >>>> Last time I faced that, the original difflib-based algorithm was
> >>>> painfully slow (~100s-ish to yield one hunk), which made me think the
> >>>> word-diff should never be turned on by default.
> >>> I've set up a test repo with some JSON at
> >>> https://bitbucket.org/dlax/hg-worddiff-tests. As far as I can tell,
> >>> there's no significant difference when diffing the last changeset;
> >> Thanks, but it looks cheaper to compute than the stuff I had at work. I'll
> >> try to collect some number if I get a chance.
> >   $ hg diff -c REV --color=always --config diff.word-diff=true --time > /dev/null
> >   (orig) 1.250sec
> >   (new)  1259.490sec
> >
> > It's an ASCII-fied FPGA image (called tabular text file), containing ~320k
> > decimal numbers plus commas (so ~1000k words in our word-diff.) And there
> > are some large hunks as it is a diff of two similar BLOBs split into chunks
> > per N bytes.
> These files seem to have interesting characteristics that would be
> useful for performance testing.
>
> Would it be possible to get examples or redacted versions of such files?

(CC the list again)

It's a closed binary, but you can see how slow difflib is by using random
data.

  $ dd if=/dev/urandom bs=1k count=100 | hexdump -v -e '16/1 "%3u," "\n"'

IIRC, the computation cost of difflib is more sensitive to input data than
that of Mercurial's bdiff.
_______________________________________________
Mercurial-devel mailing list
Mercurial-devel@mercurial-scm.org
https://www.mercurial-scm.org/mailman/listinfo/mercurial-devel
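
For anyone who wants to reproduce the random-data measurement without the
closed file, here is a rough Python 3 sketch. It renders random bytes as
comma-terminated decimals, 16 per line like the hexdump format above, and times
difflib.SequenceMatcher.ratio() on the resulting tokens. The tokenizer, the
helper names and the ">=" comparison against the 0.5 threshold are assumptions
made for illustration; they are not taken from the patch or from Mercurial's
code.

```python
import difflib
import os
import re
import time

# Assumed tokenizer: a run of digits is one word, every other character
# stands alone. Mercurial's real word splitting may differ.
_TOKEN_RE = re.compile(r'\d+|\D')


def format_decimal_lines(data):
    """Render bytes as lines of up to 16 decimals, like the hexdump format."""
    return [''.join('%3d,' % b for b in data[i:i + 16])
            for i in range(0, len(data), 16)]


def tokenize(text):
    """Split text into number tokens and single non-digit characters."""
    return _TOKEN_RE.findall(text)


def similarity(old_text, new_text):
    """Similarity index in [0, 1] from difflib.SequenceMatcher.ratio()."""
    matcher = difflib.SequenceMatcher(
        None, tokenize(old_text), tokenize(new_text))
    return matcher.ratio()


def should_word_diff(old_text, new_text, threshold=0.5):
    # 0.5 is the value from the patch description; ">=" is an assumption.
    return similarity(old_text, new_text) >= threshold


if __name__ == '__main__':
    size = 2 * 1024  # kept small on purpose; raise it to see the slowdown
    old = '\n'.join(format_decimal_lines(os.urandom(size)))
    new = '\n'.join(format_decimal_lines(os.urandom(size)))
    start = time.perf_counter()
    ratio = similarity(old, new)
    elapsed = time.perf_counter() - start
    print('tokens: %d, ratio: %.3f, time: %.3fs'
          % (len(tokenize(old)), ratio, elapsed))
```

Raising `size` toward the 100 KiB used in the dd command should show how the
run time grows with the input; no timings are quoted here since they depend on
the machine and on the data.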