Re: [PATCH 4/4] fast-import: only store commit objects
On Tue, May 7, 2013 at 1:47 AM, Michael Haggerty mhag...@alum.mit.edu wrote: On 05/07/2013 06:47 AM, Felipe Contreras wrote: On Mon, May 6, 2013 at 10:27 PM, Michael Haggerty mhag...@alum.mit.edu wrote: You conjectured earlier that nobody uses blob marks, and I provided a counterexample. Then you proposed a workaround that would require changes to the cvs2git documentation, and I even explained how your proposed workaround is not as flexible as the status quo. cvs2git does *not* need blob marks, it does not need marks at all. The use-case that you mentioned has nothing to do with cvs2git, in fact. I can be described as this: % ./generate-blobs blobs % git fast-import --export-marks=marks blobs % ./generate-commits commits % git fast-import --import-marks=marks commits In this example 'generate-commits' has no notion of marks at all, and 'git fast-import' doesn't need marks to process both blobs and commits. The generate-blobs program generates a mark for each blob and remembers the marks in a database. The generate-commits program reads the marks from the database and incorporates them in the definitions of the commits that it writes to its output. So yes, the program pair *does* rely on marks for blobs being exported correctly. Fine: % ./generate-data data % ./split-blobs data blobs % ./split-commits data commits How exactly the files 'blobs' and 'commits' are generated is irrelevant, 'git fast-import' will need them *once*, and never again, in fact, doesn't even need to know they are two files. There's no need for marks. I've tired of this discussion. I am quite sure that your change will not be accepted, so I see no need to participate further. Please do not interpret my silence as agreement with your quarrelsome arguments. How convenient. I ask the though questions, and you suddenly get tired of the discussion. So, let me answer for you: * Which the *vastly* more common case? That blobs are needed, or that they are not? That they are not. To suggest otherwise would be ridiculous. And of course this patch will not be accepted, because nobody is interested in improving things in the most easy and sensible way. Yours and everybody else's argument is one and the same: status quo, which is of course, not an argument at all. But it's pointless to try to convince you, because a) you are not interested in changing your mind b) you are not interested in seeking the most sensible solution c) you are only interested in doing what you are used to do, which is to not change anything, ever d) you know making this the default is only slightly risky, at best, and not only the world is not going to end, but most likely nobody would even notice, and the ones that would, are probably already participating in this conversation, but you don't even want to do something slightly risky, because it would show that others were right, and your concerns don't actually affect any real users at all. But as I said, drop the patch. Who needs progress? Cheers. -- Felipe Contreras -- To unsubscribe from this list: send the line unsubscribe git in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 4/4] fast-import: only store commit objects
Michael Haggerty mhag...@alum.mit.edu writes: CVS stores all of the revisions of a single file in a single filename,v file in rcsfile(5) format. The revisions are stored as deltas ordered so that a single revision can be reconstructed from a single serial read of the file. cvs2git reads each of these files once, reconstructing *all* of the revisions for a file in a single go. It then pours them into a git-fast-import stream as blobs and sets a mark on each blob. This is more or less off-topic but in the bigger picture it is more interesting and important X-. The way you describe how cvs2git handles the blobs is the more efficient way, given that fast-import does not even attempt to bother to create good deltas. The only thing it does is to see if the current data deltifies against the last object. IIRC, CVS's backend storage is mostly recorded in backward delta, so if you are feeding the blob data from new to old, then the resulting pack would follow Linus's law (the file generally grows over time) and would generally give you a good deltified chain of objects. -- To unsubscribe from this list: send the line unsubscribe git in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 4/4] fast-import: only store commit objects
On 05/07/2013 09:12 AM, Junio C Hamano wrote: Michael Haggerty mhag...@alum.mit.edu writes: CVS stores all of the revisions of a single file in a single filename,v file in rcsfile(5) format. The revisions are stored as deltas ordered so that a single revision can be reconstructed from a single serial read of the file. cvs2git reads each of these files once, reconstructing *all* of the revisions for a file in a single go. It then pours them into a git-fast-import stream as blobs and sets a mark on each blob. This is more or less off-topic but in the bigger picture it is more interesting and important X-. The way you describe how cvs2git handles the blobs is the more efficient way, given that fast-import does not even attempt to bother to create good deltas. The only thing it does is to see if the current data deltifies against the last object. IIRC, CVS's backend storage is mostly recorded in backward delta, so if you are feeding the blob data from new to old, then the resulting pack would follow Linus's law (the file generally grows over time) and would generally give you a good deltified chain of objects. Yes, you are correct about how CVS orders commits on the mainline. Branches are stored in the opposite order--oldest to newest--but CVS users don't tend to get carried away with branches anyway, and if the changes are small deltafication should help a lot anyway. Cool. I knew that fast-import didn't do much in the way of compression, but I didn't realize that it can compute deltas only between adjacent blobs. So cvs2git happily hits the sweet-spot of fast-import. Michael -- Michael Haggerty mhag...@alum.mit.edu http://softwareswirl.blogspot.com/ -- To unsubscribe from this list: send the line unsubscribe git in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 4/4] fast-import: only store commit objects
On 05/03/2013 08:23 PM, Felipe Contreras wrote: On Fri, May 3, 2013 at 12:56 PM, Thomas Rast tr...@inf.ethz.ch wrote: Felipe Contreras felipe.contre...@gmail.com writes: How do we know that this doesn't break any users of fast-import? Your comment isn't very reassuring: the vast majority of them will never be used again So what's with the minority? Actually I don't think there's any minority. If the client program doesn't store blobs, the blob marks are not used anyway. So there's no change. I haven't been following this conversation in detail, but your proposed change sounds like something that would break cvs2git [1]. Let me explain what cvs2git does and why: CVS stores all of the revisions of a single file in a single filename,v file in rcsfile(5) format. The revisions are stored as deltas ordered so that a single revision can be reconstructed from a single serial read of the file. cvs2git reads each of these files once, reconstructing *all* of the revisions for a file in a single go. It then pours them into a git-fast-import stream as blobs and sets a mark on each blob. Only much later in the conversion does it have enough information to reconstruct tree-wide commits. At that time it outputs git-fast-import data (to a second file) defining the git commits and their ancestry. The contents are defined by referring to the marks of blobs from the first git-fast-import stream file. This strategy speeds up the conversion *enormously*. So if I understand correctly that you are proposing to stop allowing marks on blob objects to be set and/or referred to later, then I object vociferously. If I've misunderstood then I'll go back into my hole :-) Michael [1] http://cvs2svn.tigris.org/cvs2git.html -- Michael Haggerty mhag...@alum.mit.edu http://softwareswirl.blogspot.com/ -- To unsubscribe from this list: send the line unsubscribe git in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 4/4] fast-import: only store commit objects
Michael Haggerty mhag...@alum.mit.edu writes: On 05/03/2013 08:23 PM, Felipe Contreras wrote: On Fri, May 3, 2013 at 12:56 PM, Thomas Rast tr...@inf.ethz.ch wrote: Felipe Contreras felipe.contre...@gmail.com writes: How do we know that this doesn't break any users of fast-import? Your comment isn't very reassuring: the vast majority of them will never be used again So what's with the minority? Actually I don't think there's any minority. If the client program doesn't store blobs, the blob marks are not used anyway. So there's no change. I haven't been following this conversation in detail, but your proposed change sounds like something that would break cvs2git [1]. Let me explain what cvs2git does and why: CVS stores all of the revisions of a single file in a single filename,v file in rcsfile(5) format. The revisions are stored as deltas ordered so that a single revision can be reconstructed from a single serial read of the file. cvs2git reads each of these files once, reconstructing *all* of the revisions for a file in a single go. It then pours them into a git-fast-import stream as blobs and sets a mark on each blob. Only much later in the conversion does it have enough information to reconstruct tree-wide commits. At that time it outputs git-fast-import data (to a second file) defining the git commits and their ancestry. The contents are defined by referring to the marks of blobs from the first git-fast-import stream file. This strategy speeds up the conversion *enormously*. So if I understand correctly that you are proposing to stop allowing marks on blob objects to be set and/or referred to later, then I object vociferously. The proposed patch wants to stop writing marks (in --export-marks) for anything but commits. Does cvs2git depend on that? I.e., are you using two separate fast-import processes for the blob and tree/commit phases you describe above? -- Thomas Rast trast@{inf,student}.ethz.ch -- To unsubscribe from this list: send the line unsubscribe git in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 4/4] fast-import: only store commit objects
On 05/06/2013 12:32 PM, Thomas Rast wrote: Michael Haggerty mhag...@alum.mit.edu writes: On 05/03/2013 08:23 PM, Felipe Contreras wrote: On Fri, May 3, 2013 at 12:56 PM, Thomas Rast tr...@inf.ethz.ch wrote: Felipe Contreras felipe.contre...@gmail.com writes: How do we know that this doesn't break any users of fast-import? Your comment isn't very reassuring: the vast majority of them will never be used again So what's with the minority? Actually I don't think there's any minority. If the client program doesn't store blobs, the blob marks are not used anyway. So there's no change. I haven't been following this conversation in detail, but your proposed change sounds like something that would break cvs2git [1]. Let me explain what cvs2git does and why: CVS stores all of the revisions of a single file in a single filename,v file in rcsfile(5) format. The revisions are stored as deltas ordered so that a single revision can be reconstructed from a single serial read of the file. cvs2git reads each of these files once, reconstructing *all* of the revisions for a file in a single go. It then pours them into a git-fast-import stream as blobs and sets a mark on each blob. Only much later in the conversion does it have enough information to reconstruct tree-wide commits. At that time it outputs git-fast-import data (to a second file) defining the git commits and their ancestry. The contents are defined by referring to the marks of blobs from the first git-fast-import stream file. This strategy speeds up the conversion *enormously*. So if I understand correctly that you are proposing to stop allowing marks on blob objects to be set and/or referred to later, then I object vociferously. The proposed patch wants to stop writing marks (in --export-marks) for anything but commits. Does cvs2git depend on that? I.e., are you using two separate fast-import processes for the blob and tree/commit phases you describe above? Yes, it can be handy to start loading the first blobfile in parallel with the later stages of the conversion, before the second dumpfile is ready. In that case the user needs to pass --export-marks to the first fast-import process to export marks on blobs so that the marks can be passed to the second fast-import via --import-marks. So the proposed change would break a documented use of cvs2git. Making the export of blob marks optional would of course be OK, as long as the default is to export them. Michael -- Michael Haggerty mhag...@alum.mit.edu http://softwareswirl.blogspot.com/ -- To unsubscribe from this list: send the line unsubscribe git in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 4/4] fast-import: only store commit objects
Hi Thomas, On Mon, 6 May 2013, Thomas Rast wrote: The proposed patch wants to stop writing marks (in --export-marks) for anything but commits. Then it should not go in. Ciao, Dscho -- To unsubscribe from this list: send the line unsubscribe git in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 4/4] fast-import: only store commit objects
Michael Haggerty mhag...@alum.mit.edu writes: Yes, it can be handy to start loading the first blobfile in parallel with the later stages of the conversion, before the second dumpfile is ready. In that case the user needs to pass --export-marks to the first fast-import process to export marks on blobs so that the marks can be passed to the second fast-import via --import-marks. So the proposed change would break a documented use of cvs2git. Making the export of blob marks optional would of course be OK, as long as the default is to export them. Thanks for a concise summary. Your use case fits exactly what Felipe conjectured as the nonexistent minority. An option that lets the caller say I only care about marks on these types of objects to be written to (and read from) the exported marks file would help Felipe's use case without harming your use case, and would be a sane and safe way to go. -- To unsubscribe from this list: send the line unsubscribe git in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 4/4] fast-import: only store commit objects
On Mon, May 6, 2013 at 5:45 AM, Michael Haggerty mhag...@alum.mit.edu wrote: On 05/06/2013 12:32 PM, Thomas Rast wrote: Michael Haggerty mhag...@alum.mit.edu writes: On 05/03/2013 08:23 PM, Felipe Contreras wrote: On Fri, May 3, 2013 at 12:56 PM, Thomas Rast tr...@inf.ethz.ch wrote: Felipe Contreras felipe.contre...@gmail.com writes: How do we know that this doesn't break any users of fast-import? Your comment isn't very reassuring: the vast majority of them will never be used again So what's with the minority? Actually I don't think there's any minority. If the client program doesn't store blobs, the blob marks are not used anyway. So there's no change. I haven't been following this conversation in detail, but your proposed change sounds like something that would break cvs2git [1]. Let me explain what cvs2git does and why: CVS stores all of the revisions of a single file in a single filename,v file in rcsfile(5) format. The revisions are stored as deltas ordered so that a single revision can be reconstructed from a single serial read of the file. cvs2git reads each of these files once, reconstructing *all* of the revisions for a file in a single go. It then pours them into a git-fast-import stream as blobs and sets a mark on each blob. Only much later in the conversion does it have enough information to reconstruct tree-wide commits. At that time it outputs git-fast-import data (to a second file) defining the git commits and their ancestry. The contents are defined by referring to the marks of blobs from the first git-fast-import stream file. This strategy speeds up the conversion *enormously*. So if I understand correctly that you are proposing to stop allowing marks on blob objects to be set and/or referred to later, then I object vociferously. The proposed patch wants to stop writing marks (in --export-marks) for anything but commits. Does cvs2git depend on that? I.e., are you using two separate fast-import processes for the blob and tree/commit phases you describe above? Yes, it can be handy to start loading the first blobfile in parallel with the later stages of the conversion, before the second dumpfile is ready. In that case the user needs to pass --export-marks to the first fast-import process to export marks on blobs so that the marks can be passed to the second fast-import via --import-marks. It can be used that way, but it doesn't have to be. So the proposed change would break a documented use of cvs2git. It's documented as an alternative. How many people actually use this form over the other? Is there really any advantage? It's possibly that basically nobody is using this form, and there's no benefits. Making the export of blob marks optional would of course be OK, as long as the default is to export them. Nobody benefits from leaving the default as it is. -- Felipe Contreras -- To unsubscribe from this list: send the line unsubscribe git in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 4/4] fast-import: only store commit objects
On Mon, May 6, 2013 at 7:20 AM, Johannes Schindelin johannes.schinde...@gmx.de wrote: Hi Thomas, On Mon, 6 May 2013, Thomas Rast wrote: The proposed patch wants to stop writing marks (in --export-marks) for anything but commits. Then it should not go in. If that rationale was valid, no change in behavior should ever go in. Of course, that is not the case, we want to move forward, so changes in behavior do happen. -- Felipe Contreras -- To unsubscribe from this list: send the line unsubscribe git in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 4/4] fast-import: only store commit objects
On Mon, May 6, 2013 at 10:18 AM, Junio C Hamano gits...@pobox.com wrote: Michael Haggerty mhag...@alum.mit.edu writes: Yes, it can be handy to start loading the first blobfile in parallel with the later stages of the conversion, before the second dumpfile is ready. In that case the user needs to pass --export-marks to the first fast-import process to export marks on blobs so that the marks can be passed to the second fast-import via --import-marks. So the proposed change would break a documented use of cvs2git. Making the export of blob marks optional would of course be OK, as long as the default is to export them. Thanks for a concise summary. Your use case fits exactly what Felipe conjectured as the nonexistent minority. Not true. cvs2git does *not* rely on the blobs being stored in a marks file, because cvs2git does not rely on mark files at all. An option that lets the caller say I only care about marks on these types of objects to be written to (and read from) the exported marks file would help Felipe's use case without harming your use case, and would be a sane and safe way to go. His case is not harmed at all. It's only the unfortunate command that is mentioned in the documentation that didn't need to be mentioned at all in the first place. It should be the other way around, if it's only this documentation that is affected, we could add a switch for that particular command, and the documentation should be updated, but it's overkill to add a switch for one odd command in some documentation somewhere, it would be much better to update the odd command to avoid using marks at all, which is what the more appropriate command does, right below in the same documentation. cat ../cvs2svn-tmp/git-blob.dat ../cvs2svn-tmp/git-dump.dat | git fast-import Should the rest of the real world be punished because somebody added a command in some documentation somewhere, which wasn't actually needed in the first place? -- Felipe Contreras -- To unsubscribe from this list: send the line unsubscribe git in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 4/4] fast-import: only store commit objects
On Mon, May 6, 2013 at 4:19 PM, Felipe Contreras felipe.contre...@gmail.com wrote: On Mon, May 6, 2013 at 10:18 AM, Junio C Hamano gits...@pobox.com wrote: Michael Haggerty mhag...@alum.mit.edu writes: Yes, it can be handy to start loading the first blobfile in parallel with the later stages of the conversion, before the second dumpfile is ready. In that case the user needs to pass --export-marks to the first fast-import process to export marks on blobs so that the marks can be passed to the second fast-import via --import-marks. So the proposed change would break a documented use of cvs2git. Making the export of blob marks optional would of course be OK, as long as the default is to export them. Thanks for a concise summary. Your use case fits exactly what Felipe conjectured as the nonexistent minority. Not true. cvs2git does *not* rely on the blobs being stored in a marks file, because cvs2git does not rely on mark files at all. An option that lets the caller say I only care about marks on these types of objects to be written to (and read from) the exported marks file would help Felipe's use case without harming your use case, and would be a sane and safe way to go. His case is not harmed at all. It's only the unfortunate command that is mentioned in the documentation that didn't need to be mentioned at all in the first place. It should be the other way around, if it's only this documentation that is affected, we could add a switch for that particular command, and the documentation should be updated, but it's overkill to add a switch for one odd command in some documentation somewhere, it would be much better to update the odd command to avoid using marks at all, which is what the more appropriate command does, right below in the same documentation. This would simplify the documentation, and obliterate the need to use mark files at all: diff -ur cvs2svn-2.4.0/www/cvs2git.html cvs2svn-2.4.0-mod/www/cvs2git.html --- cvs2svn-2.4.0/www/cvs2git.html 2012-09-22 01:49:55.0 -0500 +++ cvs2svn-2.4.0-mod/www/cvs2git.html 2013-05-06 16:33:12.070189985 -0500 @@ -355,14 +355,13 @@ fast-import/tt:/p pre -git fast-import --export-marks=../cvs2svn-tmp/git-marks.dat lt; ../cvs2svn-tmp/git-blob.dat -git fast-import --import-marks=../cvs2svn-tmp/git-marks.dat lt; ../cvs2svn-tmp/git-dump.dat +cat ../cvs2svn-tmp/git-blob.dat ../cvs2svn-tmp/git-dump.dat | git fast-import /pre -pOn Linux/Unix this can be shortened to:/p +pOn Windows you should use type instead:/p pre -cat ../cvs2svn-tmp/git-blob.dat ../cvs2svn-tmp/git-dump.dat | git fast-import +type ../cvs2svn-tmp/git-blob.dat ../cvs2svn-tmp/git-dump.dat | git fast-import /pre /li Only in cvs2svn-2.4.0-mod/www: .cvs2git.html.swp -- Felipe Contreras -- To unsubscribe from this list: send the line unsubscribe git in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 4/4] fast-import: only store commit objects
On 05/06/2013 11:19 PM, Felipe Contreras wrote: On Mon, May 6, 2013 at 10:18 AM, Junio C Hamano gits...@pobox.com wrote: Michael Haggerty mhag...@alum.mit.edu writes: Yes, it can be handy to start loading the first blobfile in parallel with the later stages of the conversion, before the second dumpfile is ready. In that case the user needs to pass --export-marks to the first fast-import process to export marks on blobs so that the marks can be passed to the second fast-import via --import-marks. So the proposed change would break a documented use of cvs2git. Making the export of blob marks optional would of course be OK, as long as the default is to export them. Thanks for a concise summary. Your use case fits exactly what Felipe conjectured as the nonexistent minority. Not true. cvs2git does *not* rely on the blobs being stored in a marks file, because cvs2git does not rely on mark files at all. An option that lets the caller say I only care about marks on these types of objects to be written to (and read from) the exported marks file would help Felipe's use case without harming your use case, and would be a sane and safe way to go. His case is not harmed at all. It's only the unfortunate command that is mentioned in the documentation that didn't need to be mentioned at all in the first place. It should be the other way around, if it's only this documentation that is affected, we could add a switch for that particular command, and the documentation should be updated, but it's overkill to add a switch for one odd command in some documentation somewhere, it would be much better to update the odd command to avoid using marks at all, which is what the more appropriate command does, right below in the same documentation. cat ../cvs2svn-tmp/git-blob.dat ../cvs2svn-tmp/git-dump.dat | git fast-import Should the rest of the real world be punished because somebody added a command in some documentation somewhere, which wasn't actually needed in the first place? Don't get too fixated on the documentation. The documentation just gives some examples of how cvs2git can be used. The reason that cvs2git outputs two files is that the first file is emitted at the very beginning of the conversion and the second at the very end. These conversions can take a long time ( 1 day for very big repos), can be interrupted and restarted between passes, and passes can even be re-run with changed configurations. CVS write access has to be turned off before the start of the final conversion, so no VCS is possible until the conversion is over. So users are very interested in keeping the downtime minimal. The blobfile can also be unwieldy (its size is approximately the sum of the sizes of all revisions of all files in the project). Being able to load the blobfile into one fast-import process and the dumpfile into a different process (which relies on the feature that you propose removing) opens up a lot of possibilities: * The first fast-import of the blobfile can be started as soon as the blobfile is complete and run in parallel with the rest of the conversion. * If the blobfile needs to be transferred over the network (e.g., because Git will be served from a different server than the one doing the conversion) the network transfer can also be done in parallel with the rest of the conversion. * The blobfile could be written to a named pipe that is being read by a git-fast-import process, to avoid having to write the blobfile to disk in the first place. * The user could run git repack between loading the blobfile and loading the dumpfile. These are just the ways that cvs2git does and/or could benefit from the flexibility that is now in git-fast-import. Other tools might also be using git-fast-import in ways that would be broken by your proposed change. Michael -- Michael Haggerty mhag...@alum.mit.edu http://softwareswirl.blogspot.com/ -- To unsubscribe from this list: send the line unsubscribe git in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 4/4] fast-import: only store commit objects
On 05/06/2013 11:36 PM, Felipe Contreras wrote: This would simplify the documentation, and obliterate the need to use mark files at all: As explained in my other email, this documentation change does not remove all of the reasons that users might want to use mark files. I would still like to show users how they can load the files into Git as two separate steps. diff -ur cvs2svn-2.4.0/www/cvs2git.html cvs2svn-2.4.0-mod/www/cvs2git.html --- cvs2svn-2.4.0/www/cvs2git.html2012-09-22 01:49:55.0 -0500 +++ cvs2svn-2.4.0-mod/www/cvs2git.html2013-05-06 16:33:12.070189985 -0500 @@ -355,14 +355,13 @@ fast-import/tt:/p pre -git fast-import --export-marks=../cvs2svn-tmp/git-marks.dat lt; ../cvs2svn-tmp/git-blob.dat -git fast-import --import-marks=../cvs2svn-tmp/git-marks.dat lt; ../cvs2svn-tmp/git-dump.dat +cat ../cvs2svn-tmp/git-blob.dat ../cvs2svn-tmp/git-dump.dat | git fast-import /pre -pOn Linux/Unix this can be shortened to:/p +pOn Windows you should use type instead:/p pre -cat ../cvs2svn-tmp/git-blob.dat ../cvs2svn-tmp/git-dump.dat | git fast-import +type ../cvs2svn-tmp/git-blob.dat ../cvs2svn-tmp/git-dump.dat | git fast-import /pre /li Only in cvs2svn-2.4.0-mod/www: .cvs2git.html.swp Nevertheless, it *would* be nice to add the Windows equivalent of the cat pipeline. I knew about the type command but I was under the impression that it is intended for text files and can corrupt binary files. Are you sure that using type as you suggest is binary-clean? Michael -- Michael Haggerty mhag...@alum.mit.edu http://softwareswirl.blogspot.com/ -- To unsubscribe from this list: send the line unsubscribe git in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 4/4] fast-import: only store commit objects
On 05/06/2013 11:04 PM, Felipe Contreras wrote: On Mon, May 6, 2013 at 5:45 AM, Michael Haggerty mhag...@alum.mit.edu wrote: On 05/06/2013 12:32 PM, Thomas Rast wrote: Michael Haggerty mhag...@alum.mit.edu writes: On 05/03/2013 08:23 PM, Felipe Contreras wrote: On Fri, May 3, 2013 at 12:56 PM, Thomas Rast tr...@inf.ethz.ch wrote: Felipe Contreras felipe.contre...@gmail.com writes: How do we know that this doesn't break any users of fast-import? Your comment isn't very reassuring: the vast majority of them will never be used again So what's with the minority? Actually I don't think there's any minority. If the client program doesn't store blobs, the blob marks are not used anyway. So there's no change. I haven't been following this conversation in detail, but your proposed change sounds like something that would break cvs2git [1]. Let me explain what cvs2git does and why: CVS stores all of the revisions of a single file in a single filename,v file in rcsfile(5) format. The revisions are stored as deltas ordered so that a single revision can be reconstructed from a single serial read of the file. cvs2git reads each of these files once, reconstructing *all* of the revisions for a file in a single go. It then pours them into a git-fast-import stream as blobs and sets a mark on each blob. Only much later in the conversion does it have enough information to reconstruct tree-wide commits. At that time it outputs git-fast-import data (to a second file) defining the git commits and their ancestry. The contents are defined by referring to the marks of blobs from the first git-fast-import stream file. This strategy speeds up the conversion *enormously*. So if I understand correctly that you are proposing to stop allowing marks on blob objects to be set and/or referred to later, then I object vociferously. The proposed patch wants to stop writing marks (in --export-marks) for anything but commits. Does cvs2git depend on that? I.e., are you using two separate fast-import processes for the blob and tree/commit phases you describe above? Yes, it can be handy to start loading the first blobfile in parallel with the later stages of the conversion, before the second dumpfile is ready. In that case the user needs to pass --export-marks to the first fast-import process to export marks on blobs so that the marks can be passed to the second fast-import via --import-marks. It can be used that way, but it doesn't have to be. Please see my other email for an explanation of how the flexibility of loading the two files separately can bring concrete benefits. So the proposed change would break a documented use of cvs2git. It's documented as an alternative. How many people actually use this form over the other? Is there really any advantage? It's possibly that basically nobody is using this form, and there's no benefits. You conjectured earlier that nobody uses blob marks, and I provided a counterexample. Then you proposed a workaround that would require changes to the cvs2git documentation, and I even explained how your proposed workaround is not as flexible as the status quo. Do you want to go through the same argument with every possible user of git-fast-import? It would be insanity to change the default behavior when a workaround on the Git side (namely adding an option that tells git-fast-import *not* to emit blob marks) would be quite straightforward to implement and have little maintenance cost. Making the export of blob marks optional would of course be OK, as long as the default is to export them. Nobody benefits from leaving the default as it is. Sure they do. Any tool that *knows* that it doesn't need blob marks can pass the new option and benefit. Other tools benefit from not being broken by your change. Michael -- Michael Haggerty mhag...@alum.mit.edu http://softwareswirl.blogspot.com/ -- To unsubscribe from this list: send the line unsubscribe git in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 4/4] fast-import: only store commit objects
On Mon, May 6, 2013 at 11:32 PM, Johannes Schindelin johannes.schinde...@gmx.de wrote: Hi Michael, On Tue, 7 May 2013, Michael Haggerty wrote: I knew about the type command but I was under the impression that it is intended for text files and can corrupt binary files. Are you sure that using type as you suggest is binary-clean? type is not binary-clean. At least on some Windows versions, type also has a limit on file size. copy /b file1+file2 destfile Then. -- Felipe Contreras -- To unsubscribe from this list: send the line unsubscribe git in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 4/4] fast-import: only store commit objects
On Mon, May 6, 2013 at 9:58 PM, Michael Haggerty mhag...@alum.mit.edu wrote: On 05/06/2013 11:19 PM, Felipe Contreras wrote: On Mon, May 6, 2013 at 10:18 AM, Junio C Hamano gits...@pobox.com wrote: Michael Haggerty mhag...@alum.mit.edu writes: Yes, it can be handy to start loading the first blobfile in parallel with the later stages of the conversion, before the second dumpfile is ready. In that case the user needs to pass --export-marks to the first fast-import process to export marks on blobs so that the marks can be passed to the second fast-import via --import-marks. So the proposed change would break a documented use of cvs2git. Making the export of blob marks optional would of course be OK, as long as the default is to export them. Thanks for a concise summary. Your use case fits exactly what Felipe conjectured as the nonexistent minority. Not true. cvs2git does *not* rely on the blobs being stored in a marks file, because cvs2git does not rely on mark files at all. An option that lets the caller say I only care about marks on these types of objects to be written to (and read from) the exported marks file would help Felipe's use case without harming your use case, and would be a sane and safe way to go. His case is not harmed at all. It's only the unfortunate command that is mentioned in the documentation that didn't need to be mentioned at all in the first place. It should be the other way around, if it's only this documentation that is affected, we could add a switch for that particular command, and the documentation should be updated, but it's overkill to add a switch for one odd command in some documentation somewhere, it would be much better to update the odd command to avoid using marks at all, which is what the more appropriate command does, right below in the same documentation. cat ../cvs2svn-tmp/git-blob.dat ../cvs2svn-tmp/git-dump.dat | git fast-import Should the rest of the real world be punished because somebody added a command in some documentation somewhere, which wasn't actually needed in the first place? Don't get too fixated on the documentation. The documentation just gives some examples of how cvs2git can be used. The reason that cvs2git outputs two files is that the first file is emitted at the very beginning of the conversion and the second at the very end. These conversions can take a long time ( 1 day for very big repos), can be interrupted and restarted between passes, and passes can even be re-run with changed configurations. CVS write access has to be turned off before the start of the final conversion, so no VCS is possible until the conversion is over. So users are very interested in keeping the downtime minimal. Of course VCS is possible before the conversion is over, you can do all the cvs2git commands first, turn on CVS, and let 'git fast-import' do it's thing afterwards, can you not? The blobfile can also be unwieldy (its size is approximately the sum of the sizes of all revisions of all files in the project). Being able to load the blobfile into one fast-import process and the dumpfile into a different process (which relies on the feature that you propose removing) opens up a lot of possibilities: * The first fast-import of the blobfile can be started as soon as the blobfile is complete and run in parallel with the rest of the conversion. Is that reason enough to punish everyone else? * If the blobfile needs to be transferred over the network (e.g., because Git will be served from a different server than the one doing the conversion) the network transfer can also be done in parallel with the rest of the conversion. Ditto. * The blobfile could be written to a named pipe that is being read by a git-fast-import process, to avoid having to write the blobfile to disk in the first place. And it would still be possible. { cat file1; cat file2; } pipe What am I missing? * The user could run git repack between loading the blobfile and loading the dumpfile. Is that reason enough to punish everyone else? These are just the ways that cvs2git does and/or could benefit from the flexibility that is now in git-fast-import. Sure, all those things _migt_ be nice, but nothing would be broken, would it? cvs2git can still achieve it's goal without it. And if this feature is removed by *default*, it would be trivial to enable an option to tell 'git fast-import' to store blobs from cvs2git, would it not? Other tools might also be using git-fast-import in ways that would be broken by your proposed change. Yeah, everything is possible, but the key word is *might*. Will anything be *actually* broken by this change? My guess is, very few things, if any. Cheers. -- Felipe Contreras -- To unsubscribe from this list: send the line unsubscribe git in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 4/4] fast-import: only store commit objects
Hi, On Tue, 7 May 2013, Michael Haggerty wrote: On 05/06/2013 11:04 PM, Felipe Contreras wrote: On Mon, May 6, 2013 at 5:45 AM, Michael Haggerty mhag...@alum.mit.edu wrote: On 05/06/2013 12:32 PM, Thomas Rast wrote: So the proposed change would break a documented use of cvs2git. It's documented as an alternative. How many people actually use this form over the other? Is there really any advantage? It's possibly that basically nobody is using this form, and there's no benefits. You conjectured earlier that nobody uses blob marks, and I provided a counterexample. Then you proposed a workaround that would require changes to the cvs2git documentation, and I even explained how your proposed workaround is not as flexible as the status quo. Do you want to go through the same argument with every possible user of git-fast-import? It would be insanity to change the default behavior when a workaround on the Git side (namely adding an option that tells git-fast-import *not* to emit blob marks) would be quite straightforward to implement and have little maintenance cost. I really wonder how many more counterexamples are required to settle this discussion. There is no good reason to artificially limit Git's capabilities here, especially when it has been demonstrated that supporting that capability is not only possible, but also outright easy. Ciao, Dscho -- To unsubscribe from this list: send the line unsubscribe git in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 4/4] fast-import: only store commit objects
On Mon, May 6, 2013 at 11:39 PM, Johannes Schindelin johannes.schinde...@gmx.de wrote: Hi, On Tue, 7 May 2013, Michael Haggerty wrote: On 05/06/2013 11:04 PM, Felipe Contreras wrote: On Mon, May 6, 2013 at 5:45 AM, Michael Haggerty mhag...@alum.mit.edu wrote: On 05/06/2013 12:32 PM, Thomas Rast wrote: So the proposed change would break a documented use of cvs2git. It's documented as an alternative. How many people actually use this form over the other? Is there really any advantage? It's possibly that basically nobody is using this form, and there's no benefits. You conjectured earlier that nobody uses blob marks, and I provided a counterexample. Then you proposed a workaround that would require changes to the cvs2git documentation, and I even explained how your proposed workaround is not as flexible as the status quo. Do you want to go through the same argument with every possible user of git-fast-import? It would be insanity to change the default behavior when a workaround on the Git side (namely adding an option that tells git-fast-import *not* to emit blob marks) would be quite straightforward to implement and have little maintenance cost. I really wonder how many more counterexamples are required to settle this discussion. One. I haven't seen one use-case that *requires* blob marks, I haven't seen one tool that utilizes blob marks. And no, cvs2git doesn't use marks at all. There is no good reason to artificially limit Git's capabilities here, especially when it has been demonstrated that supporting that capability is not only possible, but also outright easy. Strawman. Nobody is arguing that there shouldn't be an option to enable blob exporting, the argument is that there's no point in making that the *default*. -- Felipe Contreras -- To unsubscribe from this list: send the line unsubscribe git in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 4/4] fast-import: only store commit objects
Felipe Contreras felipe.contre...@gmail.com writes: There's no point in storing blob, they would increase the time of loading the marks, and the vast majority of them will never be used again. This also makes fast-export and fast-import marks compatible. [...] - if (m-data.marked[k]) + if (m-data.marked[k]) { + struct object_entry *e; + e = m-data.marked[k]; + if (e-type != OBJ_COMMIT) + continue; fprintf(f, :% PRIuMAX %s\n, base + k, - sha1_to_hex(m-data.marked[k]-idx.sha1)); + sha1_to_hex(e-idx.sha1)); + } IIUC, you are unconditionally storing only marks to commit objects. Are you allowed to do that at this point? I notice that git-fast-export(1) says --export-marks=file Dumps the internal marks table to file when complete. Marks are written one per line as :markid SHA-1. Only marks for revisions are dumped[...] But git-fast-import(1) says nothing of the sort; I would even claim that --export-marks=file Dumps the internal marks table to file when complete. means that the *full* marks table is dumped. How do we know that this doesn't break any users of fast-import? Your comment isn't very reassuring: the vast majority of them will never be used again So what's with the minority? In any case, if this does go in, please update the documentation to match, probably by copying the sentence from git-fast-export(1). -- Thomas Rast trast@{inf,student}.ethz.ch -- To unsubscribe from this list: send the line unsubscribe git in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 4/4] fast-import: only store commit objects
On Fri, May 3, 2013 at 12:56 PM, Thomas Rast tr...@inf.ethz.ch wrote: Felipe Contreras felipe.contre...@gmail.com writes: How do we know that this doesn't break any users of fast-import? Your comment isn't very reassuring: the vast majority of them will never be used again So what's with the minority? Actually I don't think there's any minority. If the client program doesn't store blobs, the blob marks are not used anyway. So there's no change. However, there's a chance the client programs do rely on blob objects, in which case things would break. I would like to analyze some client programs of fast-import out there, to see what would be the impact. But I think this has to be done either way, the only question is how. Having a warning would take a lot of effort, and it might be for nothing. I think it might make sense to knowingly make the potentially dangerous change, and see what breaks. Most likely nothing will, but if something does, we revert immediately. Otherwise we would be stuck in this non-ideal state forever. Cheers. -- Felipe Contreras -- To unsubscribe from this list: send the line unsubscribe git in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 4/4] fast-import: only store commit objects
Thomas Rast tr...@inf.ethz.ch writes: IIUC, you are unconditionally storing only marks to commit objects. Are you allowed to do that at this point? I notice that git-fast-export(1) says --export-marks=file Dumps the internal marks table to file when complete. Marks are written one per line as :markid SHA-1. Only marks for revisions are dumped[...] But git-fast-import(1) says nothing of the sort; I would even claim that --export-marks=file Dumps the internal marks table to file when complete. means that the *full* marks table is dumped. How do we know that this doesn't break any users of fast-import? Your comment isn't very reassuring: the vast majority of them will never be used again So what's with the minority? In any case, if this does go in, please update the documentation to match, probably by copying the sentence from git-fast-export(1). A safe and sane approach may be to teach these an option to tell them to omit non-commits or to emit all kinds, and make remote-bzr use that to exclude non-commits. If the defaults is matched to the current behaviour, nobody will get hurt, and Felipe's Emacs import, knowing that it does not need marks to blobs, can take advantage of the new feature. -- To unsubscribe from this list: send the line unsubscribe git in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 4/4] fast-import: only store commit objects
On Fri, May 3, 2013 at 5:08 PM, Junio C Hamano gits...@pobox.com wrote: Thomas Rast tr...@inf.ethz.ch writes: IIUC, you are unconditionally storing only marks to commit objects. Are you allowed to do that at this point? I notice that git-fast-export(1) says --export-marks=file Dumps the internal marks table to file when complete. Marks are written one per line as :markid SHA-1. Only marks for revisions are dumped[...] But git-fast-import(1) says nothing of the sort; I would even claim that --export-marks=file Dumps the internal marks table to file when complete. means that the *full* marks table is dumped. How do we know that this doesn't break any users of fast-import? Your comment isn't very reassuring: the vast majority of them will never be used again So what's with the minority? In any case, if this does go in, please update the documentation to match, probably by copying the sentence from git-fast-export(1). A safe and sane approach may be to teach these an option to tell them to omit non-commits or to emit all kinds, and make remote-bzr use that to exclude non-commits. This has nothing to do with remote-bzr, or any remote helper. These objects are not useful, not even to 'git fast-export'. If the defaults is matched to the current behaviour, nobody will get hurt Changing nothing always ensures that nobody will get hurt, but that doesn't improve anything either. -- Felipe Contreras -- To unsubscribe from this list: send the line unsubscribe git in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 4/4] fast-import: only store commit objects
Felipe Contreras felipe.contre...@gmail.com writes: A safe and sane approach may be to teach these an option to tell them to omit non-commits or to emit all kinds, and make remote-bzr use that to exclude non-commits. This has nothing to do with remote-bzr, or any remote helper. These objects are not useful, not even to 'git fast-export'. If the defaults is matched to the current behaviour, nobody will get hurt Changing nothing always ensures that nobody will get hurt, but that doesn't improve anything either. I do not quite follow. Allowing existing code to depend on old behaviour, while letting new code that knows it does not need anything but commits, will get the best of both worlds. The new code will definitely improve (otherwise you won't be writing these patches), and the old code will not get hurt. Where is this doesn't improve anything coming from? -- To unsubscribe from this list: send the line unsubscribe git in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 4/4] fast-import: only store commit objects
On Fri, May 3, 2013 at 6:45 PM, Junio C Hamano gits...@pobox.com wrote: Felipe Contreras felipe.contre...@gmail.com writes: A safe and sane approach may be to teach these an option to tell them to omit non-commits or to emit all kinds, and make remote-bzr use that to exclude non-commits. This has nothing to do with remote-bzr, or any remote helper. These objects are not useful, not even to 'git fast-export'. If the defaults is matched to the current behaviour, nobody will get hurt Changing nothing always ensures that nobody will get hurt, but that doesn't improve anything either. I do not quite follow. Allowing existing code to depend on old behaviour, while letting new code that knows it does not need anything but commits, will get the best of both worlds. The new code will definitely improve (otherwise you won't be writing these patches), and the old code will not get hurt. Where is this doesn't improve anything coming from? So let's suppose we add a new 'feature no-blob-store' to 'git fast-import', and some clients of 'git fast-import' enable it, and benefit from it. How does that benefit the rest of fast-import clients? You are worrying about clients that most likely don't exist, and don't let existing real clients benefit for free. This is like premature optimization. But whatever, let's keep writing and discarding these blobs, at least the previous patches do make such operations fast. You can drop this patch, because I'm not going to write code for clients that don't exist. And it seems clear that even if I investigate client apps of 'git fast-import' and I'm unable to find a single one that utilizes blobs, you still wouldn't apply this patch. So there's no point in investigating what are the *actual* users, if all we are every going to do is worry about hypothetical ones. My main interest is the previous patches, if anybody has any issues with the way this patch is handled, they can either work on the patch itself, or start a new thread in which I will not participate, I will rather concentrate on issues that affect *real* users, and have real fixes reachable today, and the previous patches in this series fit that bill. Cheers. -- Felipe Contreras -- To unsubscribe from this list: send the line unsubscribe git in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html