Tobias,

The problem is that the "cleanup" operation is not aware of what has been archived, distributed or published. It is not aware because these are effectively different mediapackages in different places within Matterhorn, even though they are actually the same recording in different contexts. It all comes down to what Olli said:
> In my opinion it is _crucial_ that MH keeps track of all the metadata and
> that the media files are handled by MH

But it doesn't (or at least it doesn't do it very well), and the archived copy gets out of sync, etc. So what's the point of archiving one copy but working on *another* copy of the mediapackage, anyway? We are working on the same resources but from different (and unsynchronized) views.

Also, when someone is testing Matterhorn, disk consumption is a characteristic to consider as well. If we keep all the "garbage" generated in the workflows, we'll bias the estimates adopters make about how much disk space Matterhorn requires. So I disagree with Tobias: this matter is relevant not only to those who are already designing their pilots, but also to those who are considering whether or not they are interested in deploying such pilots.

There's also a solution that Tobias already hinted at in his mail: as long as the "cleanup" operation comes *before* the "archive" and "publish" operations, we won't end up with broken references.

Therefore, I'm afraid I'm voting -1 on this proposal, and I #propose changing it to "place the operations in the correct order; i.e. cleanup first and then archive and publish". Of course, that probably means the arguments to the cleanup operation will have to change, so that all the files intended to be published and archived are kept (and it has just occurred to me that if you change those arguments accordingly, it doesn't matter where you put the operation, since those files will be kept anyway).

Best regards
Rubén

2012/6/20 Tobias Wunden <[email protected]>

> Hi Ruben,
>
> On 20.06.2012, at 11:41, Rubén Pérez <[email protected]> wrote:
>
> > I don't quite follow. How are those mediapackages invalid? When mediapackage elements are cleaned up, they are effectively taken out from the manifests and so on, so I don't know why there should be "pointers to files and catalogs that do not exist anymore".
>
> depending on where you put the cleanup operation in your workflow, you are right and there are no dangling pointers. This is the case if you put that operation before "archive" and "publish". If you put it after either of these two (which currently is the case in the workflows we ship), the mediapackages stored in either one of these two systems will be invalid because they are now pointing to files that are no longer there.
>
> > I'm particularly against keeping intermediate "work" files, which are created in the middle steps of the workflow but never get distributed. Those files are a consequence of the specific implementation of the workflow and, should another workflow be run, they should be re-created as needed. After the workflow ends, in my view they're just garbage (which accounts for the fact that the name given to the operation that gets rid of those files is "cleanup").
>
> I agree.
>
> > If somehow the cleanup operation is not correctly deleting all the references to the deleted files, then the answer is not to skip the cleanup operation altogether, because the need to save disk space is still there, and it's critical in most cases. The right way to go is fixing the cleanup operation, or whichever processes are failing to update the broken references. If the problem is that the distributed files are not kept, then it's a question of changing the default workflow and telling the "cleanup" operation to keep those files as well.
>
> Saving disk space becomes important as you are moving from "let's take a look at Matterhorn" to a production environment. At that point, you'll also have some more insight into what the workflows do, where they are putting files and what implications it may have to remove some of them as part of the cleanup operation.
> This is why I am suggesting to keep these files by default and let people make a conscious decision on whether to throw them away or not, instead of throwing them away in the first place and hoping that people make conscious decisions on keeping them.
>
> If your system is running out of disk space, you will start thinking about strategies to overcome this issue. If your data is gone by default, there is not a lot you can do in hindsight...
>
> > As an adopter institution, the disk space consumed by Matterhorn (by our media content in general) is a critical issue. I won't vote on this until I know more about those "broken references", but the cleanup operation makes the disk consumption more efficient, and in general I'm against removing it completely from the default workflow.
>
> It is a critical issue, there is no doubt about it. But the "out of the box" experience of Matterhorn is critical, and by throwing stuff away during cleanup in an inconsistent way, we are more or less making sure that people will be running into errors when using the episode UI to do reprocessing of any kind or even simple retractions.
>
> Tobias
> _______________________________________________
> Matterhorn mailing list
> [email protected]
> http://lists.opencastproject.org/mailman/listinfo/matterhorn
>
>
> To unsubscribe please email
> [email protected]
> _______________________________________________
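
To make the ordering argument in this thread concrete, here is a minimal sketch in Python. It is a toy model and not Matterhorn's API: the function names, the file names, and the `keep` / `to_archive` arguments are all hypothetical. It assumes, as described above, that the archived mediapackage stores references to files rather than its own copies of them.

```python
# Toy model (hypothetical, not Matterhorn code): a mediapackage manifest is a
# list of file names and the working file repository is a set of file names.

def cleanup(manifest, storage, keep):
    """Delete every file not listed in `keep` and drop it from the manifest."""
    for name in manifest:
        if name not in keep:
            storage.discard(name)
    return [name for name in manifest if name in keep]


def dangling(snapshot, storage):
    """References in a stored snapshot whose files no longer exist."""
    return [name for name in snapshot if name not in storage]


def run(order, keep, to_archive=("source.mov", "final.mp4")):
    """Run the operations in `order`; return (archived, dangling references)."""
    storage = {"source.mov", "work.avi", "final.mp4"}
    manifest = ["source.mov", "work.avi", "final.mp4"]
    archived = []
    for op in order:
        if op == "cleanup":
            manifest = cleanup(manifest, storage, keep)
        elif op == "archive":
            # Assumption: the archive records references, not file copies.
            archived = [name for name in manifest if name in to_archive]
    return archived, dangling(archived, storage)


if __name__ == "__main__":
    # Cleanup after archive, keeping too little: a broken reference remains.
    print(run(("archive", "cleanup"), keep={"final.mp4"}))
    # Cleanup before archive: no broken references, but less gets archived.
    print(run(("cleanup", "archive"), keep={"final.mp4"}))
    # Cleanup told to keep everything that is archived: order is irrelevant.
    print(run(("archive", "cleanup"), keep={"source.mov", "final.mp4"}))
```

In this model, once `keep` covers everything in `to_archive`, both orders give the same result and only the intermediate `work.avi` is deleted.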
