Re: [fcrepo-dev] RE : Ingest performances
Jesper, I'd say fork master and submit a pull request. By the way, there's a committers call today (Thursday) from 9-9:30am Eastern (UTC-5) if you want to jump in. I think the agenda is mostly around the upcoming release of 3.6.2, but I'd welcome having either (or both) of you and Nicolas join if you're available. https://wiki.duraspace.org/display/FCREPO/2012-11-08+-+Fedora+Committer+Meeting Eddie Hi Eddie. For some reason I did not receive the above from the mailing list. I just stumbled accidentally upon it when looking for something else in the mailing-list archive. I have created a fork, implemented what I think is necessary, and is currently testing it out. I'll submit a pull request soon. -Jesper On ons, 2012-11-07 at 16:12 +0100, Jesper Damkjaer wrote: I'll like to take a stab at this one. I'm not sure though if I can create a branch or if I should create a fork. Which do You prefer? -Jesper On ons, 2012-11-07 at 12:09 +0100, Jesper Damkjaer wrote: Yes, I think it would. The problem is that the PID cannot be determined until just before the call to objectExists(). And moving the pidLock.getWriteLock() up here would essentially do the same as You suggest. -Jesper On ons, 2012-11-07 at 10:57 +, Nicolas Hervé wrote: A lock on a unique object representing the PID to be inserted would be enough to prevent this, no ? ;ppvg usxd9ds=lii On 06/11/2012 23:51, Jesper Damkjaer wrote: It might be possible to simply just move the call to pidLock.getWriteLock() up before the call to objectExists(). Then the first thread will get the lock and the second thread will be rejected. Even if the first thread have released the writelock, then it will have registered the object and the second thread will be rejected by objectExists(). This may of course give a false result if the first thread for some reason fails to actually create the object. Then both threads are rejected and no objects are created even though the second thread might have succeeded (if it had won the race). I believe we have another kind of locking mechanism which is a stringLock. This lock doesnt reject the second thread - it just make it wait until the first thread releases the lock. If this kind of lock is used, then only one of the threads can enter the last part of getIngestWriter (assuming the lock is before objectExists()) and if the first thread succeeds the second thread will, upon release, get an exception from objectExists(). If on the other hand, the first thread fails to create an object then, when the lock is released, the second thread can try to create the object. -Jesper On tir, 2012-11-06 at 17:17 -0500, aj...@virginia.edu wrote: It's starting to sound like we need a pattern that is less constrictive of resources than the use of synchronized, but still maintains some kind of guard against the kind of collision Jesper describes. A simpleminded thought: perhaps we could insert a lightweight check against the registry before registerObject() with a line of execution that backs out any changes if the check fails? Because it's an ingest, those changes shouldn't be hard to describe or undo. --- A. Soroka Software Systems Engineering :: Online Library Environment the University of Virginia Library On Nov 6, 2012, at 4:22 PM, Jesper Damkjaer wrote: I just had a quick look at the code. As far as I can see (and I may not be right) one problem may be the following: Assume both threads with same PID are inside the method getIngestWriter and has passed the method-call objectExists(...). At this point neither of the objects exists assuming none of the threads has yet reached registerObject(...). If one of the threads then get the writeLock, does what it needs to do and releases the writeLock before the second thread reaches writeLock, then the second thread will reach registerObject(...). As far as I can tell, inside registerObject(...) an SQLException will be thrown since the row already exists in the database, and the method unregisterObject(...) will be called actually unregistering a legally created object. Someone with more insight into the code should verify if the above is correct. -Jesper On tir, 2012-11-06 at 21:44 +0100, Jesper Damkjaer wrote: Hi. I originally opened the https://jira.duraspace.org/browse/FCREPO-1024 I tried back then to create tests in order to convince myself that there would be no problems in removing the synchronization, but that proved rather difficult. As far as I remember there seemed to be some problems associated with removing the synchronization, especially if two documents with the same PID were ingested at the same
Re: [fcrepo-dev] RE : Ingest performances
Yes, I think it would. The problem is that the PID cannot be determined until just before the call to objectExists(). And moving the pidLock.getWriteLock() up here would essentially do the same as You suggest. -Jesper On ons, 2012-11-07 at 10:57 +, Nicolas Hervé wrote: A lock on a unique object representing the PID to be inserted would be enough to prevent this, no ? On 06/11/2012 23:51, Jesper Damkjaer wrote: It might be possible to simply just move the call to pidLock.getWriteLock() up before the call to objectExists(). Then the first thread will get the lock and the second thread will be rejected. Even if the first thread have released the writelock, then it will have registered the object and the second thread will be rejected by objectExists(). This may of course give a false result if the first thread for some reason fails to actually create the object. Then both threads are rejected and no objects are created even though the second thread might have succeeded (if it had won the race). I believe we have another kind of locking mechanism which is a stringLock. This lock doesnt reject the second thread - it just make it wait until the first thread releases the lock. If this kind of lock is used, then only one of the threads can enter the last part of getIngestWriter (assuming the lock is before objectExists()) and if the first thread succeeds the second thread will, upon release, get an exception from objectExists(). If on the other hand, the first thread fails to create an object then, when the lock is released, the second thread can try to create the object. -Jesper On tir, 2012-11-06 at 17:17 -0500, aj...@virginia.edu wrote: It's starting to sound like we need a pattern that is less constrictive of resources than the use of synchronized, but still maintains some kind of guard against the kind of collision Jesper describes. A simpleminded thought: perhaps we could insert a lightweight check against the registry before registerObject() with a line of execution that backs out any changes if the check fails? Because it's an ingest, those changes shouldn't be hard to describe or undo. --- A. Soroka Software Systems Engineering :: Online Library Environment the University of Virginia Library On Nov 6, 2012, at 4:22 PM, Jesper Damkjaer wrote: I just had a quick look at the code. As far as I can see (and I may not be right) one problem may be the following: Assume both threads with same PID are inside the method getIngestWriter and has passed the method-call objectExists(...). At this point neither of the objects exists assuming none of the threads has yet reached registerObject(...). If one of the threads then get the writeLock, does what it needs to do and releases the writeLock before the second thread reaches writeLock, then the second thread will reach registerObject(...). As far as I can tell, inside registerObject(...) an SQLException will be thrown since the row already exists in the database, and the method unregisterObject(...) will be called actually unregistering a legally created object. Someone with more insight into the code should verify if the above is correct. -Jesper On tir, 2012-11-06 at 21:44 +0100, Jesper Damkjaer wrote: Hi. I originally opened the https://jira.duraspace.org/browse/FCREPO-1024 I tried back then to create tests in order to convince myself that there would be no problems in removing the synchronization, but that proved rather difficult. As far as I remember there seemed to be some problems associated with removing the synchronization, especially if two documents with the same PID were ingested at the same time. I'm still interested in looking in to this, and will be happy to look at my old branch. -Jesper On tir, 2012-11-06 at 12:46 -0500, aj...@virginia.edu wrote: There is a bit of history to this discussion: https://jira.duraspace.org/browse/FCREPO-1024 --- A. Soroka Software Systems Engineering :: Online Library Environment the University of Virginia Library On Nov 6, 2012, at 11:53 AM, Scott Prater wrote: Nicolas, That's very encouraging... this problem has been on my radar for years. Not being very familiar with the details of the internals of DefaultDOManager, I can't comment concretely on the merits of your patch, but I would be especially interested in tests that provoke race conditions and concurrent writes, and makes sure Fedora handles those situations cleanly: even though PID generation is synchronized, would it still be possible for Fedora to attempt to write the same datastreams, provision identical values in the resource index, etc. in different threads once the object has been created and the PID assigned? As I dimly recall, I think the getIngestWriter method was synchronized because there were some problems in the early
Re: [fcrepo-dev] RE : Ingest performances
I'll like to take a stab at this one. I'm not sure though if I can create a branch or if I should create a fork. Which do You prefer? -Jesper On ons, 2012-11-07 at 12:09 +0100, Jesper Damkjaer wrote: Yes, I think it would. The problem is that the PID cannot be determined until just before the call to objectExists(). And moving the pidLock.getWriteLock() up here would essentially do the same as You suggest. -Jesper On ons, 2012-11-07 at 10:57 +, Nicolas Hervé wrote: A lock on a unique object representing the PID to be inserted would be enough to prevent this, no ? ;ppvg usxd9ds=lii On 06/11/2012 23:51, Jesper Damkjaer wrote: It might be possible to simply just move the call to pidLock.getWriteLock() up before the call to objectExists(). Then the first thread will get the lock and the second thread will be rejected. Even if the first thread have released the writelock, then it will have registered the object and the second thread will be rejected by objectExists(). This may of course give a false result if the first thread for some reason fails to actually create the object. Then both threads are rejected and no objects are created even though the second thread might have succeeded (if it had won the race). I believe we have another kind of locking mechanism which is a stringLock. This lock doesnt reject the second thread - it just make it wait until the first thread releases the lock. If this kind of lock is used, then only one of the threads can enter the last part of getIngestWriter (assuming the lock is before objectExists()) and if the first thread succeeds the second thread will, upon release, get an exception from objectExists(). If on the other hand, the first thread fails to create an object then, when the lock is released, the second thread can try to create the object. -Jesper On tir, 2012-11-06 at 17:17 -0500, aj...@virginia.edu wrote: It's starting to sound like we need a pattern that is less constrictive of resources than the use of synchronized, but still maintains some kind of guard against the kind of collision Jesper describes. A simpleminded thought: perhaps we could insert a lightweight check against the registry before registerObject() with a line of execution that backs out any changes if the check fails? Because it's an ingest, those changes shouldn't be hard to describe or undo. --- A. Soroka Software Systems Engineering :: Online Library Environment the University of Virginia Library On Nov 6, 2012, at 4:22 PM, Jesper Damkjaer wrote: I just had a quick look at the code. As far as I can see (and I may not be right) one problem may be the following: Assume both threads with same PID are inside the method getIngestWriter and has passed the method-call objectExists(...). At this point neither of the objects exists assuming none of the threads has yet reached registerObject(...). If one of the threads then get the writeLock, does what it needs to do and releases the writeLock before the second thread reaches writeLock, then the second thread will reach registerObject(...). As far as I can tell, inside registerObject(...) an SQLException will be thrown since the row already exists in the database, and the method unregisterObject(...) will be called actually unregistering a legally created object. Someone with more insight into the code should verify if the above is correct. -Jesper On tir, 2012-11-06 at 21:44 +0100, Jesper Damkjaer wrote: Hi. I originally opened the https://jira.duraspace.org/browse/FCREPO-1024 I tried back then to create tests in order to convince myself that there would be no problems in removing the synchronization, but that proved rather difficult. As far as I remember there seemed to be some problems associated with removing the synchronization, especially if two documents with the same PID were ingested at the same time. I'm still interested in looking in to this, and will be happy to look at my old branch. -Jesper On tir, 2012-11-06 at 12:46 -0500, aj...@virginia.edu wrote: There is a bit of history to this discussion: https://jira.duraspace.org/browse/FCREPO-1024 --- A. Soroka Software Systems Engineering :: Online Library Environment the University of Virginia Library On Nov 6, 2012, at 11:53 AM, Scott Prater wrote: Nicolas, That's very encouraging... this problem has been on my radar for years. Not being very familiar with the details of the internals of DefaultDOManager, I can't comment concretely on the merits of your patch, but I would be especially interested in tests that provoke race conditions and concurrent writes, and makes sure Fedora handles those situations cleanly: even though PID generation is
Re: [fcrepo-dev] RE : Ingest performances
If I recall correctly, that method is synchronized so that a document factory can be reused. We might be able to remove the synchronization if we use the Templates class, I'll look into it. It is important to note that having a single manager object increases the importance of guaranteeing thread safety, particularly in the situation you describe. Also, the cost of acquiring an object lock is typically less than recreating complex objects, but that's something we can test for the 3.7 branch. - Ben On Nov 6, 2012 11:14 AM, Nicolas Hervé nhe...@ina.fr wrote: Hi, I think I've found the main problem with massive parallel ingestion. I'm working with the last github snapshot. In org.fcrepo.server.storage.DefaultDOManager, the getIngestWriter method should not be synchronized as it seems there is only a single instance of that class for the server. The internal objects of this class seem to be correctly synchronized (pid generation) of new objects are recreated on each call (inside Translator, a new DODeserializer is created and the same happens inside the Validator). I've tested with FOXML ingestion and now almost all the CPUs are used. I've not been deeper to check that every inserted object is not corrupted, but after a quick look, it seems OK. I guess the same kind of patch could also apply on object deletion. If one of you that better understand that part could have a look, it seems it would be a nice patch, not too hard to test, with great performance improvements. Regards, Nicolas HERVE On 28/09/2012 11:25, Nicolas Hervé wrote: Hi, indeed, it seems we are exactly in the same configuration (millions of DO with some metadata and external content) with almost the same hardware. I've not identified the bottleneck in the massive parallel ingestion process right now, but I highly suspect a synchronized portion of code somewhere in the chain. I hope Edwin could say more about this :-) For the querying of dc fields, index have to be created in the Mysql schema and SQL queries are far from being optimal. Currently I only patched for my own purposes (my datamodel / my queries) and I bypass some code portions in the following classes : org.fcrepo.server.search.FieldSearchSQLImpl org.fcrepo.server.search.FieldSearchResultSQLImpl I'm really new to Fedora Commons but, from what I understand, these SQL part is quite old. Changing them for optimizations purposes could imply behaviour changes for other people. That's why I don't think simple patches could do the job. It would need a complete refactoring. That could only be done with a global point of view on the different way this classes are used in the different contexts where Fedora instances are running. Feel free to contact me to discuss this more precisely. Regards, Nicolas HERVE +33 1 49 83 21 66 (GMT + 2) On 27/09/2012 18:23, Jason V wrote: Hi Nicolas, My name is Jason Varghese and I'm a senior developer at the New York Public Library. I think you are doing work similar to what I am presently doing based on reading some of your posts. We have a relatively large scale Fedora implementation here. We've had all the hardware in place for some time and are in the process of migrating from a large homegrown repository to a Fedora based platform. We have a single Fedora ingest machine and 3 Fedora readers. The ingest machine alone is 4 x 6 core processors w/ 128GB RAM. I'm in the process of generating about 1 million+ digital objects and attaching to each DO all the metadata (as managed content datastreams) and all the digital assets (as external content datastreams). The digital assets currently are about 183 TB of content (this is replicated at two sites). I have a multithreaded java client I wrote to accomplish the task for the Fedora ingest/DO generation and I use the Mediashelf REST API client for connectivity to Fedora. I was able to successfully ingest 10's of thousands of digital objects, but really need ensure this process performs optimally and scales for millions of objects. What bottlenecks were you able to identify when running your multithreaded ingest process? Look forward to learning/sharing experiences from this process with you and the community and possibly collaborating. Thanks Jason Varghese NYPL -- Got visibility? Most devs has no idea what their production app looks like. Find out how fast your code is with AppDynamics Lite.http://ad.doubleclick.net/clk;262219671;13503038;y?http://info.appdynamics.com/FreeJavaPerformanceDownload.html ___ Fedora-commons-users mailing listFedora-commons-users@lists.sourceforge.nethttps://lists.sourceforge.net/lists/listinfo/fedora-commons-users -- LogMeIn Central: Instant, anywhere, Remote PC access and
Re: [fcrepo-dev] RE : Ingest performances
Nicolas, That's very encouraging... this problem has been on my radar for years. Not being very familiar with the details of the internals of DefaultDOManager, I can't comment concretely on the merits of your patch, but I would be especially interested in tests that provoke race conditions and concurrent writes, and makes sure Fedora handles those situations cleanly: even though PID generation is synchronized, would it still be possible for Fedora to attempt to write the same datastreams, provision identical values in the resource index, etc. in different threads once the object has been created and the PID assigned? As I dimly recall, I think the getIngestWriter method was synchronized because there were some problems in the early days with concurrent writes. That may be a non-issue now, though, if PID generation is synchronized (the Akubra storage layer is designed to handle writes more robustly). Another potential issue: if you're creating a hierarchical tree of objects in parallel, and the ingest of a parent object fails: you could be left with orphaned children. But that's something that should be checked and handled higher up the stack, with some transaction/rollback logic. -- Scott On 11/06/12, Nicolas Hervé wrote: Hi, I think I've found the main problem with massive parallel ingestion. I'm working with the last github snapshot. In org.fcrepo.server.storage.DefaultDOManager, the getIngestWriter method should not be synchronized as it seems there is only a single instance of that class for the server. The internal objects of this class seem to be correctly synchronized (pid generation) of new objects are recreated on each call (inside Translator, a new DODeserializer is created and the same happens inside the Validator). I've tested with FOXML ingestion and now almost all the CPUs are used. I've not been deeper to check that every inserted object is not corrupted, but after a quick look, it seems OK. I guess the same kind of patch could also apply on object deletion. If one of you that better understand that part could have a look, it seems it would be a nice patch, not too hard to test, with great performance improvements. Regards, Nicolas HERVE On 28/09/2012 11:25, Nicolas Hervé wrote: Hi, indeed, it seems we are exactly in the same configuration (millions of DO with some metadata and external content) with almost the same hardware. I've not identified the bottleneck in the massive parallel ingestion process right now, but I highly suspect a synchronized portion of code somewhere in the chain. I hope Edwin could say more about this :-) For the querying of dc fields, index have to be created in the Mysql schema and SQL queries are far from being optimal. Currently I only patched for my own purposes (my datamodel / my queries) and I bypass some code portions in the following classes : org.fcrepo.server.search.FieldSearchSQLImpl org.fcrepo.server.search.FieldSearchResultSQLImpl I'm really new to Fedora Commons but, from what I understand, these SQL part is quite old. Changing them for optimizations purposes could imply behaviour changes for other people. That's why I don't think simple patches could do the job. It would need a complete refactoring. That could only be done with a global point of view on the different way this classes are used in the different contexts where Fedora instances are running. Feel free to contact me to discuss this more precisely. Regards, Nicolas HERVE #43;33 1 49 83 21 66 (GMT #43; 2) On 27/09/2012 18:23, Jason V wrote: Hi Nicolas, My name is Jason Varghese and I'm a senior developer at the New York Public Library. I think you are doing work similar to what I am presently doing based on reading some of your posts. We have a relatively large scale Fedora implementation here. We've had all the hardware in place for some time and are in the process of migrating from a large homegrown repository to a Fedora based platform. We have a single Fedora ingest machine and 3 Fedora readers. The ingest machine alone is 4 x 6 core processors w/ 128GB RAM. I'm in the process of generating about 1 million#43; digital objects and attaching to each DO all the metadata (as managed content datastreams) and all the digital assets (as external content datastreams). The digital assets currently are about 183 TB of content (this is replicated at two sites). I have a multithreaded java client I wrote to accomplish the task for the Fedora ingest/DO generation and I use the Mediashelf REST API client for connectivity to Fedora. I was able to successfully ingest 10's of thousands of digital objects, but really need ensure this process performs optimally and scales for millions of objects. What bottlenecks were you able to identify when running your
Re: [fcrepo-dev] RE : Ingest performances
It might be possible to simply just move the call to pidLock.getWriteLock() up before the call to objectExists(). Then the first thread will get the lock and the second thread will be rejected. Even if the first thread have released the writelock, then it will have registered the object and the second thread will be rejected by objectExists(). This may of course give a false result if the first thread for some reason fails to actually create the object. Then both threads are rejected and no objects are created even though the second thread might have succeeded (if it had won the race). I believe we have another kind of locking mechanism which is a stringLock. This lock doesnt reject the second thread - it just make it wait until the first thread releases the lock. If this kind of lock is used, then only one of the threads can enter the last part of getIngestWriter (assuming the lock is before objectExists()) and if the first thread succeeds the second thread will, upon release, get an exception from objectExists(). If on the other hand, the first thread fails to create an object then, when the lock is released, the second thread can try to create the object. -Jesper On tir, 2012-11-06 at 17:17 -0500, aj...@virginia.edu wrote: It's starting to sound like we need a pattern that is less constrictive of resources than the use of synchronized, but still maintains some kind of guard against the kind of collision Jesper describes. A simpleminded thought: perhaps we could insert a lightweight check against the registry before registerObject() with a line of execution that backs out any changes if the check fails? Because it's an ingest, those changes shouldn't be hard to describe or undo. --- A. Soroka Software Systems Engineering :: Online Library Environment the University of Virginia Library On Nov 6, 2012, at 4:22 PM, Jesper Damkjaer wrote: I just had a quick look at the code. As far as I can see (and I may not be right) one problem may be the following: Assume both threads with same PID are inside the method getIngestWriter and has passed the method-call objectExists(...). At this point neither of the objects exists assuming none of the threads has yet reached registerObject(...). If one of the threads then get the writeLock, does what it needs to do and releases the writeLock before the second thread reaches writeLock, then the second thread will reach registerObject(...). As far as I can tell, inside registerObject(...) an SQLException will be thrown since the row already exists in the database, and the method unregisterObject(...) will be called actually unregistering a legally created object. Someone with more insight into the code should verify if the above is correct. -Jesper On tir, 2012-11-06 at 21:44 +0100, Jesper Damkjaer wrote: Hi. I originally opened the https://jira.duraspace.org/browse/FCREPO-1024 I tried back then to create tests in order to convince myself that there would be no problems in removing the synchronization, but that proved rather difficult. As far as I remember there seemed to be some problems associated with removing the synchronization, especially if two documents with the same PID were ingested at the same time. I'm still interested in looking in to this, and will be happy to look at my old branch. -Jesper On tir, 2012-11-06 at 12:46 -0500, aj...@virginia.edu wrote: There is a bit of history to this discussion: https://jira.duraspace.org/browse/FCREPO-1024 --- A. Soroka Software Systems Engineering :: Online Library Environment the University of Virginia Library On Nov 6, 2012, at 11:53 AM, Scott Prater wrote: Nicolas, That's very encouraging... this problem has been on my radar for years. Not being very familiar with the details of the internals of DefaultDOManager, I can't comment concretely on the merits of your patch, but I would be especially interested in tests that provoke race conditions and concurrent writes, and makes sure Fedora handles those situations cleanly: even though PID generation is synchronized, would it still be possible for Fedora to attempt to write the same datastreams, provision identical values in the resource index, etc. in different threads once the object has been created and the PID assigned? As I dimly recall, I think the getIngestWriter method was synchronized because there were some problems in the early days with concurrent writes. That may be a non-issue now, though, if PID generation is synchronized (the Akubra storage layer is designed to handle writes more robustly). Another potential issue: if you're creating a hierarchical tree of objects in parallel, and the ingest of a parent object fails: you could be left with orphaned children. But that's something that should be checked and handled higher up the stack, with some