Re: [fcrepo-dev] RE : Ingest performances

aj...@virginia.edu Tue, 06 Nov 2012 09:47:08 -0800

There is a bit of history to this discussion:

https://jira.duraspace.org/browse/FCREPO-1024


---
A. Soroka
Software & Systems Engineering :: Online Library Environment
the University of Virginia Library

On Nov 6, 2012, at 11:53 AM, Scott Prater wrote:

> Nicolas,
> 
> That's very encouraging... this problem has been on my radar for years.
> 
> Not being very familiar with the internals of DefaultDOManager, I can't 
> comment concretely on the merits of your patch, but I would be especially 
> interested in tests that provoke race conditions and concurrent writes and 
> make sure Fedora handles those situations cleanly: even though PID 
> generation is synchronized, would it still be possible for Fedora to 
> attempt to write the same datastreams, provision identical values in the 
> resource index, etc. in different threads once the object has been created 
> and the PID assigned? As I dimly recall, the getIngestWriter method was 
> synchronized because there were some problems in the early days with 
> concurrent writes. That may be a non-issue now, though, if PID generation 
> is synchronized (the Akubra storage layer is designed to handle writes more 
> robustly).
> 
> Another potential issue: if you're creating a hierarchical tree of objects in 
> parallel and the ingest of a parent object fails, you could be left with 
> orphaned children. But that's something that should be checked and handled 
> higher up the stack, with some transaction/rollback logic.
> 
> -- Scott
> 
> On 11/06/12, Nicolas Hervé wrote:
>> 
>> Hi,
>> 
>> I think I've found the main problem with massively parallel ingestion.
>> 
>> I'm working with the latest GitHub snapshot.
>> 
>> In org.fcrepo.server.storage.DefaultDOManager, the getIngestWriter method 
>> should not be synchronized: there seems to be only a single instance of 
>> that class for the server, so the lock serializes every ingest. The 
>> internal objects of this class seem either to be correctly synchronized 
>> (PID generation) or to be recreated on each call (inside Translator, a new 
>> DODeserializer is created, and the same happens inside the Validator).
>> 
>> I've tested with FOXML ingestion and now almost all the CPUs are used. I 
>> haven't dug deeper to check that none of the inserted objects is corrupted, 
>> but after a quick look, it seems OK. I guess the same kind of patch could 
>> also apply to object deletion.
>> 
>> If one of you who understands that part better could have a look, it seems 
>> it would be a nice patch, not too hard to test, with great performance 
>> improvements.
>> 
>> Regards,
>> 
>> Nicolas HERVE
>> 
>> 
>> On 28/09/2012 11:25, Nicolas Hervé wrote:
>> 
>> 
>>> Hi,
>>> 
>>> indeed, it seems we have exactly the same configuration (millions of DOs 
>>> with some metadata and external content) on almost the same hardware. 
>>> I've not yet identified the bottleneck in the massively parallel ingestion 
>>> process, but I strongly suspect a synchronized portion of code 
>>> somewhere in the chain. I hope Edwin can say more about this :-)
>>> 
>>> For querying the DC fields, indexes have to be created in the MySQL schema, 
>>> and the SQL queries are far from optimal. So far I have only patched for my 
>>> own purposes (my data model / my queries), bypassing some portions of code 
>>> in the following classes:
>>> 
>>> org.fcrepo.server.search.FieldSearchSQLImpl
>>> org.fcrepo.server.search.FieldSearchResultSQLImpl
>>> 
>>> I'm really new to Fedora Commons but, from what I understand, this SQL 
>>> part is quite old. Changing it for optimization purposes could imply 
>>> behaviour changes for other people. That's why I don't think simple patches 
>>> could do the job. It would need a complete refactoring, which could only be 
>>> done with a global view of the different ways these classes are used 
>>> in the different contexts where Fedora instances are running.
>>> 
>>> Feel free to contact me to discuss this more precisely.
>>> 
>>> Regards,
>>> 
>>> Nicolas HERVE
>>> +33 1 49 83 21 66 (GMT + 2)
>>> 
>>> On 27/09/2012 18:23, Jason V wrote:
>>> 
>>> 
>>>> Hi Nicolas, 
>>>> 
>>>> My name is Jason Varghese and I'm a senior developer at the New York 
>>>> Public Library. Based on reading some of your posts, I think you are 
>>>> doing work similar to what I am presently doing. 
>>>> 
>>>> We have a relatively large-scale Fedora implementation here. We've had all 
>>>> the hardware in place for some time and are in the process of migrating 
>>>> from a large homegrown repository to a Fedora-based platform. We have a 
>>>> single Fedora ingest machine and 3 Fedora readers. The ingest machine 
>>>> alone is 4 x 6 core processors w/ 128GB RAM. I'm in the process of 
>>>> generating about 1 million+ digital objects and attaching to each DO 
>>>> all the metadata (as managed content datastreams) and all the digital 
>>>> assets (as external content datastreams). The digital assets currently 
>>>> amount to about 183 TB of content (replicated at two sites). I wrote a 
>>>> multithreaded Java client to handle the Fedora ingest/DO generation, and 
>>>> I use the Mediashelf REST API client for connectivity to Fedora. I was 
>>>> able to successfully ingest tens of thousands of digital objects, but I 
>>>> really need to ensure this process performs optimally and scales to 
>>>> millions of objects. What bottlenecks were you able to identify when 
>>>> running your multithreaded ingest process? I look forward to 
>>>> learning/sharing experiences from this process with you and the 
>>>> community and possibly collaborating. Thanks
>>>> 
>>>> Jason Varghese
>>>> NYPL
>>>> 
>>>> 
>>> 
>>> _______________________________________________
>>> Fedora-commons-users mailing list
>>> fedora-commons-us...@lists.sourceforge.net
>>> https://lists.sourceforge.net/lists/listinfo/fedora-commons-users
>>> 
> 
> -- 
> Scott Prater
> Library, Instructional, and Research Applications (LIRA)
> Division of Information Technology (DoIT)
> University of Wisconsin - Madison
> pra...@wisc.edu
> 
> _______________________________________________
> Fedora-commons-developers mailing list
> Fedora-commons-developers@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/fedora-commons-developers
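
An illustrative aside on the locking question raised in this thread: the sketch below is not Fedora code and does not reproduce DefaultDOManager. It assumes a hypothetical singleton manager (IngestContentionSketch) whose only shared state is a PID counter, and shows that the ingest method itself can stay unsynchronized when ID generation is atomic and everything else is created per call. The concurrent driver doubles as the kind of test Scott asks for: if two threads were ever handed the same PID, the set of issued PIDs would come out smaller than the number of ingests.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical stand-in for a singleton manager; NOT Fedora's DefaultDOManager.
public class IngestContentionSketch {

    // The only state shared between threads in this sketch.
    private final AtomicLong nextId = new AtomicLong();

    // Only the counter needs protection; an AtomicLong provides that
    // without holding a lock for the rest of the ingest.
    String newPid() {
        return "demo:" + nextId.incrementAndGet();
    }

    // Deliberately not synchronized: apart from newPid(), everything it
    // touches is created per call, so threads do not share it.
    String ingest(String payload) {
        StringBuilder perCallState = new StringBuilder(payload); // per-call object
        String pid = newPid();
        perCallState.append('@').append(pid);
        return pid;
    }

    // Runs `threads` workers, each ingesting `perThread` objects against one
    // shared manager, and returns the set of all PIDs that were issued.
    static Set<String> ingestConcurrently(int threads, int perThread)
            throws InterruptedException {
        IngestContentionSketch manager = new IngestContentionSketch();
        Set<String> pids = ConcurrentHashMap.newKeySet();
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (int t = 0; t < threads; t++) {
            pool.submit(() -> {
                for (int i = 0; i < perThread; i++) {
                    pids.add(manager.ingest("object-" + i));
                }
            });
        }
        pool.shutdown();
        if (!pool.awaitTermination(1, TimeUnit.MINUTES)) {
            throw new IllegalStateException("ingest workers did not finish in time");
        }
        return pids;
    }

    public static void main(String[] args) throws InterruptedException {
        int threads = 8;
        int perThread = 10_000;
        Set<String> pids = ingestConcurrently(threads, perThread);
        // A duplicate PID would collapse in the set and shrink its size.
        System.out.println("unique PIDs: " + pids.size()
                + " of " + (threads * perThread) + " ingests");
    }
}
```

Whether the real getIngestWriter path is this clean depends on state not modelled here (the resource index, storage writes), so this only shows the shape of the argument, not a verdict on the patch.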

