Hi Chintu,

From: <Mistry>, "Chintu [COLUMBUS TECHNOLOGIES AND SERVICES INC] (GSFC-586.0)" <[email protected]>
Date: Wednesday, December 12, 2012 12:02 PM
To: jpluser <[email protected]>, "[email protected]" <[email protected]>
Subject: Re: OODT 0.3 branch

If you are saying that FM can handle multiple connections at one time,

Yep I'm saying that it can.

then multiple crawlers pointing to same FM should increase performance significantly.

Well that really depends, to be honest. It sounds like you guys are potentially hitting an IO bottleneck in data transfer? What file sizes are you transferring? If you are IO bound on the data transfer part, the product isn't fully ingested until:

1. its entry is added to the catalog
2. the data transfer finishes

Are you checking the FM for status along the way? Also realize that the FM will never be faster than the file system, so if it takes the file system X minutes to transfer a file F1, Y to transfer F2, and Z to transfer F3, then you still have to wait at least max(X,Y,Z) for the 3 ingestions to complete, regardless.

But that’s not what we saw in our tests. For example, I saw barely 2 minutes performance difference between 2FM-6CR and 3FM-6CR.

1) 2 hours 6 minutes to process 262G (1FM 3CR - 3CR to 1FM)
2) 1 hour 58 minutes to process 262G (1FM 6CR - 6CR to 1FM)
3) 1 hour 39 minutes to process 262G (2FM 6CR - 3CR to 1FM)
4) 1 hour 39 minutes to process 262G (2FM 9CR - 4+CR to 1FM)
5) 1 hour 37 minutes to process 262G (3FM 9CR - 3CR to 1FM)
6) 2 hours to process 262G (3FM 20CR - 6+CR to 1FM)
7) 28 minutes to process 262G (6FM 9CR - 1+CR to 1FM) => This is my latest test and this is a good number.

What would be interesting is simply looking at how long it takes to cp the files (which I bet is what's happening) versus mv'ing the files by hand. If mv is faster, I'd:

1. Implement a Data Transfer implementation that simply replaces the calls to FileUtils.copyFile or .moveFile with system calls (see ExecHelper from oodt-commons) to the UNIX equivalents.
2. Plug that data transfer in to your crawler invocations via the cmd line.

HTH!
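
The cp-versus-mv comparison suggested above can be tried outside OODT first. The following is only a minimal sketch: the class name `SystemMoveSketch` and its methods are hypothetical, it uses the JDK's `ProcessBuilder` as a stand-in for ExecHelper, and it does not implement the File Manager's data transfer interface. It assumes a UNIX-like host with `mv` on the PATH.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

/** Hypothetical helper for timing an in-JVM copy against a system-level mv. Not part of OODT. */
public class SystemMoveSketch {

    /** Moves src to dest by invoking the UNIX mv command (assumes mv is on the PATH). */
    public static void systemMove(Path src, Path dest) throws IOException, InterruptedException {
        Process p = new ProcessBuilder("mv", src.toString(), dest.toString())
                .inheritIO() // let mv's error messages reach the console
                .start();
        int exit = p.waitFor();
        if (exit != 0) {
            throw new IOException("mv exited with status " + exit);
        }
    }

    /** Copies src to dest inside the JVM, for comparison with the system call. */
    public static void javaCopy(Path src, Path dest) throws IOException {
        Files.copy(src, dest, StandardCopyOption.REPLACE_EXISTING);
    }

    public static void main(String[] args) throws Exception {
        // Work in a scratch directory with an 8 MB sample file (size chosen arbitrarily).
        Path dir = Files.createTempDirectory("transfer-sketch");
        Path src = Files.write(dir.resolve("sample.dat"), new byte[8 * 1024 * 1024]);

        long t0 = System.nanoTime();
        javaCopy(src, dir.resolve("copied.dat"));
        long t1 = System.nanoTime();
        systemMove(src, dir.resolve("moved.dat"));
        long t2 = System.nanoTime();

        System.out.printf("copy: %d ms, mv: %d ms%n",
                (t1 - t0) / 1_000_000, (t2 - t1) / 1_000_000);
    }
}
```

Note that when source and destination are on the same file system, mv is typically just a rename and moves no data, so the comparison is only meaningful if the scratch paths are replaced with the real staging area and archive locations.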

Cheers,
Chris

From: <Mattmann>, Chris A <[email protected]>
Date: Wednesday, December 12, 2012 2:51 PM
To: "Mistry, Chintu (GSFC-586.0)[COLUMBUS TECHNOLOGIES AND SERVICES INC]" <[email protected]>, "[email protected]" <[email protected]>
Subject: Re: OODT 0.3 branch

Hey Chintu,

From: <Mistry>, "Chintu [COLUMBUS TECHNOLOGIES AND SERVICES INC] (GSFC-586.0)" <[email protected]>
Date: Tuesday, December 11, 2012 2:41 PM
To: jpluser <[email protected]>, "[email protected]" <[email protected]>
Subject: Re: OODT 0.3 branch

Answers inline below.

---snip

Gotcha, so you are using different product types. So, each crawler is crawling various product types in each one of the staging area dirs, that looks like e.g.,

/STAGING_AREA_BASE
  /dir1 – 1st crawler
    - file1 of product type 1
    - file2 of product type 3
  /dir2 – 2nd crawler
    - file3 of product type 3
  /dir3 – 3rd crawler
    - file4 of product type 2

Is that what the staging area looks like?

- YES

And then your FM is ingesting all 3 product types (I just picked 3 arbitrarily, could have been N) into:

ARCHIVE_BASE/{ProductTypeName}/{YYYYMMDD}

Correct?

- YES

If so, I would imagine that FM1 and FM2 and FM3 would actually speed up the ingestion process compared to just using 1 FM with 1, or 2 or 3 crawlers all talking to it.

Let me ask a few more questions: Do you see e.g., in the above example that file4 is ingested before file2? What about file3 before file2? If not, there is something wiggy going on.

- I have not checked that. I guess I can check that.

Can FM handle multiple connections at the same time?

Yep FM can handle multiple connections at one time up to a limit (I think hard defaulted to ~100-200 by the underlying XMLRPC 2.1 library).
We're using an old library currently but have a goal to upgrade to the latest version where I think this # is configurable.

Cheers,
Chris
