Thanks, Cam, for the use cases and insight.

Cheers,
Chris

On 12/13/12 9:03 PM, "Cameron Goodale" <[email protected]> wrote:

>Chintu,
>
>I see that your test data volume is 262GB, but I am curious about the
>makeup of the data. On average, what is your file size, and how many
>files are there?
>
>The reason I ask is that the process of extraction and ingestion can
>vary wildly. On the LMMP project I was ingesting 12GB DEMs over NFS and
>it was a slow process. It was basically serial with 1CR+1FM, but we
>didn't have a requirement to push large volumes of data.
>
>On our recent Snow Data System I am processing 160 workflow jobs in
>parallel and OODT could handle the load; it turned out the filesystem
>was our major bottleneck. We used a SAN initially when doing
>development, but when we increased the number of jobs in parallel the
>I/O became so bad that we moved to GlusterFS. GlusterFS had speed
>improvements over the SAN, but we had to be careful about heavy
>writing, moving, and deleting, since the clustering would try to
>replicate the data. It turns out Gluster is great for heavy writing OR
>heavy reading, but not both at the same time. Finally, we are using NAS
>and it works great.
>
>My point is that the file system plays a major role in performance when
>ingesting data. The ultimate speed test would be if you could actually
>write the data into the final archive directory and basically do an
>ingestion in place (skip data transfer entirely), but I know that is
>rarely possible.
>
>This is an interesting challenge to see what configuration will yield
>the best throughput/performance. I look forward to hearing more about
>your progress on this.
>
>Best Regards,
>
>Cameron
>
>On Wed, Dec 12, 2012 at 7:28 PM, Mattmann, Chris A (388J) <[email protected]> wrote:
>
>> Hi Chintu,
>>
>> From: <Mistry>, "Chintu [COLUMBUS TECHNOLOGIES AND SERVICES INC] (GSFC-586.0)" <[email protected]>
>> Date: Wednesday, December 12, 2012 12:02 PM
>> To: jpluser <[email protected]>, "[email protected]" <[email protected]>
>> Subject: Re: OODT 0.3 branch
>>
>> If you are saying that FM can handle multiple connections at one time,
>>
>> Yep, I'm saying that it can.
>>
>> then multiple crawlers pointing to same FM should increase performance
>> significantly.
>>
>> Well, that really depends, to be honest. It sounds like you guys are
>> potentially hitting an IO bottleneck in data transfer? What file sizes
>> are you transferring? If you are IO bound on the data transfer part,
>> the product isn't fully ingested until:
>>
>> 1. Its entry is added to the catalog
>> 2. The data transfer finishes
>>
>> Are you checking the FM for status along the way? Also realize that the
>> FM will never be faster than the file system, so if it takes the file
>> system X minutes to transfer a file F1, Y to transfer F2, and Z to
>> transfer F3, then you still have to wait at least max(X,Y,Z) for the 3
>> ingestions to complete, regardless.
>>
>> But that's not what we saw in our tests.
>>
>> For example,
>> I saw barely 2 minutes performance difference between 2FM-6CR and
>> 3FM-6CR.
>>
>> 1) 2 hours 6 minutes to process 262G (1FM 3CR - 3CR to 1FM)
>> 2) 1 hour 58 minutes to process 262G (1FM 6CR - 6CR to 1FM)
>> 3) 1 hour 39 minutes to process 262G (2FM 6CR - 3CR to 1FM)
>> 4) 1 hour 39 minutes to process 262G (2FM 9CR - 4+CR to 1FM)
>> 5) 1 hour 37 minutes to process 262G (3FM 9CR - 3CR to 1FM)
>> 6) 2 hours to process 262G (3FM 20CR - 6+CR to 1FM)
>> 7) 28 minutes to process 262G (6FM 9CR - 1+CR to 1FM) => This is my
>>    latest test and this is a good number.
>>
>> What would be interesting is simply looking at how long it takes to cp
>> the files (which I bet is what's happening) versus mv'ing the files by
>> hand. If mv is faster, I'd:
>>
>> 1. Implement a Data Transfer implementation that simply replaces the
>>    calls to FileUtils.copyFile or .moveFile with system calls (see
>>    ExecHelper from oodt-commons) to UNIX equivalents.
>> 2. Plug that data transfer in to your crawler invocations via the cmd
>>    line.
>>
>> HTH!
>>
>> Cheers,
>> Chris
>>
>>
>> From: <Mattmann>, Chris A <[email protected]>
>> Date: Wednesday, December 12, 2012 2:51 PM
>> To: "Mistry, Chintu (GSFC-586.0)[COLUMBUS TECHNOLOGIES AND SERVICES INC]" <[email protected]>, "[email protected]" <[email protected]>
>> Subject: Re: OODT 0.3 branch
>>
>> Hey Chintu,
>>
>> From: <Mistry>, "Chintu [COLUMBUS TECHNOLOGIES AND SERVICES INC] (GSFC-586.0)" <[email protected]>
>> Date: Tuesday, December 11, 2012 2:41 PM
>> To: jpluser <[email protected]>, "[email protected]" <[email protected]>
>> Subject: Re: OODT 0.3 branch
>>
>> Answers inline below.
>>
>> ---snip
>>
>> Gotcha, so you are using different product types.
>> So, each crawler is crawling various product types in each one of the
>> staging area dirs, which looks like, e.g.,
>>
>> /STAGING_AREA_BASE
>>   /dir1 1st crawler
>>     - file1 of product type 1
>>     - file2 of product type 3
>>
>>   /dir2 2nd crawler
>>     - file3 of product type 3
>>
>>   /dir3 3rd crawler
>>     - file4 of product type 2
>>
>> Is that what the staging area looks like? - YES
>>
>> And then your FM is ingesting all 3 product types (I just picked 3
>> arbitrarily; it could have been N) into:
>>
>> ARCHIVE_BASE/{ProductTypeName}/{YYYYMMDD}
>>
>> Correct? - YES
>>
>> If so, I would imagine that using FM1, FM2, and FM3 would actually
>> speed up the ingestion process compared to just using 1 FM with 1, 2,
>> or 3 crawlers all talking to it.
>>
>> Let me ask a few more questions:
>>
>> Do you see, e.g., in the above example that file4 is ingested before
>> file2? What about file3 before file2? If not, there is something wiggy
>> going on.
>> - I have not checked that. I guess I can check that. Can FM handle
>> multiple connections at the same time?
>>
>>
>> Yep, FM can handle multiple connections at one time up to a limit (I
>> think hard defaulted to ~100-200 by the underlying XMLRPC 2.1 library).
>> We're using an old library currently but have a goal to upgrade to the
>> latest version, where I think this # is configurable.
>>
>> Cheers,
>> Chris
>>
>>
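
As a rough aid for comparing the seven timings quoted in this thread, the sketch below converts each one into an average throughput. It is not part of OODT: the function and variable names are made up for illustration, and it assumes "262G" means 262 GB with 1 GB = 1024 MB, which the thread does not state.

```python
# Rough average throughput for the seven ingest runs quoted in this thread.
# Assumption: "262G" is 262 GB with 1 GB = 1024 MB. All names are illustrative.

DATA_MB = 262 * 1024

# (configuration as quoted in the thread, elapsed minutes)
RUNS = [
    ("1FM 3CR", 126),   # 2 hours 6 minutes
    ("1FM 6CR", 118),   # 1 hour 58 minutes
    ("2FM 6CR", 99),    # 1 hour 39 minutes
    ("2FM 9CR", 99),    # 1 hour 39 minutes
    ("3FM 9CR", 97),    # 1 hour 37 minutes
    ("3FM 20CR", 120),  # 2 hours
    ("6FM 9CR", 28),    # 28 minutes
]


def throughput_mb_per_s(minutes, data_mb=DATA_MB):
    """Average MB/s when data_mb megabytes are processed in `minutes`."""
    return data_mb / (minutes * 60)


def summarize(runs=RUNS):
    """Return (config, minutes, MB/s) tuples in the order given."""
    return [(cfg, mins, throughput_mb_per_s(mins)) for cfg, mins in runs]


if __name__ == "__main__":
    for cfg, mins, rate in summarize():
        print(f"{cfg:>9}: {mins:4d} min  {rate:6.1f} MB/s")
```

Under those assumptions the first run averages roughly 35 MB/s and the 6FM run roughly 160 MB/s. Timing a plain cp or mv of the same files on the same file system, as suggested above, would give a figure to compare these against.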
