Re: OODT 0.3 branch

Mattmann, Chris A (388J) Tue, 11 Dec 2012 15:26:12 -0800

Hey Chintu,

From: <Mistry>, "Chintu [COLUMBUS TECHNOLOGIES AND SERVICES INC] (GSFC-586.0)" 
<[email protected]>
Date: Tuesday, December 11, 2012 2:41 PM
To: jpluser <[email protected]>, "[email protected]" <[email protected]>
Subject: Re: OODT 0.3 branch

Answers inline below.

We will share information on apache.org at some point, but we are not there yet.

Thanks, OK,  please see inline below:

--
Chintu Mistry
NASA Goddard Space Flight Center
Bldg L40B, Room S776
Office: 240 684 0477
Mobile: 770 310 1047

From: <Mattmann>, Chris A <[email protected]>
Date: Tuesday, December 11, 2012 5:23 PM
To: "Mistry, Chintu (GSFC-586.0)[COLUMBUS TECHNOLOGIES AND SERVICES INC]" 
<[email protected]>, "[email protected]" <[email protected]>
Subject: Re: OODT 0.3 branch

Hey Chintu,

Thanks for reaching out! Replies inline below:

From: <Mistry>, "Chintu [COLUMBUS TECHNOLOGIES AND SERVICES INC] (GSFC-586.0)" 
<[email protected]>
Date: Tuesday, December 11, 2012 1:50 PM
To: "[email protected]" <[email protected]>
Cc: jpluser <[email protected]>
Subject: OODT 0.3 branch

Hi Chris,

We are trying to measure how fast filemanager+crawler performs.

Here is what we are trying to do:

  *   Total data to process : 262GB
  *   3 file managers and 9 crawlers
  *   3 crawlers per file manager, each sending file locations to that file 
manager for processing
  *   We have our own schema running on postgresql database
  *   Custom H5 Extractor using h5dump utility

Cool, this sounds like an awesome test. Would you be willing to share some of 
the info on the OODT wiki?

https://cwiki.apache.org/confluence/display/OODT/Home

Questions:
1) I have tried using FileUtils.copyFile vs FileUtils.moveFile, but I don't see 
any difference in processing time. Both my LandingZone and Archive Area are 
located on the same filesystem (GPFS). It takes roughly 100 minutes to process 
262G of data. Can you shed any light on why we don't see any performance change?

This may have to do with the way that the JDK (what version are you using?) 
implements the actual arraycopy methods, and how the apache commons-io library 
wraps those methods. Let me know what JDK version you're using and we can 
investigate it.

- java version "1.6.0_24"
OpenJDK Runtime Environment (IcedTea6 1.11.5) (rhel-1.50.1.11.5.el6_3-x86_64)
OpenJDK 64-Bit Server VM (build 20.0-b12, mixed mode)

OK thanks. I found this article:

http://stackoverflow.com/questions/300559/move-copy-file-operations-in-java

It doesn't really go into too much detail but the nice thing is that if you 
need a different, or faster DataTransfer, you can always sub-class or implement 
your own that makes a call to e.g., "mv" or "cp" at the UNIX level if you think 
it'll speed it up.
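
A minimal sketch of the "call mv at the UNIX level" idea, written against plain 
JDK 6 APIs. The class name UnixMove and its move method are hypothetical and are 
not part of OODT; the assumption is that a custom DataTransfer would delegate to 
something like this from its transfer method.

```java
import java.io.File;
import java.io.IOException;
import java.io.InputStream;

/** Hypothetical helper (not part of OODT): moves a file by shelling out to "mv". */
final class UnixMove {

    /** Runs "mv src dest" and throws if the command exits with a non-zero status. */
    static void move(File src, File dest) throws IOException, InterruptedException {
        ProcessBuilder pb =
            new ProcessBuilder("mv", src.getAbsolutePath(), dest.getAbsolutePath());
        pb.redirectErrorStream(true); // merge stderr into stdout so one stream is drained
        Process p = pb.start();

        // Drain the output so the child cannot block on a full pipe.
        InputStream in = p.getInputStream();
        byte[] buf = new byte[4096];
        while (in.read(buf) != -1) {
            // output is discarded in this sketch
        }
        in.close();

        int exit = p.waitFor();
        if (exit != 0) {
            throw new IOException("mv exited with status " + exit);
        }
    }

    public static void main(String[] args) throws Exception {
        File src = File.createTempFile("unixmove", ".dat");
        File dest = new File(src.getParentFile(), src.getName() + ".moved");
        move(src, dest);
        System.out.println("moved: " + dest.exists() + ", source gone: " + !src.exists());
        dest.delete();
    }
}
```

Forking a process per file has its own cost, so this is only likely to help for 
large files; it is worth timing against the stock transfer before adopting it.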

Looking at: 
http://commons.apache.org/io/api-release/org/apache/commons/io/FileUtils.html

http://commons.apache.org/io/api-release/org/apache/commons/io/FileUtils.html#copyFile(java.io.File,%20java.io.File)
http://commons.apache.org/io/api-release/org/apache/commons/io/FileUtils.html#moveFile(java.io.File,%20java.io.File)

Note for moveFile:

"When the destination file is on another file system, do a 'copy and delete'."

I wonder how it detects that? I wonder if it always thinks it's on another 
filesystem when using the JDK on GPFS? If so, that might explain why you are 
seeing no difference between copyFile and moveFile.
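
One way to check this directly is to probe whether a plain rename succeeds 
between the two directories. As far as I recall, commons-io's moveFile first 
tries File#renameTo and only falls back to copy and delete when that returns 
false, but treat that as an assumption to verify against the commons-io version 
in use. The class RenameProbe below is a hypothetical sketch, not part of OODT 
or commons-io.

```java
import java.io.File;
import java.io.IOException;

/** Hypothetical probe: does File#renameTo work between two directories? */
final class RenameProbe {

    /**
     * Creates a small scratch file in srcDir and tries to rename it into destDir.
     * Returns true if the plain rename succeeded, i.e. no copy would be needed.
     */
    static boolean renameWorks(File srcDir, File destDir) throws IOException {
        File src = File.createTempFile("rename-probe", ".tmp", srcDir);
        File dest = new File(destDir, src.getName());
        boolean renamed = src.renameTo(dest);
        // Clean up whichever copy of the scratch file is left behind.
        if (renamed) {
            dest.delete();
        } else {
            src.delete();
        }
        return renamed;
    }

    public static void main(String[] args) throws IOException {
        // args[0]: landing zone directory, args[1]: archive directory
        File landing = new File(args[0]);
        File archive = new File(args[1]);
        System.out.println("rename works: " + renameWorks(landing, archive));
    }
}
```

If this prints false for the LandingZone and Archive Area, a move would 
degrade to a full copy, which would be consistent with the identical timings.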

2) The other thing is that I don't see any performance gain between running 
2 FMs and 3 FMs. I thought that I would see some performance gain due to 
concurrency. The same goes for multiple crawlers. I was hoping to see a pretty 
obvious performance change if I increased the number of crawlers. What are your 
thoughts on running things in parallel to increase performance?

How are you situating the additional file managers? Are you having 1 crawler 
ingest to 3? Or is there a 1:1 correspondence between each crawler and FM? And, 
what do you mean by no performance gain? Do you mean that you don't see 3x 
speed in terms of e.g. Product ingestion of met into the catalog? Of file 
transfer speed?

- All 3 FMs are running on one machine. Each crawler instance is crawling a 
different directory. 3 crawlers are connected to the 1st FM, another 3 are 
connected to the second FM, and the last 3 crawlers are connected to the third 
FM. When I say performance difference between 2 and 3 FMs, I mean they take 
exactly the same amount of time to process the same amount of data concurrently.

So, I think the big thing here is to understand how the crawlers work and how 
they march through files. Basically the crawl method returns a current 
directory or file listing depending on how you invoked it (and whether or not 
you called noRecur and/or crawlForDirs which define a particular crawling 
strategy). Once the file and directory snapshot is taken (from 
File#listFiles(java.io.FileFilter)), then those are marched through by the 
Crawler and processed in order. So, whether you are using 1 or 3 
crawlers, parallelizing the crawlers or crawling different directories won't 
have any effect.
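
The snapshot-then-march behavior described above can be sketched as follows. 
This is an illustration only, not the actual OODT crawler code: the class 
SerialCrawlSketch is hypothetical, and the sort is added purely to make the 
order deterministic, since File#listFiles makes no ordering guarantee.

```java
import java.io.File;
import java.io.FileFilter;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical sketch of a crawler that handles one directory snapshot serially. */
final class SerialCrawlSketch {

    /** Takes one snapshot of dir and handles its regular files one at a time. */
    static List<String> crawl(File dir) {
        File[] snapshot = dir.listFiles(new FileFilter() {
            public boolean accept(File f) {
                return f.isFile();
            }
        });
        List<String> handled = new ArrayList<String>();
        if (snapshot == null) {
            return handled; // not a directory, or an I/O error occurred
        }
        Arrays.sort(snapshot); // assumption of this sketch: process in path order
        for (File f : snapshot) {
            // A real crawler would extract metadata and ingest here, and the
            // loop would not advance until that file is completely done.
            handled.add(f.getName());
        }
        return handled;
    }

    public static void main(String[] args) {
        System.out.println(crawl(new File(args[0])));
    }
}
```

Because the loop is strictly sequential, the throughput of a single crawler is 
bounded by the per-file ingest time, however many files are in the snapshot.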

I would love to see 3x speed if I run 3FM. I was talking about the whole ingest 
process from start to end for one file, which involves extracting metadata, 
inserting records into database and transferring file to archive location.

Are the 3 crawlers crawling the same staging area concurrently? Or are they 
separated out by buckets? And, which crawler are you using? The 
MetExtractorProductCrawler or the AutoDetectCrawler? Also, what is the 
versioning policy for the FM on a per product basis? Are all products being 
ingested of the same ProductType and ultimately of the same versioner and 
ultimate disk location?

- We are using StdProductCrawler. We don't have a versioning requirement. 
Products are of different ProductTypes. We are trying to process one full orbit 
of data. They all get archived at "ARCHIVE_BASE/{ProductType}/YYYYMMDD" 
location.

Gotcha, so you are using different product types. So, each crawler is crawling 
various product types in each one of the staging area dirs, that looks like 
e.g.,

/STAGING_AREA_BASE
  /dir1 – 1st crawler
   - file1 of product type 1
   - file2 of product type 3

 /dir2 – 2nd crawler
   - file3 of product type 3

 /dir3 – 3rd crawler
   - file4 of product type 2

Is that what the staging area looks like?

And then your FM is ingesting all 3 product types (I just picked 3 arbitrarily; 
it could have been N) into:

ARCHIVE_BASE/{ProductTypeName}/{YYYYMMDD}

Correct?

If so, I would imagine that FM1, FM2 and FM3 would actually speed up the 
ingestion process compared to just using 1 FM with 1, 2 or 3 crawlers all 
talking to it.

Let me ask a few more questions:

Do you see e.g., in the above example that file4 is ingested before file2? What 
about file3 before file2? If not, there is something wiggy going on.
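
To make the ordering question concrete, here is a toy model of the layout 
above: one worker per staging dir, each handling its own files in order, with 
every completion appended to a shared log. All names are hypothetical and the 
"ingest" step is a stand-in; in a real run the completion order would come from 
timestamps in the crawler or FM logs.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

/** Hypothetical model: one serial worker per staging dir, shared completion log. */
final class IngestOrderSketch {

    /** Returns file names in the order in which the workers finished them. */
    static List<String> run(Map<String, List<String>> staging) throws InterruptedException {
        final List<String> finished = Collections.synchronizedList(new ArrayList<String>());
        ExecutorService pool = Executors.newFixedThreadPool(Math.max(1, staging.size()));
        for (final List<String> files : staging.values()) {
            pool.submit(new Runnable() {
                public void run() {
                    for (String f : files) {
                        // stand-in for extract metadata + catalog + transfer
                        finished.add(f);
                    }
                }
            });
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.MINUTES);
        return new ArrayList<String>(finished);
    }

    public static void main(String[] args) throws InterruptedException {
        Map<String, List<String>> staging = new LinkedHashMap<String, List<String>>();
        staging.put("dir1", Arrays.asList("file1", "file2"));
        staging.put("dir2", Arrays.asList("file3"));
        staging.put("dir3", Arrays.asList("file4"));
        System.out.println(run(staging));
    }
}
```

Within one dir the order is fixed (file1 before file2), but across dirs it is 
not. If the real logs always show the dirs finishing strictly one after 
another, the crawlers are probably being serialized somewhere downstream.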

3) Like I said earlier, we are running the crawler to push data to the file 
manager. If I run it that way, then the "data transfer (copy or move)" is 
happening on the crawler side. I cannot find any way to let the file manager 
handle the "data transfer" using one of your runtime options. Please let me 
know if you guys know how to do that.

If you want the FM to handle the transfer you have to use the low level File 
Manager Client and omit the clientTransfer option:

[chipotle:local/filemgr/bin] mattmann% ./filemgr-client
filemgr-client --url <url to xml rpc service> --operation [<operation> [params]]
operations:
--addProductType --typeName <name> --typeDesc <description> --repository <path> 
--versionClass <classname of versioning impl>
--ingestProduct --productName <name> --productStructure <Hierarchical|Flat> 
--productTypeName <name of product type> --metadataFile <file> 
[--clientTransfer --dataTransfer <java class name of data transfer factory>] 
--refs <ref1>...<refn>
--hasProduct --productName <name>
--getProductTypeByName --productTypeName <name>
--getNumProducts --productTypeName <name>
--getFirstPage --productTypeName <name>
--getNextPage --productTypeName <name> --currentPageNum <number>
--getPrevPage --productTypeName <name> --currentPageNum <number>
--getLastPage --productTypeName <name>
--getCurrentTransfer
--getCurrentTransfers
--getProductPctTransferred --productId <id> --productTypeName <name>
--getFilePctTransferred --origRef <uri>

[chipotle:local/filemgr/bin] mattmann%

That is just a CMD line exposure of the underlying FM client Java API which 
lets you do server side transfers on ingest by passing clientTransfer == false 
to this method:

http://oodt.apache.org/components/maven/xref/org/apache/oodt/cas/filemgr/system/XmlRpcFileManagerClient.html#1168
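
As an illustration of the difference, the sketch below assembles the 
filemgr-client argument list for --ingestProduct with and without the 
--clientTransfer option, using only the flags from the usage text above. The 
class IngestArgsSketch and every value passed to it (URL, product name, file 
paths, factory class name) are hypothetical placeholders.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper that builds filemgr-client arguments for --ingestProduct. */
final class IngestArgsSketch {

    /**
     * Pass a data transfer factory class name for a client-side transfer, or
     * null to omit --clientTransfer so that the file manager does the transfer.
     */
    static List<String> ingestArgs(String url, String productName, String productType,
                                   String metFile, String ref, String dataTransferFactory) {
        List<String> args = new ArrayList<String>(Arrays.asList(
            "--url", url,
            "--operation", "--ingestProduct",
            "--productName", productName,
            "--productStructure", "Flat",
            "--productTypeName", productType,
            "--metadataFile", metFile));
        if (dataTransferFactory != null) {
            args.addAll(Arrays.asList("--clientTransfer", "--dataTransfer", dataTransferFactory));
        }
        args.addAll(Arrays.asList("--refs", ref));
        return args;
    }

    public static void main(String[] unused) {
        // Server-side transfer: no --clientTransfer in the argument list.
        System.out.println(ingestArgs("http://localhost:9000", "file1", "Type1",
            "/tmp/file1.met", "file:///staging/dir1/file1", null));
    }
}
```

Calling ingestArgs with a non-null factory name (for example a placeholder such 
as "com.example.MyDataTransferFactory") yields the client-side variant instead.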

- Fair enough. I was hoping to see a cmd line option for this in Crawler 
Launcher. No problem.

Yep we should add it :) Can you please file a JIRA issue here:

http://issues.apache.org/jira/browse/OODT

If you have time, could you throw up a patch for CrawlerLauncher and its action 
beans? Otherwise, I'll take care of it soon.

We have enough processing power to run multiple FMs and crawlers for 
scalability. But for some reason the crawler is not scaling enough.

We'll get it scaling out for ya. Can you please provide answers to the above 
questions and we'll go from there? Thanks!

Thanks!!

Cheers,
Chris
--
Chintu Mistry
NASA Goddard Space Flight Center
Bldg L40B, Room S776
Office: 240 684 0477
Mobile: 770 310 1047
