Hi Keith,

Thanks for contacting us. Yes, this is precisely the type of thing that OODT can help you with.

As a start, I would recommend reading this guide, which shows you how to use the algorithm wrapper, CAS-PGE. You can chain several of these wrappers into a workflow to build out your production pipeline: https://cwiki.apache.org/confluence/display/OODT/CAS-PGE+Learn+by+Example
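To make that concrete, here is a rough sketch of the kind of CAS-PGE config you would write for your GPU step. The binary path, the metadata keys ([SkyDirRA], [SkyDirDec]), and the output regex are placeholders I invented for your use case; the element layout follows the Learn by Example guide above, so double-check it against the OODT version you install:

  <?xml version="1.0" encoding="UTF-8"?>
  <pgeConfig>
    <!-- Run the wrapped algorithm. The [Key] values are filled in
         dynamically from the metadata that flows into the task. -->
    <exe dir="[JobDir]" shellType="/bin/sh">
      <cmd>/path/to/gpu_search --ra [SkyDirRA] --dec [SkyDirDec] [InputFile]</cmd>
    </exe>

    <!-- Pick up the products the algorithm wrote so the file
         manager can ingest them, metadata and all. -->
    <output>
      <dir path="[JobDir]" createBeforeExe="true">
        <files regExp=".*\.cand"/>
      </dir>
    </output>
  </pgeConfig>

Each config like this becomes one task in your pipeline, and the workflow manager strings the tasks together.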
In addition to the above guide, I would start by installing OODT RADiX, the quick installer: https://cwiki.apache.org/confluence/display/OODT/RADiX+Powered+By+OODT

Once RADiX is installed, edit your CAS-PGE algorithm wrappers, write some config files, and then test out your production pipeline. If you run into trouble with CAS-PGE, here's an FAQ: https://cwiki.apache.org/confluence/display/OODT/CAS-PGE+Help+and+Documentation

If you want to understand more about how metadata flows through the system, you can check this out: https://cwiki.apache.org/confluence/display/OODT/Understanding+the+flow+of+Metadata+during+PGE+based+Processing and this: https://cwiki.apache.org/confluence/display/OODT/Understanding+CAS-PGE+Metadata+Precendence
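On the metadata flow, it may help to see what actually moves around: metadata in CAS is just key/multi-value pairs, usually serialized as a small XML file alongside the product. For one of your raw data granules it might look roughly like this (the keys are invented; you would emit whatever your extractor pulls out of your custom raw format):

  <cas:metadata xmlns:cas="http://oodt.jpl.nasa.gov/1.0/cas">
    <keyval>
      <key>ProductType</key>
      <val>RawVoltageDump</val>
    </keyval>
    <keyval>
      <key>SkyDirRA</key>
      <val>187.2792</val>
    </keyval>
    <keyval>
      <key>SkyDirDec</key>
      <val>2.0525</val>
    </keyval>
  </cas:metadata>

The file manager indexes these pairs at ingest time, which covers your step 1, and CAS-PGE reads them back to fill in algorithm parameters, which covers your step 2.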
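One level up, the workflow manager policy is where you chain the wrapped tasks into the pipeline you described: ingest the raw metadata, run the GPU code, bbcp the raw data off-site, then clean up. The workflow and task IDs below are placeholders I made up; the wiring follows the standard workflow policy layout, so verify it against your install:

  <cas:workflows xmlns:cas="http://oodt.jpl.nasa.gov/2.0/cas">
    <workflow id="urn:askap:RawDataPipeline" name="RawDataPipeline">
      <tasks>
        <!-- One task per stage; each references a wrapped executable. -->
        <task id="urn:askap:IngestRawMetadata"/>
        <task id="urn:askap:RunGpuSearch"/>
        <task id="urn:askap:BbcpRawOffsite"/>
        <task id="urn:askap:DeleteRawData"/>
      </tasks>
    </workflow>
  </cas:workflows>

The bbcp copy and the delete can each be a CAS-PGE wrapper whose command is just the shell invocation, so a long-running transfer is no different from any other task in the pipeline.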
Finally, there are two examples of full-up OODT pipelines/deployments. The first is DRAT, which does large-scale code license analysis via OODT MapReduce (there is a paper in the GitHub repo you can check out): http://github.com/chrismattmann/drat/ The second, Big Translate, a large-scale MapReduce machine translation pipeline, is here: http://github.com/chrismattmann/bigtranslate/

If we can help more, let us know.

Cheers,
Chris

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Principal Data Scientist, Engineering Administrative Office (3010)
Manager, NSF & Open Source Projects Formulation and Development Offices (8212)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 180-503E, Mailstop: 180-503
Email: [email protected]
WWW: http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Director, Information Retrieval and Data Science Group (IRDS)
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
WWW: http://irds.usc.edu/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

On 4/3/17, 6:36 PM, "Keith Bannister" <[email protected]> wrote:

Hi,

I'm trying to work out whether OODT is the right framework for me.

I have a radio astronomy application. Data rate is roughly 12 TB/day. Data format is a custom one with all sorts of metadata flying around (including sky direction in lat/long coordinates). The raw data is pretty huge, and I can't store it on an OODT machine. The big disk I have access to won't run OODT.

Basically I want to:

1. Save the metadata of the raw data into an index somewhere.
2. Run some GPU codes over the raw data. The GPU code parameters should be set based on the metadata.
3. Save the GPU results in an archive, with even more metadata.
4. Copy the raw data to a remote disk with a long-running bbcp task.
5. Delete the raw data, but keep the GPU results and all the metadata.

I'm having trouble finding the right documentation that describes how I can do this. Can you give me a top-level page? (I've looked at the wiki, but it's a bit tricky to work out where to start.)

K

--
KEITH BANNISTER | Principal Research Engineer
CSIRO Astronomy and Space Science
T +61 2 9372 4295
E [email protected]