Hi Aditya,
Apologies for the delay on this one :( Thank you for your patience. Please see my inline responses.
On Tue, Feb 17, 2015 at 12:31 AM, Aditya Dhulipala <[email protected]> wrote:

> Hi Lewis,
>
> I've been reading up on the doc you provided earlier.

Great

> I've made some progress. I've looked into the filemgr component and run a
> few commands to ingest files etc. I understand how it works now.

Great

> About the potential workflow -- (This is just my initial understanding. I
> could be wrong about this, please correct me)
> I think I have to rewrite the entire component to conform to the avro style
> specification. So this means, I need to define the scheme for all the files
> inside filemanger/structs -- Product.java, ProductPage.java etc.

Yes, this is correct. The main data structures are documented in Avro specification format as per the patch I attached to OODT-685
https://issues.apache.org/jira/browse/OODT-658
Please check them out. There is an issue here, as the data structures in filemgr depend upon additional data structures, namely Metadata, which is contained within the OODT metadata package.

> I should define the schema for each of these similar to that specified for
> "User" on this link -
>
> http://avro.apache.org/docs/current/gettingstartedjava.html#Defining+a+schema

Absolutely correct. Please see OODT-685

> Currently I think this piece of code (Product.java) constructs an xml file
> for each product and so that the rpcClient can send it over the xml-rcp
> interface to the filemgr server.

Yes

> This project aims to redefine this process
> to send the data as a binary encoding (for smaller size, and thus smaller
> latency) by using the avro protocol.

Yes, this is correct. It reduces the amount of data sent over the wire and also provides a more flexible model for reading data which has been written by a particular writer. Avro supports schema evolution as well, meaning that data does not need to be static in nature if we consider it from the Avro point of view. This is highly advantageous from a data archival and interoperability view.
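To make the size argument a bit more concrete, here is a rough sketch in Python. It is not OODT code and it does not use the Avro library. It hand-encodes one record following the Avro binary encoding rules (a long is zigzag-encoded and written as a variable-length integer, a string is its byte length followed by UTF-8 bytes, a record is its fields in schema order) and compares the result with the same data marshalled as an XML-RPC call by Python's standard `xmlrpc.client`. The cut-down Product schema, the sample values and the `ingestProduct` method name are all made up for illustration.

```python
import xmlrpc.client

# Hypothetical, cut-down Product schema. The field names are assumptions for
# illustration only and are NOT taken from the OODT-685 patch.
PRODUCT_SCHEMA = {
    "type": "record",
    "name": "Product",
    "fields": [
        {"name": "productId", "type": "string"},
        {"name": "productName", "type": "string"},
        {"name": "fileSize", "type": "long"},
    ],
}


def encode_long(n: int) -> bytes:
    """Encode a 64-bit integer as an Avro long (zigzag, then base-128 varint)."""
    z = (n << 1) ^ (n >> 63)  # zigzag: small magnitudes become small unsigned values
    out = bytearray()
    while z & ~0x7F:  # more than 7 bits left: emit 7 bits and set the continuation bit
        out.append((z & 0x7F) | 0x80)
        z >>= 7
    out.append(z)
    return bytes(out)


def encode_string(s: str) -> bytes:
    """Encode a string as an Avro string: byte length as a long, then UTF-8 bytes."""
    data = s.encode("utf-8")
    return encode_long(len(data)) + data


def encode_record(schema: dict, record: dict) -> bytes:
    """Concatenate the encoded fields in schema order; no field names are written."""
    encoders = {"string": encode_string, "long": encode_long}
    return b"".join(
        encoders[field["type"]](record[field["name"]]) for field in schema["fields"]
    )


def xmlrpc_payload(record: dict) -> bytes:
    """Marshal the same record as an XML-RPC request body.

    "ingestProduct" is a made-up method name, not the real filemgr call.
    """
    return xmlrpc.client.dumps((record,), methodname="ingestProduct").encode("utf-8")


# Made-up sample values.
SAMPLE = {"productId": "abc-123", "productName": "sample.dat", "fileSize": 1048576}

avro_bytes = encode_record(PRODUCT_SCHEMA, SAMPLE)
xml_bytes = xmlrpc_payload(SAMPLE)
print(len(avro_bytes), "bytes as Avro binary vs", len(xml_bytes), "bytes as XML-RPC")
```

Note that the binary form carries no field names at all, which is why a reader always needs the writer's schema to decode it. In the actual project the Avro library and its generated classes would do this work; the sketch is only meant to show where the size saving comes from.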

> And then I should invoke the avro code generation tools from within
> org.apache...system.XmlRpcFileManagerClient (probably have to rewrite this
> module to fit Avro client specification as well)

... probably yes. I would imagine that by the time this project is finished, there will be absolutely no references to XML anywhere. It will be entirely replaced by Avro schemas (JSON).

> I should also make the XmlRpcFileManger (server) fit to the avro specific
> implementation of the server interface.

Yes, that is correct.

> I think this has to be repeated for all the components within oodt
> (workflow manager etc)

Absolutely. All key services, e.g. FileMgr, Workflow and Resource.

> I also have some questions:-
>
> 1. Is there any specific reason for picking Avro over Thrift or Protocol
> Buffers?

Please read up on some of Martin Kleppmann's blogs and commentary over the years on this topic
http://martin.kleppmann.com/2012/12/05/schema-evolution-in-avro-protocol-buffers-thrift.html
He did a bunch of work on Avro whilst @LinkedIn and it will really help you to read through some of his work.

> 2. I also came across this answer on quora on Avro vs. XML-RPC
>
> http://www.quora.com/What-merits-does-Avro-RPC-have-over-XML-RPC/answer/Ted-Dunning-1?__snids__=959769040&__nsrc__=1&__filter__=all
>
> The author talks about another binary format - Simple Binary Encoding. And
> recommends using protocol buffers for their wide use and documentation. Can
> you share your thoughts about this?

I can, yes.
- Protocol Buffers is described as Google's interchange format. Does this not sound a bit limiting? What happens if you want to change some of the code to fit into OODT? Are you going to fork the project and maintain your own Protocol Buffers implementation?
- @Apache there is a saying: EAT YOUR OWN DOG FOOD. I would much rather we implement a well-founded Apache project, e.g. Avro, over Protocol Buffers any day of the week. Avro is also widely used.
It also has a pretty excellent specification document which, as you've already seen, has enabled you to understand schema design.
...

> I'd also like to run some more examples of the filemgr client/server. That
> way I can run some commands like these
>
> https://cwiki.apache.org/confluence/display/OODT/Exploring+the+OODT+File+Manager+XML-RPC+Interface
> and understand the overhead caused by xml-rpc or get a sense of what the
> latency of using xml-rcp is.

My main justification for moving towards a replacement for XML-RPC in OODT is multi-faceted:
- the library is dated,
- the plethora of XML in OODT is cumbersome,
- none of the XML is accompanied by XSD,
- Avro has advanced significantly over the years and I am more familiar with it than I am with other data serialization frameworks out there. It defines the Protocol layer, which is a natural replacement for XML-RPC,
- the Google Summer of Code project we are describing here is carving the way for a complete Avro-RPC powered REST API for each OODT service. This is a HUGE game changer for invoking remote OODT services.

> Can you also share examples of filemgr servers
> running in the real-world that I could query or use?

Most of the servers I am aware of that are running are on VPNs and internal, secure networks, so the short answer is no. This is something which we would get established once you were brought on as the GSoC student for this project, I would think.

> Any other comments/suggestions are welcome! :)

I would state that it would be really nice for you to put some of this correspondence down into a proposal of sorts. You will require a working proposal when you apply to Google. Also, please feel free, if you have time, to pick up some issues on the OODT Jira tracker. This will go a LONG way to us backing you as the preferred GSoC applicant.

Thank you
Lewis
