[ https://issues.apache.org/jira/browse/PIG-1314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13237884#comment-13237884 ]
Zhijie Shen commented on PIG-1314:
----------------------------------

Hi folks,

Below is my proposal draft. Any comments are welcome :-)

==

Proposal Title: Adding the Datetime Type as a Primitive for Pig

Student Name: Zhijie Shen

Student E-mail: zjshe...@gmail.com

Organization/Project: Apache Software Foundation - Pig

Assigned Mentor: Daniel Dai / Russell Jurney

Proposal Abstract:

Apache Pig is a platform for analyzing large data sets based on Hadoop. Currently Pig does not support a primitive datetime type [1], which is a desired feature. In this proposal, I explain my plan to implement the primitive datetime type, including the details of my solution and schedule. Additionally, I briefly introduce my background and my motivation for applying to GSoC'12.

Detailed Description:

1. Understanding of the Project

1.1 What is Apache Pig?

Apache Pig is a platform for analyzing large data sets. Notably, at Yahoo! 40% of all Hadoop jobs are run with Pig [5]. Pig has its own dataflow language, named Pig Latin, which expresses map/reduce jobs step by step and offers relational primitives such as LOAD, FOREACH, GROUP, FILTER and JOIN. Pig provides many built-in functions, but also allows users to write their own user-defined functions (UDFs) for particular purposes. There are further benefits: Pig can operate on plain files directly without any schema information; it has a flexible, nested data model, which is more compatible with that of major programming languages; and it provides a debugging environment.

1.2 Why is a primitive datetime type required?

Datetime is a conventional data type in many database management systems as well as programming languages. Within the Hadoop ecosystem, Hive, which is an analog of Pig, also supports a primitive datetime type (a timestamp, actually). In contrast, Pig does not fully support this type. Currently, users can only use the string type for datetime data, and rely on UDFs that take datetime strings.
However, Pig is primarily used to parse log data, and most log data has datetime attributes. Consequently, it is desirable for Pig to support datetime as a primitive type. By doing so, we can expect the following benefits: a more compact serialized format, support for the conventional operators (+/-/==/!=/</>), a dedicated, faster comparator, sortability, fewer runtime conversions from strings, and freeing users from having to choose the input datetime string format.

2. Roadmap of Implementing the New Feature

2.1 To Do List

2.1.1 Adding Support in Antlr Parser

Pig Latin lets users assign data types explicitly, so the parser must recognize the “datetime” keyword and some constants, such as “now()” and “today()”. The related syntax needs to be added to five Antlr grammar files: AliasMasker.g, AstPrinter.g, AstValidator.g, LogicalPlanGenerator.g and QueryParser.g.

2.1.2 Adding Datetime as a Primitive

The datetime type should be added to the DataType class, and the basic conversions between it and the other data types need to be defined. So far, the internal data structure has relied on the Joda DateTime type, which is more powerful than java.util.Date but much easier to use than java.util.Calendar. Hence it is wise to keep this convention. Note also that implicit type casts from/to the datetime type must not be allowed. I also need to change the LoadCaster and StoreCaster interfaces to include the bytesToDateTime/toBytes(DateTime) methods, and add the implementations to the classes that implement these two interfaces. In addition, I need to overload the +/-/==/!=/</> operators for the datetime type, mapping them to some builtin EvalFuncs. The TypeCheckingExpVisitor class needs to be modified as well to support validation of the datetime type. One important issue is that, in my previous experience, the data type related code in Pig is widely spread, so I need to make sure that all the related parts are updated.
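
As a rough illustration of the load/store casts, the compact serialized format and the dedicated comparator mentioned above, a minimal sketch follows. It is not Pig code. The class name DateTimeCastSketch and the helpers toCompact, fromCompact and compare are hypothetical; only the method names bytesToDateTime and toBytes come from the proposal. It also assumes ISO-8601 text as the input format and uses java.time from the JDK instead of Joda so that it runs without extra dependencies.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.time.Instant;
import java.time.OffsetDateTime;
import java.time.ZoneOffset;

// Hypothetical sketch, not part of Pig. java.time stands in for Joda here.
final class DateTimeCastSketch {

    // Load path: parse ISO-8601 text bytes (assumed input format) into a datetime.
    static OffsetDateTime bytesToDateTime(byte[] b) {
        return OffsetDateTime.parse(new String(b, StandardCharsets.UTF_8));
    }

    // Store path: render the datetime back to ISO-8601 text bytes.
    static byte[] toBytes(OffsetDateTime dt) {
        return dt.toString().getBytes(StandardCharsets.UTF_8);
    }

    // Compact binary form: 8 bytes of epoch milliseconds.
    // Assumption: the original offset is dropped and values are normalized to UTC.
    static byte[] toCompact(OffsetDateTime dt) {
        return ByteBuffer.allocate(Long.BYTES)
                .putLong(dt.toInstant().toEpochMilli())
                .array();
    }

    // Inverse of toCompact; always yields a UTC datetime.
    static OffsetDateTime fromCompact(byte[] b) {
        long millis = ByteBuffer.wrap(b).getLong();
        return Instant.ofEpochMilli(millis).atOffset(ZoneOffset.UTC);
    }

    // Comparison on the underlying instant; this is the kind of function
    // that the <, >, == and != operators could be mapped to.
    static int compare(OffsetDateTime a, OffsetDateTime b) {
        return a.toInstant().compareTo(b.toInstant());
    }

    public static void main(String[] args) {
        byte[] text = "2012-03-25T10:15:30Z".getBytes(StandardCharsets.UTF_8);
        OffsetDateTime dt = bytesToDateTime(text);
        System.out.println("text bytes:    " + text.length);
        System.out.println("compact bytes: " + toCompact(dt).length);
        System.out.println("round trip ok: " + fromCompact(toCompact(dt)).equals(dt));
        System.out.println("earlier than +1h: " + (compare(dt, dt.plusHours(1)) < 0));
    }
}
```

The fixed-width long is what makes the representation compact and cheap to compare. Whether the time zone should be kept alongside it is a design question that this sketch leaves open.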
2.1.3 Refactoring of the Datetime Related UDFs

Thanks to Russell Jurney for having implemented a number of useful datetime related UDFs, which can be utilized for the primitive datetime type as well. Some of the UDF classes located in the “org.apache.pig.piggybank.evaluation.datetime” package under the “contrib” folder need to be moved to the “org.apache.pig.builtin” package under the “src” folder. Below are the related UDFs:

int DiffDate(DateTime d1, DateTime d2)
int YearsBetween(DateTime d1, DateTime d2)
int MonthsBetween(DateTime d1, DateTime d2)
int DaysBetween(DateTime d1, DateTime d2)
int HoursBetween(DateTime d1, DateTime d2)
int MinutesBetween(DateTime d1, DateTime d2)
int SecondsBetween(DateTime d1, DateTime d2)
int GetYear(DateTime d1)
int GetMonth(DateTime d1)
int GetDate(DateTime d1)
int GetHour(DateTime d1)
int GetMinute(DateTime d1)
int GetSecond(DateTime d1)
DateTime DateAdd(DateTime d1)
String ToString(DateTime d, String format) (probably to be renamed DateTimeFormat)

The remaining UDFs can be eliminated, while their logic can be reused in the primitive type conversion described in the previous section. Below are the UDFs of this kind:

DateTime ToDate(String s)
DateTime ToDate(String s, String format)
DateTime ToDate(String s, String format, String timezone)
DateTime toDate(long t)
String ToString(DateTime d)
long ToUnixTime(DateTime d)

The following additional UDFs are probably also required; I need to discuss them with the community:

DateTime Now()
DateTime Today()
bool IsDateTime(String s)

2.1.4 Test Cases

A large number of test cases are required to test the parser, the datetime operations and conversions, and loading from / storing into secondary storage.

2.1.5 Documentation

A user manual is required to describe how to use the datetime primitive, such as the input format and the supported built-in functions.

2.2 Project Schedule

During the summer, I will not have much workload apart from writing my Ph.D. thesis.
Hence it is possible for me to spend around 40 hours per week on this project. The concrete schedule is summarized as follows:

Present - May 20 (before official start of summer of code): Reading the related code in detail, and keeping in touch with the community to clarify some issues, such as the necessary built-in UDFs and the rules of data conversion.

May 21 - Jun 3 (two weeks): Adding datetime to the primitive type list, and completing the parsing of the datetime keyword and constants, such that a string representing a datetime can be recognized in Pig Latin scripts.

Jun 4 - Jun 24 (three weeks): Implementing type conversion (from/to string) and the loading/storing cast functionality. After this step, data of the datetime type can be correctly read from / stored into secondary storage.

Jun 25 - Jul 8 (two weeks until mid-term evaluation): Completing the remaining part of the type conversion (e.g., between the datetime type and the long type), dealing with issues that have not been foreseen yet, and preparing for the mid-term evaluation.

Jul 9 - Jul 29 (three weeks): Refactoring the datetime related UDFs, adding the newly required UDFs, and overloading the primitive operators, such that all the defined operations on datetime values are supported after this step.

Jul 30 - Aug 5 (one week): Writing the test cases to systematically verify the code, and fixing any bugs found. After this step, the coding part is nearly done.

Aug 6 - Aug 12 (one week until final evaluation): Writing the user manual to show how to work with the datetime type, and preparing for the final evaluation.

Additional Information:

I am a Ph.D. student at the National University of Singapore. My research topics are large scale multimedia systems, geo-referenced video systems and P2P video streaming. In addition to research, I love programming and have long-term experience in several languages, including Java.
Moreover, I am quite interested in distributed systems and big data, and have acquired solid background knowledge. I took the course "Parallel and Distributed Databases", drafted a survey of cloud storage systems (including Pig) [4] and received an A+. Notably, I am an open source advocate, and have contributed to it to some extent. Last year, I participated in GSoC with a Pig project. I successfully implemented the nested cross feature [2], and I went beyond my proposed task by contributing one more patch that adds the primitive boolean type [3], which is somewhat similar to the task proposed for this year's GSoC. Therefore, I am quite familiar with this task and confident of completing it on time. Last but not least, I enjoy my long-term participation in the Pig community, and am willing to keep contributing to it.

Reference:

[1] https://issues.apache.org/jira/browse/PIG-1314
[2] https://issues.apache.org/jira/browse/PIG-1916
[3] https://issues.apache.org/jira/browse/PIG-1429
[4] http://www.comp.nus.edu.sg/~z-shen/survey.pdf
[5] http://wiki.apache.org/pig/OldFrontPage

> Add DateTime Support to Pig
> ---------------------------
>
> Key: PIG-1314
> URL: https://issues.apache.org/jira/browse/PIG-1314
> Project: Pig
> Issue Type: Bug
> Components: data
> Affects Versions: 0.7.0
> Reporter: Russell Jurney
> Assignee: Russell Jurney
> Labels: gsoc2012
> Original Estimate: 672h
> Remaining Estimate: 672h
>
> Hadoop/Pig are primarily used to parse log data, and most logs have a
> timestamp component. Therefore Pig should support dates as a primitive.
> Can someone familiar with adding types to pig comment on how hard this is?
> We're looking at doing this, rather than use UDFs. Is this a patch that
> would be accepted?
> This is a candidate project for Google summer of code 2012.
> More information about the program can be found at
> https://cwiki.apache.org/confluence/display/PIG/GSoc2012

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira