Thanks Hyunsik! I wonder if we can make a connection between Tajo and Gora on this project. Maybe we can generate a Gora-based front-end to Tajo?
I'm CC'ing the Gora folks here for thoughts. Great roadmap!

Cheers,
Chris

On 3/26/13 2:28 AM, "Apache Wiki" <[email protected]> wrote:

>Dear Wiki user,
>
>You have subscribed to a wiki page or wiki category on "Tajo Wiki" for
>change notification.
>
>The "Roadmap" page has been changed by HyunsikChoi:
>http://wiki.apache.org/tajo/Roadmap
>
>Comment:
>Moved the roadmap from the github wiki.
>
>New page:
>= Roadmap =
>
>== Milestone ==
> * 0.2 - first release as an incubating project, focused on ASF compliance
> * 0.3 - a more stable API, more robust features, and a rudimentary
>cost-based optimizer
> * 0.4 - more SQL support and an improved cost-based optimizer
> * 0.5 - a native columnar execution engine
>
>== Long Term Plan ==
> * Integration with the Hadoop ecosystem
>  * The Tajo catalog needs to support HCatalog or be compatible with the
>Hive metastore.
> * The native columnar execution engine
> * Cost-based optimization, including a rewrite rule engine and various
>rewrite rules
>
>== Short/Mid Term Plan ==
> * Improvement of the DAG framework
>  * Query is both an FSM and a DAG representation.
>  * It would be good to separate Query into an FSM part and a DAG part.
>  * We need an easier interface for editing and building DAGs.
> * RCFile
>  * In the current implementation, Tajo's RCFile is not compatible with
>Hive's because it uses Datum to (de)serialize data. So, we will add an
>RCFile wrapper class compatible with Hive's files.
> * ORCFile
>  * It looks promising. We need to port ORCFile.
> * Trevni
>  * TrevniScanner works well in most cases. However, it doesn't support
>null values. We need to handle that.
> * Hadoop security in tajo-rpc
>  * tajo-rpc does not support Hadoop security. Since Tajo will be part
>of the Hadoop ecosystem, we need to apply Hadoop security to tajo-rpc.
> * Intermediate data format
>  * As I mentioned above, Tajo uses CSV as the intermediate data
>format. It may cause CPU overhead and is relatively large when
>transmitted over the network.
>We need to change it.
> * JDBC/ODBC drivers
>  * Tajo is a relational DW system. With such connectors, it could be
>easily integrated with existing BI and OLAP tools.
> * RESTful API
>  * It's very useful for web-based applications.
> * Proper resource allocation for SubQuery (i.e., Execution Block in PPT)
>  * SubQuery is one step of a multi-step query. For each subquery,
>QueryMaster launches TaskRunners via Yarn, and the launched TaskRunners
>are reused within a subquery.
>  * Now, QueryMaster assigns a fixed-size resource (2G memory) to
>subqueries regardless of the resources they actually need. We need to
>improve this to allocate proper resources to each subquery. For
>example, QueryMaster could assign 1G to a subquery that only scans and
>2G to another subquery that includes joins.
> * Error handling in TajoCli
>  * TajoCli is a command line interface that uses Jline2. However, its
>error handling is awful. It frequently halts when trivial exceptions
>occur.
> * SQL data types
>  * Currently, Tajo provides data types (i.e., byte, bool, int, long,
>float, double, bytes, and string) based on Java primitive types. Tajo
>should support the SQL standard data types.
> * Local mode
>  * Queries are always executed in distributed mode. In other words,
>Tajo always uses Yarn. However, this is inconvenient for debugging and
>inefficient on a single machine. We need to implement a local mode.
> * Parallel launch of containers
>  * Currently, node containers are launched sequentially (see
>TaskRunnerLauncherImpl.java). This looks very inefficient. We can
>improve it by using an ExecutorService.
> * Output commit
>  * In some cases, Tajo is fault tolerant. That requires an output
>commit mechanism. However, Tajo does not support one yet, and we need
>this feature.
> * Broadcast join and the Limit operator
>  * As I mentioned before, they were disabled after the Yarn port. We
>should re-enable them.
> * HBaseScanner/Appender
>  * HBase will be a great storage backend for Tajo.
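On the parallel container launch item: a minimal sketch of the ExecutorService approach the roadmap suggests could look like the following. This is only an illustration, not Tajo code; `containerIds` and the string returned by the task stand in for whatever launch RPC TaskRunnerLauncherImpl actually performs.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ParallelLauncher {
    // Submit one launch task per container to a fixed-size pool instead of
    // launching them one-by-one in a loop.
    static List<String> launchAll(List<String> containerIds, int poolSize)
            throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(poolSize);
        List<Future<String>> futures = new ArrayList<>();
        for (String id : containerIds) {
            futures.add(pool.submit(() -> {
                // Placeholder for the real per-container launch call.
                return "launched-" + id;
            }));
        }
        // Collect results in submission order, waiting for each launch.
        List<String> results = new ArrayList<>();
        for (Future<String> f : futures) {
            results.add(f.get());
        }
        pool.shutdown();
        return results;
    }
}
```

The pool bounds how many launches run concurrently, and collecting the `Future`s afterward preserves the sequential version's "all containers launched before proceeding" guarantee.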
