[ 
https://issues.apache.org/jira/browse/TAJO-2046?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15182463#comment-15182463
 ] 

Jihoon Son commented on TAJO-2046:
----------------------------------

Hi [~mucahid.erenler], thanks for your interest!

Your starting point looks good. I also think we need to add a new sub-module 
like "tajo-storage-kudu" to the tajo-storage module.

Tajo has a concept of Tablespace 
(http://tajo.apache.org/docs/devel/table_management/tablespaces.html) to 
support various types of underlying storage. You can think of a Tablespace as 
the abstract interface between Tajo and the underlying data sources. Each 
tablespace represents the storage type where the data are stored and provides 
an interface to access them (Scanner and Appender in Tajo). 

The goal of this ticket is to add a tablespace for Kudu. Here are the issues 
that I think are mandatory.
* Implement KuduTablespace
** Split generation: Need to consider how we create splits (fragments) for 
distributed processing (Tablespace.getSplits() method)
** Tablespace code: 
https://github.com/apache/tajo/blob/master/tajo-storage/tajo-storage-common/src/main/java/org/apache/tajo/storage/Tablespace.java
* Implement KuduFragment
** The Fragment is similar to the split in MapReduce. It describes which part 
of the data will be processed by each task.
** Fragment code: 
https://github.com/apache/tajo/blob/master/tajo-storage/tajo-storage-common/src/main/java/org/apache/tajo/storage/fragment/Fragment.java
* Implement KuduScanner and KuduAppender
** Split read: Need to consider how we can read the part of the data specified 
in the given fragment.
** Type conversion: Data types and internal representations should be 
converted between Tajo and Kudu.
** Projection push down: Tajo needs to be able to access only the necessary 
columns.
** Scanner code: 
https://github.com/apache/tajo/blob/master/tajo-storage/tajo-storage-common/src/main/java/org/apache/tajo/storage/Scanner.java
** Appender code: 
https://github.com/apache/tajo/blob/master/tajo-storage/tajo-storage-common/src/main/java/org/apache/tajo/storage/Appender.java
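
To make the split generation, split read, and projection push down items above 
a bit more concrete, here is a minimal sketch. It is only an illustration under 
stated assumptions: RangeFragment, getSplits(), scan() and sampleRows() are 
hypothetical stand-ins written for this example, not the real Tajo Fragment, 
Tablespace or Scanner classes, and an in-memory sorted map takes the place of a 
Kudu table. It also assumes that splitting by primary-key range is acceptable; 
a real KuduTablespace may well choose a different unit.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;

// Hypothetical stand-ins for illustration; these are NOT the real Tajo or Kudu classes.
class KuduSplitSketch {

  /** Plays the role of a fragment: the half-open key range [startKey, endKey) one task reads. */
  static final class RangeFragment {
    final String table;
    final int startKey;
    final int endKey;

    RangeFragment(String table, int startKey, int endKey) {
      this.table = table;
      this.startKey = startKey;
      this.endKey = endKey;
    }
  }

  /** Split generation: cut [minKey, maxKey) into at most numSplits contiguous fragments. */
  static List<RangeFragment> getSplits(String table, int minKey, int maxKey, int numSplits) {
    if (numSplits <= 0) {
      throw new IllegalArgumentException("numSplits must be positive");
    }
    List<RangeFragment> splits = new ArrayList<>();
    // Ceiling division so that the last fragment is never larger than the others.
    int step = Math.max(1, (maxKey - minKey + numSplits - 1) / numSplits);
    for (int start = minKey; start < maxKey; start += step) {
      splits.add(new RangeFragment(table, start, Math.min(start + step, maxKey)));
    }
    return splits;
  }

  /** Split read with projection: only rows inside the fragment, only the requested columns. */
  static List<Map<String, Object>> scan(SortedMap<Integer, Map<String, Object>> rowsByKey,
                                        RangeFragment fragment, List<String> projection) {
    List<Map<String, Object>> out = new ArrayList<>();
    // subMap is inclusive of startKey and exclusive of endKey, matching the fragment range.
    for (Map<String, Object> row : rowsByKey.subMap(fragment.startKey, fragment.endKey).values()) {
      Map<String, Object> projected = new LinkedHashMap<>();
      for (String column : projection) {
        projected.put(column, row.get(column));
      }
      out.add(projected);
    }
    return out;
  }

  /** Builds a small in-memory table with columns id, name and score, keyed by id. */
  static SortedMap<Integer, Map<String, Object>> sampleRows(int count) {
    SortedMap<Integer, Map<String, Object>> rows = new TreeMap<>();
    for (int k = 0; k < count; k++) {
      Map<String, Object> row = new LinkedHashMap<>();
      row.put("id", k);
      row.put("name", "user" + k);
      row.put("score", k * 1.5);
      rows.put(k, row);
    }
    return rows;
  }

  public static void main(String[] args) {
    SortedMap<Integer, Map<String, Object>> rows = sampleRows(10);
    for (RangeFragment fragment : getSplits("users", 0, 10, 3)) {
      List<Map<String, Object>> result = scan(rows, fragment, Arrays.asList("id", "name"));
      System.out.println("[" + fragment.startKey + ", " + fragment.endKey + ") -> " + result);
    }
  }
}
```

With keys 0 to 9 and three requested splits, getSplits() yields the ranges 
[0, 4), [4, 8) and [8, 10). Each task then reads only its own range and only 
the id and name columns, so the score column is never materialized.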

The issue below is optional, but would be very helpful for Tajo.
* Filter push down optimization: Since Kudu can process simple predicates, Tajo 
can push them down and read only the data that satisfy those predicates.
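
A rough sketch of the filter push down idea follows. It assumes the filter is a 
conjunction (AND) of simple predicates and that the storage layer can evaluate 
only comparisons of a column against a constant. The Op, Predicate and Plan 
types and the planFilters() method are hypothetical names made up for this 
example, not Tajo or Kudu APIs, and the set of supported operators is an 
assumption.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-ins for illustration; these are NOT the real Tajo or Kudu classes.
class FilterPushDownSketch {

  enum Op { EQ, LT, GT, LIKE }

  /** One conjunct of the WHERE clause: column op constant. */
  static final class Predicate {
    final String column;
    final Op op;
    final Object value;

    Predicate(String column, Op op, Object value) {
      this.column = column;
      this.op = op;
      this.value = value;
    }

    @Override
    public String toString() {
      return column + " " + op + " " + value;
    }
  }

  /** Result of planning: what the storage evaluates and what the engine still has to check. */
  static final class Plan {
    final List<Predicate> pushed = new ArrayList<>();
    final List<Predicate> residual = new ArrayList<>();
  }

  /** Assumption of this sketch: the storage handles only EQ, LT and GT against a constant. */
  static boolean storageCanEvaluate(Predicate p) {
    return p.op == Op.EQ || p.op == Op.LT || p.op == Op.GT;
  }

  /** Splits AND-ed predicates into a pushed-down part and a residual part. */
  static Plan planFilters(List<Predicate> conjuncts) {
    Plan plan = new Plan();
    for (Predicate p : conjuncts) {
      if (storageCanEvaluate(p)) {
        plan.pushed.add(p);     // evaluated by the storage while scanning
      } else {
        plan.residual.add(p);   // evaluated by the query engine after the scan
      }
    }
    return plan;
  }

  public static void main(String[] args) {
    Plan plan = planFilters(Arrays.asList(
        new Predicate("age", Op.GT, 30),
        new Predicate("name", Op.LIKE, "a%"),
        new Predicate("id", Op.EQ, 7)));
    System.out.println("pushed down: " + plan.pushed);
    System.out.println("residual:    " + plan.residual);
  }
}
```

Splitting this way is safe only because the predicates are AND-ed: rows that 
the storage filters out could never have satisfied the whole condition, and the 
remaining predicates are still checked on the rows that come back.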

If you have more questions, please feel free to ask me anytime.

Thanks,
Jihoon

> Support Kudu as one of Tajo's storage 
> --------------------------------------
>
>                 Key: TAJO-2046
>                 URL: https://issues.apache.org/jira/browse/TAJO-2046
>             Project: Tajo
>          Issue Type: New Feature
>          Components: Storage
>            Reporter: Jihoon Son
>              Labels: gsoc, gsoc2016
>
> Kudu (https://github.com/cloudera/kudu) is a newly emerging system for high 
> performance updates and analysis query processing. Supporting Kudu will also 
> give a benefit for Tajo users by simplifying their architecture and 
> decreasing analysis latency.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
