[ https://issues.apache.org/jira/browse/TAJO-472?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13862234#comment-13862234 ]

Jihoon Son edited comment on TAJO-472 at 1/4/14 7:25 AM:
---------------------------------------------------------

Min, thanks for your detailed comments. 
To reduce the cost of data shuffling, the solutions you suggest above are 
either already implemented in Tajo (e.g., replicated join) or planned (e.g., 
the pull transfer model). 
Also, the early data shuffling mentioned in your last comment is definitely 
necessary. As soon as we are ready, we should start the above work. 

Anyway, I'm very happy that we have identified the same problem. I looked 
into your proposal, including the Infobright paper, and have a couple of 
things to discuss. (Actually, I only skimmed the paper, so please forgive me 
if I have misunderstood parts of it.) 

First, in your proposal, cached tables are stored on HDFS partitioned by 
hash. This seems to handle the first join case you described above. However, 
hash partitioning is only a logical layout in Tajo. That is, rows are stored 
in multiple HDFS directories according to their hash values, but the 
underlying data are randomly distributed across HDFS datanodes. Thus, I 
think it is hard to pre-partition the cached tables with the PARTITION BY 
HASH clause. 
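
To make this concrete, here is a minimal sketch (my own illustration using 
the standard HDFS client API; the partition path is hypothetical, not a real 
Tajo layout) that prints which datanodes hold the blocks of the files under 
one hash-partition directory. In my understanding, the reported hosts vary 
block by block:

    import java.util.Arrays;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.BlockLocation;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class PartitionBlockLocations {
      public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        // Hypothetical directory of one hash partition of a cached table.
        Path partitionDir = new Path("/tajo/warehouse/cached_table/key_hash=3");
        for (FileStatus file : fs.listStatus(partitionDir)) {
          for (BlockLocation block :
              fs.getFileBlockLocations(file, 0, file.getLen())) {
            // Hosts differ from block to block, so the logical partition
            // gives no guarantee about physical datanode placement.
            System.out.println(file.getPath().getName()
                + " offset=" + block.getOffset()
                + " hosts=" + Arrays.toString(block.getHosts()));
          }
        }
      }
    }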

Second, you use a columnar file format similar to Infobright's. In this 
format, a table is horizontally partitioned into _row packs_, and each row 
pack is stored as multiple _data packs_, one per column. In other words, 
each column of a row pack is stored in a separate data pack (see the 
beginning of Section 2 of the paper). However, the data packs of a row pack 
might be stored on different datanodes in HDFS. This means that reading a 
row might incur network communication to fetch column values stored on 
multiple datanodes. To handle this problem, RCFile and Trevni break a table 
into multiple row groups and place each row group within a single HDFS 
block. (Actually, multiple row groups can reside in one HDFS block in 
RCFile.) In CIF (1), a split directory (a kind of row pack) is stored as 
multiple files, one per column, and the files of a split directory are 
placed on the same datanode by means of a custom HDFS block placement 
policy. So, I wonder how you handle this problem.
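
For clarity, here is a toy model of the layout I'm describing (my own 
illustration, not the proposal's actual classes). Reconstructing one row 
touches every column's data pack, so if HDFS scatters the packs, a single 
row read turns into several remote reads:

    import java.util.Arrays;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    public class RowPackDemo {
      // One data pack: the values of a single column for one row pack,
      // plus the datanode where HDFS happened to place it.
      static class DataPack {
        final List<Object> values;
        final String datanode;
        DataPack(List<Object> values, String datanode) {
          this.values = values;
          this.datanode = datanode;
        }
      }

      public static void main(String[] args) {
        // A row pack of two rows, stored as one data pack per column.
        Map<String, DataPack> rowPack = new HashMap<String, DataPack>();
        rowPack.put("id",
            new DataPack(Arrays.<Object>asList(1, 2), "datanodeA"));
        rowPack.put("name",
            new DataPack(Arrays.<Object>asList("a", "b"), "datanodeB"));

        // Reading row 0 touches the packs on datanodeA *and* datanodeB;
        // each access may become a remote read, which is the cost above.
        for (String column : Arrays.asList("id", "name")) {
          DataPack pack = rowPack.get(column);
          System.out.println(column + "=" + pack.values.get(0)
              + " (read from " + pack.datanode + ")");
        }
      }
    }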

Third, in your proposal every block (data pack) contains the same number of 
rows. I'm wondering why an equal row count per block is required.
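
One motivation I can guess at (my speculation only, not something the 
proposal states) is that a fixed row count per pack makes row addressing 
pure integer arithmetic, with no per-pack index:

    public class FixedPackAddressing {
      static final int ROWS_PER_PACK = 65536; // hypothetical pack size

      public static void main(String[] args) {
        long rowId = 1000000L;
        long packId = rowId / ROWS_PER_PACK; // which pack holds the row
        long offset = rowId % ROWS_PER_PACK; // the row's position in it
        System.out.println("row " + rowId + " -> pack " + packId
            + ", offset " + offset);
      }
    }

Is that the reason, or is there something else?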

Fourth, what do you mean by idle workers? Are they the workers that are not 
selected to store the cached table? 

Since I have a great interest in your proposal, I have left a long comment. 
I'll wait for your answer.

Thanks,
Jihoon

(1) Avrilia Floratou, Jignesh M. Patel, Eugene J. Shekita, and Sandeep Tata. 
2011. Column-oriented storage techniques for MapReduce. Proc. VLDB Endow. 4, 7 
(April 2011), 419-429.
http://pages.cs.wisc.edu/~jignesh/publ/colMR.pdf



> Umbrella ticket for accelerating query speed through memory cached table
> ------------------------------------------------------------------------
>
>                 Key: TAJO-472
>                 URL: https://issues.apache.org/jira/browse/TAJO-472
>             Project: Tajo
>          Issue Type: New Feature
>          Components: distributed query plan, physical operator
>            Reporter: Min Zhou
>            Assignee: Min Zhou
>         Attachments: TAJO-472 Proposal.pdf
>
>
> Previously, I was involved as a technical expert in an in-memory database 
> for on-line businesses at Alibaba Group. It is an internal project that 
> can do group-by aggregation on billions of rows in less than 1 second. 
> I'd like to apply this technology to Tajo and make it much faster than it 
> is. From some benchmarks, we believe that Spark & Shark is currently the 
> fastest solution among the open source interactive query systems, such as 
> Impala, Presto, and Tajo. The main reason is that it benefits from 
> in-memory data. 
> I will take the memory cached table as my first step to accelerate Tajo's 
> query speed. Actually, this is why I looked into table partitioning 
> during the Xmas and New Year holidays. 
> Will submit a proposal soon.


