[
https://issues.apache.org/jira/browse/TAJO-472?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13862234#comment-13862234
]
Jihoon Son edited comment on TAJO-472 at 1/4/14 7:25 AM:
---------------------------------------------------------
Min, thanks for your detailed comments.
The solutions you suggest above for reducing the cost of data shuffling are
either already implemented in Tajo (e.g., replicated join) or planned (e.g., the
pull transfer model).
Also, the early data shuffling you mention in your last comment is definitely necessary.
As soon as we are ready, we should start on the above work.
Anyway, I'm very glad that we have identified the same problem. I looked into
your proposal, including Infobright's paper, and have a couple of things to
discuss. (I have only skimmed the paper, so please forgive me if I have
misunderstood parts of it.)
First, in your proposal cached tables are stored on HDFS partitioned by hash.
This seems to handle the first join case you described above. However, hash
partitioning is only a logical layout in Tajo. That is, rows are stored in
multiple HDFS directories according to their hash values, but the data
themselves are distributed across HDFS datanodes arbitrarily. Thus, I think it
is hard to physically pre-partition the cached tables by using the PARTITION BY
HASH clause.
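To make the distinction concrete, below is a minimal sketch of writing one row
under a hash layout. The bucket count, directory layout, and key value are
hypothetical (this is not Tajo's actual writer code); the only point is that
the writer picks the directory while HDFS independently picks the datanodes.
{code:java}
// Minimal sketch, not Tajo's writer code. The bucket count, directory layout,
// and key value are hypothetical. It only illustrates that PARTITION BY HASH
// controls the logical placement (which directory a row lands in), while HDFS
// alone decides the physical placement (which datanodes hold the block replicas).
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HashPartitionLayoutSketch {
  public static void main(String[] args) throws IOException {
    FileSystem fs = FileSystem.get(new Configuration());

    int numBuckets = 4;                      // hypothetical bucket count
    String joinKey = "customer_42";          // hypothetical join key value

    // Logical placement: the writer chooses the bucket directory for this row.
    int bucket = (joinKey.hashCode() & Integer.MAX_VALUE) % numBuckets;
    Path bucketDir = new Path("/cache/orders/key_bucket_" + bucket); // hypothetical path

    // Physical placement: HDFS chooses the datanodes for the block replicas.
    // Nothing here ties bucket_0..bucket_3 to particular datanodes, so two
    // tables hashed on the same key are not guaranteed to be co-located.
    try (FSDataOutputStream out = fs.create(new Path(bucketDir, "part-000000"))) {
      out.writeBytes(joinKey + "|...\n");
    }
  }
}
{code}
Unless the physical placement is also controlled, a join on the hashed key can
still have to move data between nodes even though both tables were written with
the same bucketing.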
Second, you use a columnar file format similar to Infobright's. In that format,
a table is horizontally partitioned into _row packs_, and each row pack is then
stored as multiple _data packs_, one per column. In other words, each column of
a row pack is stored in a separate data pack (see the first paragraph of
Section 2 of the paper). However, the separate data packs of a row pack might
be stored on different datanodes in HDFS. This means that reading a single row
might require network communication to fetch column values stored on multiple
datanodes. To handle this problem, RCFile and Trevni break a table into
multiple row groups and place each row group within a single HDFS block.
(Actually, in RCFile multiple row groups can share one HDFS block.) In CIF (1),
a split directory (which is a kind of row pack) is stored as multiple files,
one per column, but the files of a split directory are placed on the same
datanode by means of a custom HDFS block placement policy. So, I wonder how you
handle this problem.
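To illustrate this co-location question, here is a minimal sketch that only
inspects block locations. The per-column layout /cache/lineitem/rowpack_0/<column>.pack
and the column names are hypothetical, not taken from your proposal.
{code:java}
// Minimal sketch; the per-column file layout and column names are hypothetical.
// It prints the datanode hosts of each column file of one row pack. With the
// default HDFS placement policy there is no guarantee that these host lists
// overlap, which is exactly the cross-datanode row-reassembly concern above.
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ColumnCoLocationCheck {
  public static void main(String[] args) throws IOException {
    FileSystem fs = FileSystem.get(new Configuration());
    String[] columns = {"l_orderkey", "l_partkey", "l_quantity"};  // hypothetical columns

    for (String col : columns) {
      Path pack = new Path("/cache/lineitem/rowpack_0/" + col + ".pack");
      FileStatus status = fs.getFileStatus(pack);
      // Each column file carries its own block locations, chosen by HDFS.
      for (BlockLocation loc : fs.getFileBlockLocations(status, 0, status.getLen())) {
        System.out.println(col + " -> " + String.join(",", loc.getHosts()));
      }
    }
  }
}
{code}
If the host lists printed for the column files do not overlap, reconstructing a
row from this row pack requires remote reads; preventing that is what CIF's
custom block placement policy is for.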
Third, in your proposal every block (data pack) contains the same number of
rows. I'm wondering why each block must hold an equal number of rows.
Fourth, what do you mean by idle workers? Are they the workers that are not
selected to store the cached table?
Since I have a great interest in your proposal, I have left a long comment.
I'll wait for your answer.
Thanks,
Jihoon
(1) Avrilia Floratou, Jignesh M. Patel, Eugene J. Shekita, and Sandeep Tata.
2011. Column-oriented storage techniques for MapReduce. Proc. VLDB Endow. 4, 7
(April 2011), 419-429.
http://pages.cs.wisc.edu/~jignesh/publ/colMR.pdf
> Umbrella ticket for accelerating query speed through memory cached table
> ------------------------------------------------------------------------
>
> Key: TAJO-472
> URL: https://issues.apache.org/jira/browse/TAJO-472
> Project: Tajo
> Issue Type: New Feature
> Components: distributed query plan, physical operator
> Reporter: Min Zhou
> Assignee: Min Zhou
> Attachments: TAJO-472 Proposal.pdf
>
>
> Previously, I was involved as a technical expert in an in-memory database
> for on-line businesses at Alibaba Group. That was an internal project, which
> can do group-by aggregation on billions of rows in less than 1 second.
> I'd like to apply this technology to Tajo and make it much faster than it is.
> From some benchmarks, we believe that Spark & Shark is currently the fastest
> solution among all the open-source interactive query systems, such as Impala,
> Presto, and Tajo. The main reason is that it benefits from in-memory data.
> I will take the memory cached table as my first step to accelerate Tajo's
> query speed. Actually, this is the reason why I was focused on table
> partitioning during the Xmas and New Year holidays.
> Will submit a proposal soon.
>
--
This message was sent by Atlassian JIRA
(v6.1.5#6160)