[
https://issues.apache.org/jira/browse/TAJO-472?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13862751#comment-13862751
]
Jihoon Son commented on TAJO-472:
---------------------------------
Min, sorry about my misunderstanding. Thanks to your additional comments, I
finally understand your proposal.
According to your proposal, a cached table is stored on HDFS with hash
partitioning for reliability. Once a table is stored on HDFS, the Tajo master
selects a number of workers to cache the partitioned table in their memory.
Thus, the data is already pre-shuffled once the workers finish downloading the
partitioned data from HDFS to their local disks. I think that this is a good
prototype for the data cache.
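Just to confirm my understanding of the flow, here is a rough sketch in Java.
The class and method names are hypothetical and not actual Tajo APIs; it only
illustrates that rows are hash-partitioned on the key at write time, and the
master then maps each partition to a cache worker.

    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    // Sketch only, with hypothetical names (not real Tajo classes). A cached table is
    // written to HDFS as hash partitions; the master assigns each partition to a worker,
    // which downloads it to local disk and pins it in memory. Because rows with the same
    // key always land in the same partition, the data is effectively pre-shuffled.
    public class CacheAssignmentSketch {

      // Hash partitioning at write time: rows with equal keys go to the same partition.
      static int partitionOf(Object key, int numPartitions) {
        return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
      }

      // The master picks one cache worker per partition (here: simple round-robin).
      static Map<Integer, String> assignPartitions(int numPartitions, List<String> workers) {
        Map<Integer, String> assignment = new HashMap<>();
        for (int p = 0; p < numPartitions; p++) {
          assignment.put(p, workers.get(p % workers.size()));
        }
        return assignment;
      }
    }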
It is great that we build indexes on data packs; this is definitely necessary
for partition pruning. However, we also have to consider the performance of the
sequential scan. As you said, the index is useful only when the selectivity is
quite low, so the sequential scan covers the other cases. When data packs
contain the same number of rows, their byte lengths differ according to their
contents (or types). This means more file opens and closes are required during
the sequential scan when the value size is very small, such as a single byte,
which definitely makes the sequential scan slower. How about changing data
packs to have the same byte length instead?
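To illustrate the point, here is a small back-of-the-envelope sketch. The sizes
are hypothetical numbers, not measurements: with a fixed row count, the pack
size on disk shrinks with the value width, so a scan touches many more files;
with a fixed byte length, the number of file opens stays constant.

    // Illustrative arithmetic only (hypothetical sizes, not measurements): compares how
    // many data-pack files a sequential scan must open when packs hold a fixed number of
    // rows versus a fixed number of bytes.
    public class DataPackSizingSketch {
      public static void main(String[] args) {
        long totalRows = 100_000_000L;   // rows in one column of the cached table
        int valueWidth = 1;              // bytes per value, e.g. a TINYINT column

        // Fixed row count per pack: the pack size depends on the value width.
        long rowsPerPack = 65_536;
        long packsByRows = (totalRows + rowsPerPack - 1) / rowsPerPack;

        // Fixed byte length per pack: the pack count depends only on total bytes.
        long bytesPerPack = 8L * 1024 * 1024;  // 8 MB packs
        long totalBytes = totalRows * valueWidth;
        long packsByBytes = (totalBytes + bytesPerPack - 1) / bytesPerPack;

        System.out.println("file opens with fixed rows  : " + packsByRows);   // ~1526
        System.out.println("file opens with fixed bytes : " + packsByBytes);  // ~12
      }
    }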
In addition, we need to balance the data distribution to fully utilize the
parallelism. Thus, when the Tajo master selects workers, it should take the
data distribution into account as well as the workers' remaining resources.
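Something along these lines might work for the selection policy. Again this is
only a sketch with hypothetical fields, not Tajo code: score candidates by their
free memory, but penalize workers that already cache many partitions so the
partitions spread out evenly.

    import java.util.Comparator;
    import java.util.List;

    // Sketch of a selection policy (hypothetical fields, not Tajo code): prefer workers
    // with more free memory, but discount workers that already hold many cached
    // partitions so the cached table stays evenly distributed across the cluster.
    public class WorkerSelectionSketch {

      static class Worker {
        String host;
        long freeMemoryBytes;     // remaining resources
        int cachedPartitions;     // current share of the cached table

        Worker(String host, long freeMemoryBytes, int cachedPartitions) {
          this.host = host;
          this.freeMemoryBytes = freeMemoryBytes;
          this.cachedPartitions = cachedPartitions;
        }
      }

      // Higher score is better: free memory minus the memory already committed to caching.
      static double score(Worker w, long avgPartitionBytes) {
        return w.freeMemoryBytes - (double) w.cachedPartitions * avgPartitionBytes;
      }

      static Worker pick(List<Worker> candidates, long avgPartitionBytes) {
        return candidates.stream()
            .max(Comparator.comparingDouble(w -> score(w, avgPartitionBytes)))
            .orElseThrow(IllegalStateException::new);
      }
    }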
Thanks,
Jihoon
> Umbrella ticket for accelerating query speed through memory cached table
> ------------------------------------------------------------------------
>
> Key: TAJO-472
> URL: https://issues.apache.org/jira/browse/TAJO-472
> Project: Tajo
> Issue Type: New Feature
> Components: distributed query plan, physical operator
> Reporter: Min Zhou
> Assignee: Min Zhou
> Attachments: TAJO-472 Proposal.pdf
>
>
> Previously, I was involved as a technical expert in an in-memory database
> for on-line businesses at Alibaba Group. It is an internal project that
> can do group-by aggregation on billions of rows in less than 1 second.
> I'd like to apply this technology to Tajo and make it much faster than it is.
> From some benchmarks, we believe that Spark & Shark is currently the fastest
> solution among all the open-source interactive query systems, such as Impala,
> Presto, and Tajo. The main reason is that it benefits from in-memory data.
> I will take the memory cached table as my first step toward accelerating the
> query speed of Tajo. Actually, this is the reason why I was concerned with
> table partitioning during the Xmas and New Year holidays.
> Will submit a proposal soon.
>
--
This message was sent by Atlassian JIRA
(v6.1.5#6160)