[ https://issues.apache.org/jira/browse/TAJO-283?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13855821#comment-13855821 ]

Min Zhou edited comment on TAJO-283 at 12/23/13 6:40 PM:
---------------------------------------------------------

[~jihoonson]
That's definitely a good question!
I have thought about this problem. First, we should figure out how large a
number of partitions is acceptable. From my experience, MySQL works well if we
insert thousands of rows at a time, and even tens of thousands are still
acceptable. But if the order of magnitude grows to hundreds of thousands, or
even millions or more, MySQL becomes very slow at inserting and retrieving
those records.
When we use HASH partitioning, the number of partitions is under control,
since we can define the number of buckets of the hash function; normally it
should be tens or hundreds (see the sketch after Hive's example below). RANGE
and LIST partitioning work as well, because their partitions are enumerable.
The worst situation, I think, is when we use COLUMN partitions on a table,
which is quite similar to Hive's dynamic partitioning shown below.
{noformat}
CREATE TABLE dst_tbl (key int, value string) PARTITIONED BY (col1 string, col2
int) AS
SELECT key, value, col1, col2 FROM src_tbl
{noformat}
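For contrast with the dynamic case above, here is the HASH sketch mentioned
earlier: the partition count is fixed up front, regardless of the data. The
PARTITION BY HASH ... PARTITIONS syntax below is only an assumption borrowed
from RDBMS-style DDL; Tajo's final syntax is not settled in this thread.
{noformat}
-- Hypothetical RDBMS-style DDL: the table can never have more than
-- 32 partitions, no matter how many distinct values col1 holds.
CREATE TABLE dst_tbl (key int, value string, col1 string, col2 int)
PARTITION BY HASH (col1) PARTITIONS 32;
{noformat}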
Query users usually have no knowledge of the table's value distribution. If
the partition columns have high cardinality (i.e., many distinct values), that
would be a disaster in the following areas (a rough way to gauge the scale is
sketched after this list):
1. The number of files/directories on HDFS would be very large, putting big
pressure on the HDFS NameNode's memory.
2. As you mentioned, this would be a big problem for the catalog.
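One way to gauge the scale before running such a CTAS is to check the
partition-column cardinality first. A minimal sketch against the src_tbl from
the example above:
{noformat}
-- Each distinct (col1, col2) pair becomes one HDFS directory like
-- /dst_tbl/col1=.../col2=... plus one catalog entry, so this count
-- is the number of partitions the CTAS above would create.
SELECT COUNT(*) FROM (SELECT DISTINCT col1, col2 FROM src_tbl) t;
{noformat}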

Actually, for the above reasons, we disabled dynamic partitioning at
Alibaba.com, my previous employer, which runs one of the largest single Hadoop
clusters in the world. I think you will run into the same problem when using
column partitioning. I don't know why you guys decided to support such a
feature; could you give me some background on it? How can we benefit from
column partitions?

[~hyunsik]
It's good to know Tajo will support indexes. I saw the binary search tree
index in the branch. Actually, I am considering adding a Lucene index to Tajo,
through which we could implement an online BA system on top of Tajo, like
SenseiDB. We could do group-by aggregations over billions of rows in only a
few milliseconds. If I implement it, we could put Tajo into production at
LinkedIn, my current employer.

[~hyunsik] [~jihoonson] 
Thank you. Merry Christmas!

Min
  


> Add Table Partitioning
> ----------------------
>
>                 Key: TAJO-283
>                 URL: https://issues.apache.org/jira/browse/TAJO-283
>             Project: Tajo
>          Issue Type: New Feature
>          Components: catalog, physical operator, planner/optimizer
>            Reporter: Hyunsik Choi
>            Assignee: Hyunsik Choi
>             Fix For: 0.8-incubating
>
>
> Table partitioning provides many facilities for maintaining large tables.
> First of all, it enables the data management system to prune input data that
> is not actually necessary. In addition, it gives the system more
> optimization opportunities that exploit the physical layouts.
> Basically, Tajo should follow the RDBMS-style partitioning system, including
> range, list, hash, and so on. In order to keep Hive compatibility, we also
> need to add a Hive partition type that does not exist in existing DBMS
> systems.


