Hi community,

Currently for SHOW SEGMENT command, carbon will print:

+---+---------+-----------------------+-----------------------+---------+-----------+---------+----------+----+
|SegmentSequenceId   |Status   |Load Start Time        |Load End Time          
|Merged To|File Format|Data Size|Index Size|Path|
+---+---------+-----------------------+-----------------------+---------+-----------+---------+----------+----+
|5  |Compacted|2020-02-15 22:19:06.327|2020-02-15 22:19:06.516|4.1      
|COLUMNAR_V3|630.0B   |421.0B    |NA  |
|4.1|Success  |2020-02-15 22:19:06.327|2020-02-15 22:19:06.659|NA       
|COLUMNAR_V3|640.0B   |425.0B    |NA  |
|4  |Compacted|2020-02-15 22:19:05.894|2020-02-15 22:19:06.188|4.1      
|COLUMNAR_V3|630.0B   |421.0B    |NA  |
|3  |Compacted|2020-02-15 22:19:05.151|2020-02-15 22:19:05.391|2.1      
|COLUMNAR_V3|630.0B   |421.0B    |NA  |
|2.1|Compacted|2020-02-15 22:19:05.151|2020-02-15 22:19:05.592|0.2      
|COLUMNAR_V3|640.0B   |425.0B    |NA  |
|2  |Compacted|2020-02-15 22:19:04.804|2020-02-15 22:19:05.023|2.1      
|COLUMNAR_V3|630.0B   |421.0B    |NA  |
|1  |Compacted|2020-02-15 22:19:04.006|2020-02-15 22:19:04.242|0.1      
|COLUMNAR_V3|630.0B   |421.0B    |NA  |
|0.2|Success  |2020-02-15 22:19:05.151|2020-02-15 22:19:05.79 |NA       
|COLUMNAR_V3|658.0B   |425.0B    |NA  |
|0.1|Compacted|2020-02-15 22:19:04.006|2020-02-15 22:19:04.704|0.2      
|COLUMNAR_V3|640.0B   |425.0B    |NA  |
|0  |Compacted|2020-02-15 22:19:02.335|2020-02-15 22:19:03.775|0.1      
|COLUMNAR_V3|630.0B   |421.0B    |NA  |
+---+---------+-----------------------+-----------------------+---------+-----------+---------+----------+——+

While it prints basic info of segments, it does not support following usecase:

UserCase 1. Show for partition table
For partition table, have no way to know about the which partition that each 
segment mapped to. The partition-segment mapping is two way mapping, user may 
want to check which segments is belong to certain partition.

==> Suggest to add partition information for each segment when showing segments


UserCase 2. Show load delay
Have no way to know about the event time in the segment. By event time, I mean 
the timestamp column value in the record being loaded. In many cases, user 
wanted to know the load delay which indicate whether the system loading is 
catching up with the incoming data.

==> For this requirement, one suggestion is that we can show the minmax of the 
event time column in SHOW SEGMENTS command, then user can know about the 
difference of the Load Start Time and Event Time in the record. This solution 
depends on Segment level minmax (PR3584), and we may need user to specify the 
event_time column name in table property when creating table or use alter table 
to add it for existing table.


UserCase 3. Show load speed
Currenlty user need to calculate the load speed by Load Start Time and Load End 
Time, it is not so convenient for user

==> We can keep the Load Start Time and calculate the elapsed time and 
throughput(in MBps) by SHOW SEGMENTS automatically.


UserCase 4. Show more information and give more control to user, like filtering 
and sorting
Besides above information, actually there is more information kept in segment 
metadata but have not shown to the user. And sometimes user wanted to sort the 
output by certain field, like sort by load time

==> To show more detail info of the segment, I can think of two ways: 
    1) add SHOW SEGMENTS EXTENDED option
    2) To avoid making the output table too wide to read, I prefer to add DESC 
SEGMENTS command to show detail info for one segment by specifying the segment 
id by user.

And, to give more control for user, we can add some options for sorting and 
filter. 


In summary, to support all above use cases, I suggest to change the command 
like:

Example 1:
// show basic info
SHOW SEGMENTS ON table1
will output
+---+---------+-----------------------+-----------------------+------+-----------+
|ID |Status   |Load Start Time        |Event Time Min         |Spent 
|Throughput |
+---+---------+-----------------------+-----------------------+------+-----------+
|2  |Success  |2020-02-15 22:10:00.327|2020-02-15 21:06:00.327|3m    |28MBps    
 |
|1  |Success  |2020-02-15 22:15:08.283|2020-02-15 22:11:06.432|28s   |30MBps    
 |
|0  |Success  |2020-02-15 22:20:25.894|2020-02-15 22:17:06.621|4m6s  |25MBps    
 |
+---+---------+-----------------------+-----------------------+------+-----------+

this is short enough so that it can fit in most of the console line width

Example 2:
// To show partition for each segment
SHOW SEGMENTS ON table1 PARTITION
output:
+---+---------+-----------------------+-----------------------+--------------------+
|ID |Status   |Load Start Time        |Event Time Min         |PARTITION        
   |
+---+---------+-----------------------+-----------------------+--------------------+
|2  |Success  |2020-02-15 22:10:00.327|2020-02-15 
21:06:00.327|dt=29,planet=earth  |
|1  |Success  |2020-02-15 22:15:08.283|2020-02-15 
22:11:06.432|dt=30,planet=earth  |
|0  |Success  |2020-02-15 22:20:25.894|2020-02-15 
22:17:06.621|dt=34,planet=mars   |
+---+---------+-----------------------+-----------------------+--------------------+


Example 3:
// To filter out one partition
SHOW SEGMENTS ON table1 OPTOINS (‘partition’=‘dt=30,planet=earth’)

+---+---------+-----------------------+-----------------------+--------------------+
|ID |Status   |Load Start Time        |Event Time Min         |PARTITION        
   |
+---+---------+-----------------------+-----------------------+--------------------+
|1  |Success  |2020-02-15 22:15:08.283|2020-02-15 
22:11:06.432|dt=30,planet=earth  |
+---+---------+-----------------------+-----------------------+——————————+


Example 4:
// To sort by load start time
SHOW SEGMENTS ON table1 OPTOINS (’sort_by’=‘load_start_time')

+---+---------+-----------------------+-----------------------+--------------------+
|ID |Status   |Load Start Time        |Event Time Min         |PARTITION        
   |
+---+---------+-----------------------+-----------------------+--------------------+
|0  |Success  |2020-02-15 22:20:25.894|2020-02-15 
22:17:06.621|dt=34,planet=mars   |
|1  |Success  |2020-02-15 22:15:08.283|2020-02-15 
22:11:06.432|dt=30,planet=earth  |
|2  |Success  |2020-02-15 22:10:00.327|2020-02-15 
21:06:00.327|dt=29,planet=earth  |
+---+---------+-----------------------+-----------------------+--------------------+

The sort field can be: id, load_start_time, event_time_min, event_time_max, 
data_size, index_size, and column_name_min or column_name_max, where 
column_name can be any literal string and exist in the table schema.


Example 5:
// To show more detail info
DESC SEGMENT 2 ON table1

This will show all information for segment id 2.  Information including not 
only table status file but also info in segment file.


I hope this suggestion can make segment info more visible to the user.
What do you think?


Regards,
Jacky

Reply via email to