A small update on the merge flow.

Currently, in the local_sort SI merge, task launch is based on size: one
task is launched for each carbon file that will be formed after the merge
[CarbonSIRebuildRDD.internalGetPartitions].
The global_sort merge also implements identifying global_sort_partitions
based on how many carbon files will be formed after the merge (similar to
the local sort flow).
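
To make the size-based idea concrete, here is a rough sketch of how a task
count could be derived from the segment size. This is not the actual
CarbonSIRebuildRDD code; the class name, the method name and the target
file size parameter are hypothetical and only illustrate the arithmetic.

```java
import java.util.List;

// Hypothetical helper, not part of CarbonData: estimates how many merge
// tasks to launch from the sizes of the small files in an SI segment.
final class SiMergeTaskEstimator {
    private SiMergeTaskEstimator() {}

    // Returns ceil(totalSize / targetFileSize), i.e. one task per carbon
    // file expected after the merge. targetFileSizeInBytes is an assumed
    // input, not a real CarbonData property.
    static int estimateMergeTasks(List<Long> fileSizesInBytes, long targetFileSizeInBytes) {
        if (targetFileSizeInBytes <= 0) {
            throw new IllegalArgumentException("target file size must be positive");
        }
        long total = 0L;
        for (long size : fileSizesInBytes) {
            total += size;
        }
        if (total == 0L) {
            return 0; // nothing to merge
        }
        // Integer ceiling division without floating point.
        return Math.toIntExact((total + targetFileSizeInBytes - 1) / targetFileSizeInBytes);
    }
}
```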

But we need to conclude whether the merge flow is really required, or
whether we can just keep SI loading itself as one node one task logic
[similar to our main table local sort] and avoid the need for the merge
operation.
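
As a rough illustration of the one node one task idea, the sketch below
groups the blocks to be read by the node that hosts them, so that each node
gets a single task. The class, the method and the block-to-node map are
hypothetical and are not taken from the CarbonData code base.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper, not part of CarbonData: builds one task per node.
final class NodeTaskGrouper {
    private NodeTaskGrouper() {}

    // Input: block name -> node that hosts the block (assumed to be known).
    // Output: node -> all blocks that the single task on that node reads.
    static Map<String, List<String>> groupBlocksByNode(Map<String, String> blockToNode) {
        Map<String, List<String>> tasks = new TreeMap<>();
        for (Map.Entry<String, String> entry : blockToNode.entrySet()) {
            tasks.computeIfAbsent(entry.getValue(), node -> new ArrayList<>())
                 .add(entry.getKey());
        }
        return tasks;
    }
}
```

With this grouping the number of tasks equals the number of nodes holding
data, not the number of carbon files in the main table segment.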

Thanks,
Ajantha

On Fri, Nov 6, 2020 at 4:41 PM Ajantha Bhat <ajanthab...@gmail.com> wrote:

> Hi,
>
> When the carbon property *carbon.si.segment.merge = true*:
>
> *a)  local_sort SI segment loading (default) [All the SI columns are
> involved]*
>
> The SI load uses the default local_sort. Data is loaded twice. The first
> load queries the main table and creates the SI segment (here the number of
> tasks launched equals the number of carbon files present in the main table
> segment); during this operation SI currently creates many small files.
> Then the merge operation queries the newly created SI segment and loads the
> data by local_sort again (here few tasks are launched, one node one task),
> so fewer files are created.
>
> *>> So, we can optimize the first SI segment creation itself to use one
> node one task logic, avoid creating small files, and stop calling the
> merge operation. With this, we can remove the carbon.si.segment.merge
> property itself.*
>
> *b) global_sort SI segment loading [All the SI columns are involved]*
>
> The SI load uses global sort. Data is loaded twice. The first load queries
> the main table and creates the SI segment (here the number of tasks
> launched (global_sort_partitions) equals the number of carbon files present
> in the main table segment); during this operation SI currently creates
> many small files.
> Then the merge operation queries the newly created SI segment and loads the
> data by local sort again [there is no global sort logic at present] (here
> few tasks are launched, one node one task), but this disorders the
> globally sorted data!
>
> *>> So, the user can configure global_sort_partitions, but if the user
> has not configured it, the code can use global_sort_partitions = number of
> active nodes to load the data, avoid creating small files, and stop
> calling the merge operation. With this, we can remove the
> carbon.si.segment.merge property itself.*
>
> *c) REFRESH INDEX <index_table>  ON TABLE <main_table>*
>
> If the user created the SI table in a previous version and it has small
> files, this command can be used to merge them. But if the user drops the
> index and creates it again, this command is not needed either [because
> merging and creating a new SI take a similar time]. So, do we need to
> support this command for global sort?
> If we decide to retain the rebuild command, then for global_sort we need
> to add a new implementation, as this command has only local sort code.
>
> Let me know your opinion on this.
>
> Thanks,
> Ajantha
>
