Hi John,

We noticed that a simple group-by query takes disproportionately longer on multiple partitions than on a single partition. The profiler shows that the bottleneck is the conversion of the original column from category ids to strings, which causes a large number of string allocations. The group-by operations (sort, reduce) then run on strings instead of on the category ids. The conversion seems to be necessary because the ids are not consistent across partitions.

Instead, I propose re-mapping the ids into a shared dictionary in ibis::bord::column::append. The diff is attached. We observe a speedup of about 5x to 10x, depending on the number of columns in the group-by.
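
To illustrate the re-mapping idea, here is a rough standalone sketch. It is not the attached diff and does not use any FastBit types. SharedDictionary, intern and appendRemapped are made-up names, and I am assuming each partition carries its own local dictionary (local id to string) plus a vector of local ids.

```cpp
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical shared dictionary: one id per distinct string across all
// partitions. Not a FastBit class; for illustration only.
struct SharedDictionary {
    std::unordered_map<std::string, std::uint32_t> idOf;  // string -> shared id
    std::vector<std::string> valueOf;                     // shared id -> string

    // Return the shared id for s, adding s if it has not been seen before.
    std::uint32_t intern(const std::string& s) {
        auto it = idOf.find(s);
        if (it != idOf.end()) return it->second;
        const std::uint32_t id = static_cast<std::uint32_t>(valueOf.size());
        valueOf.push_back(s);
        idOf.emplace(s, id);
        return id;
    }
};

// Append one partition's ids to 'out', translating local ids to shared ids.
// Strings are touched once per distinct local value, not once per row.
void appendRemapped(SharedDictionary& shared,
                    const std::vector<std::string>& localDict,
                    const std::vector<std::uint32_t>& localIds,
                    std::vector<std::uint32_t>& out) {
    // Build the local-id -> shared-id table for this partition.
    std::vector<std::uint32_t> remap;
    remap.reserve(localDict.size());
    for (const std::string& s : localDict) remap.push_back(shared.intern(s));

    // Rewrite the rows using integer lookups only; at() throws on a bad id.
    out.reserve(out.size() + localIds.size());
    for (std::uint32_t id : localIds) out.push_back(remap.at(id));
}
```

With the ids consistent across partitions, the sort and reduce steps can work on integers, and strings are only needed when the final result is printed.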
Steve
Attachments: bord.cpp.diff, bord.h.diff
_______________________________________________ FastBit-users mailing list [email protected] https://hpcrdm.lbl.gov/cgi-bin/mailman/listinfo/fastbit-users
