clintropolis edited a comment on issue #7919: disable all compression in intermediate segment persists while ingestion
URL: https://github.com/apache/incubator-druid/pull/7919#issuecomment-503283496
 
 
   Hmm, I _really_ like the idea of being able to separately control this stuff for the intermediary segments, so mega +1 on that. However, I'm not sure how I feel about straight `UNCOMPRESSED` being the default behavior (if I understand this PR correctly). I think we should consider whether that is the best way to avoid unbounded usage of the 64k processing buffers used for decompression, and maybe we should measure some things first? My fear is that this default could change the dynamics of realtime indexing tasks quite a lot, namely how much pressure they put on the page cache, potentially exacerbating issues like the one described in #6699 (though I suspect running realtime tasks via YARN is rare-ish).
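   
   To make the "separately control" part concrete, I'm picturing something like this in the `tuningConfig`, where operators could explicitly opt in to uncompressed intermediary persists rather than getting it by default (the `indexSpecForIntermediatePersists` field name is just my guess at what this PR adds; the nested fields are the existing `indexSpec` options):
   
   ```json
   {
     "tuningConfig": {
       "type": "kafka",
       "indexSpec": {
         "dimensionCompression": "lz4",
         "metricCompression": "lz4",
         "longEncoding": "longs"
       },
       "indexSpecForIntermediatePersists": {
         "dimensionCompression": "uncompressed",
         "metricCompression": "none",
         "longEncoding": "longs"
       }
     }
   }
   ```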
   
   Experimentation I did related to #6016 showed an often _very dramatic_ size difference between compressed and uncompressed data, particularly for int and long typed columns, which I don't think can be ignored. Even the size difference between the `CompressedVSizeByte` and `VSizeByte` versions of int columns could be very large. I will see if I can dig up the measurements where I collected uncompressed and/or vsize byte-packed sizes and share them here.
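   
   In the meantime, here is a back-of-the-envelope way to see the effect on a long column, using the lz4-java library Druid already depends on. This compresses the whole array in one shot rather than in 64k blocks like the actual column format, so treat it as a ballpark sketch rather than a measurement of real segments:
   
   ```java
   import net.jpountz.lz4.LZ4Compressor;
   import net.jpountz.lz4.LZ4Factory;
   
   import java.nio.ByteBuffer;
   import java.nio.ByteOrder;
   
   public class LongColumnSizeSketch
   {
     public static void main(String[] args)
     {
       // a million rows of a low-ish cardinality long metric, the kind of column
       // where compressed vs. uncompressed size differences tend to be dramatic
       final int rows = 1_000_000;
       final ByteBuffer raw = ByteBuffer.allocate(rows * Long.BYTES).order(ByteOrder.LITTLE_ENDIAN);
       for (int i = 0; i < rows; i++) {
         raw.putLong(i % 1000);
       }
   
       final LZ4Compressor lz4 = LZ4Factory.fastestInstance().fastCompressor();
       final byte[] compressed = lz4.compress(raw.array());
   
       System.out.printf(
           "uncompressed: %,d bytes, lz4: %,d bytes (%.1f%% of original)%n",
           raw.capacity(),
           compressed.length,
           100.0 * compressed.length / raw.capacity()
       );
     }
   }
   ```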
   
   Some other ideas have been suggested to help mitigate merging issues like this. #5526 proposes adjustments to the merging algorithm, such as producing the dictionary out of band and cutting down on duplicated work, which I suspect would reduce overall memory usage.
   
   #7900 additionally suggests something that would specifically help with the unbounded decompression buffers at merge time, in the form of a sort of hierarchical merge. Since we can measure how many intermediary segments we have and the column count of each, we could calculate how much buffer space is required and divide the merge work up as necessary to keep total usage at a reasonable size. If I understand correctly, it also suggests some reworking of the merge algorithm similar to what is mentioned in #5526.
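   
   The arithmetic there seems pretty tractable. A rough sketch of the sort of thing I mean (names are made up for illustration, nothing from the actual codebase): with a ~64k decompression buffer per column per segment in a merge pass, cap the fan-in of each pass to fit a budget and do additional passes over the intermediate results:
   
   ```java
   public class MergePartitionSketch
   {
     // decompression block buffer needed per column of each segment in a merge pass
     private static final long BUFFER_PER_COLUMN = 64 * 1024;
   
     // how many intermediary segments we can merge at once without blowing the budget
     static int segmentsPerMergePass(int numSegments, int columnsPerSegment, long maxMergeBufferBytes)
     {
       final long perSegment = columnsPerSegment * BUFFER_PER_COLUMN;
       final int fanIn = (int) Math.max(2, maxMergeBufferBytes / perSegment);
       return Math.min(fanIn, numSegments);
     }
   
     public static void main(String[] args)
     {
       // e.g. 200 intermediary persists with 50 columns each and a 128MB budget:
       // merge ~40 at a time, then merge the 5 resulting segments in a second pass
       System.out.println(segmentsPerMergePass(200, 50, 128L * 1024 * 1024));
     }
   }
   ```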
   
   The other thing that makes me think this might not be the best _default_ behavior, at least, is that to simplify things for new users, the getting-started and small-cluster tuning documentation suggests running co-tenant MiddleManager and Historical processes. If the uncompressed column size differences are noticeable, I suspect this will greatly increase contention for free OS memory between those processes, especially at merge time, when all columns of all intermediary segments get paged in (#6699 again; there is some related discussion in the comments there). This setup already bothers me, and I think this change could make that issue a lot worse.
   
   It's also totally possible I'm being overly cautious, but I think we need more data before going with these defaults.
