simonecorsi commented on issue #6763:
URL: https://github.com/apache/incubator-devlake/issues/6763#issuecomment-1882987284

   Hey, I'm currently working to gain a deeper understanding of the system 
requirements, sizing considerations, and scalability aspects of running DevLake 
against our organization's GitLab instance, particularly in the context of 
managing the substantial amount of data and logs associated with projects 
spanning several years.
   
   I've been examining the `_raw_gitlab_api_job` table and noticed a 
discrepancy in the job count when comparing it with the parsed data. 
   
   My understanding of the data model is that the primary key `id` mirrors the 
unique GitLab job id, so the counts should match 1:1, but as pointed out by 
@antoniomuso there seem to be more rows for some reason. 
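   
   For context, this is roughly the kind of count comparison I've been running. 
It's only a sketch: it assumes the GitLab job id lives under `$.id` in the raw 
table's `data` JSON column, that extracted jobs land in a `_tool_gitlab_jobs` 
table, and that the connection settings below apply; all of these are 
placeholders and may differ in a given deployment.
   
   ```python
   # Rough spot check: raw rows vs. distinct GitLab job ids vs. extracted rows.
   # Table/column names and credentials are assumptions, not confirmed internals.
   import pymysql

   conn = pymysql.connect(host="localhost", user="merico",
                          password="merico", database="lake")

   with conn.cursor() as cur:
       # Total raw rows and distinct job ids found inside the JSON payloads.
       cur.execute("""
           SELECT COUNT(*),
                  COUNT(DISTINCT JSON_EXTRACT(data, '$.id'))
           FROM _raw_gitlab_api_job
       """)
       raw_rows, distinct_job_ids = cur.fetchone()

       # Rows actually materialized by the extractor (hypothetical table name).
       cur.execute("SELECT COUNT(*) FROM _tool_gitlab_jobs")
       (tool_rows,) = cur.fetchone()

   conn.close()

   print(f"raw rows: {raw_rows}, distinct ids in raw JSON: {distinct_job_ids}, "
         f"extracted rows: {tool_rows}")
   ```
   
   If the raw row count is higher than the distinct job id count, the 
duplication sits in the raw layer itself rather than in the extraction step.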
   
   On a dataset this big it's difficult for me to pin down the cause, even by 
picking random samples, so I wanted to ask someone with a wider knowledge of 
DevLake internals.
   
   I understand the need to preserve raw data to avoid reprocessing (which is 
good), but I'm concerned about scalability. If this data cannot be removed or 
cleaned and has to be kept, we could treat it as a long-term JSON bucket. Would 
it then be viable to implement data compression for that bucket? Several 
compression algorithms are available that could significantly reduce storage 
usage.
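   
   To put a rough number on the potential savings, here's a small sketch that 
measures the zlib compression ratio over a sample of raw payloads. It again 
assumes the JSON sits in the `data` column; the connection settings, sample 
size, and compression level are placeholders.
   
   ```python
   # Rough estimate of the space compressing the raw JSON payloads could save.
   # Connection settings, sample size, and compression level are placeholders.
   import zlib

   import pymysql

   conn = pymysql.connect(host="localhost", user="merico",
                          password="merico", database="lake")

   raw_bytes = 0
   compressed_bytes = 0

   with conn.cursor() as cur:
       # Sample a slice of payloads instead of scanning the whole table.
       cur.execute("SELECT data FROM _raw_gitlab_api_job LIMIT 10000")
       for (data,) in cur.fetchall():
           payload = data if isinstance(data, bytes) else str(data).encode("utf-8")
           raw_bytes += len(payload)
           compressed_bytes += len(zlib.compress(payload, level=6))

   conn.close()

   ratio = compressed_bytes / raw_bytes if raw_bytes else 0.0
   print(f"sampled {raw_bytes / 1e6:.1f} MB raw, {compressed_bytes / 1e6:.1f} MB "
         f"compressed ({ratio:.0%} of the original size)")
   ```
   
   Text-heavy JSON like this usually compresses well, so even a generic 
algorithm at a moderate level could make a meaningful difference for a 
multi-year dataset.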
   
   Thanks in advance for any clarification 🙏 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
