simonecorsi commented on issue #6763: URL: https://github.com/apache/incubator-devlake/issues/6763#issuecomment-1882987284
Hey, I'm currently working to gain a deeper understanding of the system requirements, sizing considerations, and scalability aspects of running DevLake against our organization's GitLab instance, particularly in the context of managing the substantial amount of data and logs associated with projects spanning several years.

I've been examining `_raw_gitlab_api_job` and noticed a discrepancy between the job count there and the parsed data. My understanding of the data model is that the primary key `id` mirrors the unique GitLab job id, so the counts should match 1:1, but as pointed out by @antoniomuso there seem to be more raw rows for some reason. Even picking random samples, it's difficult to track this down on a dataset this large, so I wanted to ask someone with deeper knowledge of DevLake's internals (a rough sketch of the check I've been running is below).

I understand the need to preserve raw data to avoid re-collecting it (which is good), but I'm concerned about scalability. If this data cannot be removed or cleaned up and has to be kept, it effectively becomes a long-term JSON bucket; would it then be viable to apply data compression to it? Several compression algorithms could significantly reduce its storage footprint (see the second sketch below for a quick estimate of the potential savings).

Thanks in advance for any clarification 🙏
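
For reference, this is roughly how I've been trying to quantify the discrepancy; a minimal sketch that assumes the raw table stores each job payload as JSON in a `data` blob column (column names and the cast may differ depending on your setup):

```sql
-- Rough duplicate check: compare total raw rows against distinct GitLab job ids
-- extracted from the stored payloads. Assumes the job JSON lives in a `data`
-- blob column; the CAST may be needed before MySQL's JSON functions accept it.
SELECT
  COUNT(*)                                                 AS raw_rows,
  COUNT(DISTINCT JSON_EXTRACT(CAST(data AS CHAR), '$.id')) AS distinct_job_ids
FROM _raw_gitlab_api_job;
```

If `raw_rows` is noticeably higher than `distinct_job_ids`, that would point to the same job being collected more than once rather than to extra jobs on the GitLab side.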
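On the compression idea, even a quick in-database estimate hints at the potential savings. This is purely illustrative, using MySQL's built-in zlib wrapper over a sample of rows; it's only meant to size the opportunity, not to suggest how DevLake itself should implement compression:

```sql
-- Estimate average payload size before and after zlib compression on a sample
-- of raw job rows (the 10k limit is arbitrary, just to keep the scan cheap).
SELECT
  ROUND(AVG(LENGTH(data)))           AS avg_raw_bytes,
  ROUND(AVG(LENGTH(COMPRESS(data)))) AS avg_zlib_bytes
FROM (SELECT data FROM _raw_gitlab_api_job LIMIT 10000) AS sample;
```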
