[I] [Bug][Jira] _raw_jira_api_epics table accumulates duplicate data across runs, causing extractEpics subtask performance degradation [incubator-devlake]

via GitHub Sun, 27 Apr 2025 23:19:09 -0700


narrowizard opened a new issue, #8409:
URL: https://github.com/apache/incubator-devlake/issues/8409


   ### Search before asking
   
   - [x] I have searched the 
[issues](https://github.com/apache/incubator-devlake/issues?q=is%3Aissue) and 
found no similar issues.
   
   
   ### What happened
   
   When running data collection pipelines for Jira, we observed that the 
`_raw_jira_api_epics` table in the DevLake database continuously accumulates 
duplicate entries for the same Jira epics across multiple collection runs. Each 
subsequent successful collection run adds a new batch of raw data for epics, 
but the older, seemingly identical data for the same epics from previous runs 
is not removed or updated.
   
   This unchecked growth of duplicate raw data in `_raw_jira_api_epics` is 
causing a significant performance issue. The `extractEpics` subtask, which 
presumably processes this raw data, takes longer to complete with each 
collection run because of the growing volume of redundant data it has to handle.
   
   ### What do you expect to happen
   
   We expect DevLake to manage the data in the `_raw_jira_api_epics` table in a 
way that prevents the indefinite accumulation of identical duplicate records 
across collection runs for the same source data.
   
   Ideally, on subsequent collection runs for the same Jira connection and 
boards:
   
   1. The system should avoid inserting data that is an exact duplicate of what 
is already present for a given epic.
   2. Alternatively, old raw data for epics could be replaced or purged before 
or after inserting fresh data, ensuring the raw table doesn't grow indefinitely 
with duplicates.
   
   Preventing this accumulation of duplicates in the raw table should resolve 
the observed performance degradation and reduce the execution time of the 
`extractEpics` subtask to a consistent level.
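
   The two options above could look roughly like the following. This is only 
a sketch against an in-memory SQLite stand-in, not DevLake's actual code: the 
column names (`params`, `url`, `data`) and the helper names `make_db`, 
`insert_if_new`, and `purge_superseded` are assumptions made for illustration.

```python
import sqlite3

TABLE = "_raw_jira_api_epics"


def make_db() -> sqlite3.Connection:
    """Create an in-memory stand-in for the raw table (assumed columns)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        f"CREATE TABLE {TABLE} ("
        "id INTEGER PRIMARY KEY AUTOINCREMENT, params TEXT, url TEXT, data TEXT)"
    )
    return conn


def insert_if_new(conn: sqlite3.Connection, params: str, url: str, data: str) -> bool:
    """Option 1: skip the insert when an identical row already exists."""
    exists = conn.execute(
        f"SELECT 1 FROM {TABLE} WHERE params = ? AND url = ? AND data = ? LIMIT 1",
        (params, url, data),
    ).fetchone()
    if exists:
        return False
    conn.execute(
        f"INSERT INTO {TABLE} (params, url, data) VALUES (?, ?, ?)",
        (params, url, data),
    )
    conn.commit()
    return True


def purge_superseded(conn: sqlite3.Connection) -> int:
    """Option 2: keep only the newest row per (params, url), delete the rest."""
    cur = conn.execute(
        f"DELETE FROM {TABLE} WHERE id NOT IN ("
        f"SELECT MAX(id) FROM {TABLE} GROUP BY params, url)"
    )
    conn.commit()
    return cur.rowcount  # number of rows removed
```

   Either approach keeps the row count per epic bounded, so the amount of raw 
data handed to `extractEpics` would no longer depend on how many times the 
pipeline has run.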
   
   ### How to reproduce
   
   1. Set up an Apache DevLake instance.
   2. Configure a data connection to a Jira instance that contains some epics.
   3. Create and run a data collection pipeline using the configured Jira 
connection for one or more boards containing epics.
   4. After the first run completes successfully, trigger and run the same 
DevLake collection pipeline for the same Jira connection and boards again.
   5. Repeat step 4 multiple times (e.g., 2-3 more times).
   6. Observe the execution time of the `extractEpics` subtask in the later 
runs compared to the first run; it should show a noticeable increase.
   7. Inspect the contents of the `_raw_jira_api_epics` table in the DevLake 
database after multiple runs. You should find multiple rows with identical 
content (representing the same Jira epic, e.g., identified by the same URL), 
confirming the presence of duplicate data.
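
   For step 7, a grouped count is one way to confirm the duplicates. The 
sketch below runs against an in-memory SQLite table so it is self-contained; 
the column names (`params`, `url`, `data`), the example URLs, and the helper 
`find_duplicates` are assumptions for illustration, and on a real instance the 
same `GROUP BY ... HAVING COUNT(*) > 1` query would be run against the DevLake 
database instead.

```python
import sqlite3

# In-memory stand-in for the DevLake database (assumed column layout).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE _raw_jira_api_epics ("
    "id INTEGER PRIMARY KEY AUTOINCREMENT, params TEXT, url TEXT, data TEXT)"
)

# Simulate three collection runs that each re-insert the same two epics.
params = '{"ConnectionId":1,"BoardId":1}'  # placeholder value
for _run in range(3):
    for epic_id in (1, 2):
        url = f"https://jira.example.com/rest/agile/1.0/epic/{epic_id}"  # placeholder URL
        data = f'{{"id": {epic_id}, "name": "Epic {epic_id}"}}'
        conn.execute(
            "INSERT INTO _raw_jira_api_epics (params, url, data) VALUES (?, ?, ?)",
            (params, url, data),
        )
conn.commit()


def find_duplicates(conn: sqlite3.Connection) -> list:
    """Return (url, copies) for every row content stored more than once."""
    return conn.execute(
        "SELECT url, COUNT(*) AS copies FROM _raw_jira_api_epics "
        "GROUP BY params, url, data HAVING COUNT(*) > 1 ORDER BY url"
    ).fetchall()


for url, copies in find_duplicates(conn):
    print(f"{url}: {copies} identical rows")
```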
   
   ### Anything else
   
   _No response_
   
   ### Version
   
   main
   
   ### Are you willing to submit PR?
   
   - [ ] Yes I am willing to submit a PR!
   
   ### Code of Conduct
   
   - [x] I agree to follow this project's [Code of 
Conduct](https://www.apache.org/foundation/policies/conduct)
   


