Nickcw6 opened a new issue, #7797:
URL: https://github.com/apache/incubator-devlake/issues/7797

   ### Search before asking
   
   - [X] I had searched in the 
[issues](https://github.com/apache/incubator-devlake/issues?q=is%3Aissue) and 
found no similar issues.
   
   
   ### What happened
   
   When collecting CircleCI pipelines, the time range specified in the sync 
policy has no effect on the data collected: pipelines created before the 
specified start date are still collected.
   
   E.g. Sync policy settings set to collect from 1st June 2024:
   ![Screenshot 2024-07-30 at 16 34 
46](https://github.com/user-attachments/assets/610cf176-34ed-4a08-b457-3496837c26a5)
   
   ![Screenshot 2024-07-30 at 16 53 
41](https://github.com/user-attachments/assets/1bc898e1-8424-4de0-9612-999a6d39a7d6)
   
   Excerpt of the `data` JSON blob from the top row. It has `created_at` and 
`updated_at` of 1st Feb 2024 (i.e. 180 days before today's date, 2024-07-30):
   
   ```
   {
       "id" : "eae60b4c-7dcc-4293-8b00-45f18a494881",
       "updated_at" : "2024-02-01T14:11:31.988Z",
       "created_at" : "2024-02-01T14:11:31.988Z",
       ...
       "trigger" : {
         "received_at" : "2024-02-01T14:11:31.481Z",
         "type" : "webhook",
       ...
       },
       ...
     }
   ```
   
   
   
   
   ### What do you expect to happen
   
   No CircleCI pipelines, workflows or jobs are collected from before the time 
range start point.
   
   ### How to reproduce
   
   1. Set the sync policy time range to start at a date that falls within the 
CircleCI data retention period (e.g. if the retention period is 90 days, set 
the start date to 30 days ago).
   2. Run the DevLake pipeline
   3. Sort the `_raw_circleci_api_pipelines` table by the `created_at` JSON 
property of the `data` column:
   ```
   SELECT *
   FROM _raw_circleci_api_pipelines
   ORDER BY STR_TO_DATE(JSON_UNQUOTE(JSON_EXTRACT(CONVERT(data USING utf8mb4), 
'$.created_at')), '%Y-%m-%dT%H:%i:%s.%fZ') ASC;
   ```
   4. Compare the `created_at` property to that set in the sync policy time 
range.
   
   ### Anything else
   
   This is in part due to the recent pagination fix on the plugin (#7770). 
Pagination now works, but because the CircleCI API does not offer any date 
range controls, the collector loops through the pages until `next_page_token` 
is `null`, which only happens once the data retention limit for the account is 
reached (e.g. for me it is 180 days, but it could be less or more, see 
[here](https://support.circleci.com/hc/en-us/articles/5645222646939-Data-retention-policy)).
   
   When the collector subsequently attempts to fetch the workflows & jobs for 
these pipelines, the API returns a 404 for any pipeline that falls outside of 
the data retention range, which fails the DevLake pipeline:
   
   ```
   subtask collectWorkflows ended unexpectedly Wraps: (2) Retry exceeded 3 
times calling /v2/pipeline/6b7c4513-56bd-4e0c-ad72-d562df7513b1/workflow. The 
last error was: Http DoAsync error calling [method:GET 
path:/v2/pipeline/6b7c4513-56bd-4e0c-ad72-d562df7513b1/workflow query:map[]]. 
Response: {:message "Pipeline not found"} (404) Error types: (1) 
*hintdetail.withDetail (2) *errors.errorString
   ```
   
   
   There needs to be an additional check that the `created_at` property of the 
returned pipelines is not before the specified time range starting point. 
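
A minimal sketch of such a check, written in Go since that is the language of the plugin. The `rawPipeline` type and the `filterSince` and `mustParse` helpers are hypothetical names for illustration, not existing DevLake code; only the `created_at` value of the first entry comes from the excerpt above.

```go
package main

import (
	"fmt"
	"time"
)

// rawPipeline is a hypothetical type holding only the fields this sketch
// needs; CreatedAt corresponds to the `created_at` JSON property.
type rawPipeline struct {
	ID        string
	CreatedAt time.Time
}

// filterSince keeps the pipelines whose CreatedAt is not before start.
// A pipeline created exactly at start is kept.
func filterSince(pipelines []rawPipeline, start time.Time) []rawPipeline {
	kept := make([]rawPipeline, 0, len(pipelines))
	for _, p := range pipelines {
		if !p.CreatedAt.Before(start) {
			kept = append(kept, p)
		}
	}
	return kept
}

// mustParse parses an RFC 3339 timestamp such as the ones in the CircleCI
// response and panics on malformed input (acceptable for a sketch only).
func mustParse(s string) time.Time {
	t, err := time.Parse(time.RFC3339, s)
	if err != nil {
		panic(err)
	}
	return t
}

func main() {
	// Sync policy start date from the example in this issue.
	start := mustParse("2024-06-01T00:00:00Z")
	pipelines := []rawPipeline{
		{ID: "eae60b4c-7dcc-4293-8b00-45f18a494881", CreatedAt: mustParse("2024-02-01T14:11:31.988Z")},
		// Made-up entry that falls inside the time range.
		{ID: "hypothetical-recent-pipeline", CreatedAt: mustParse("2024-07-01T09:00:00Z")},
	}
	for _, p := range filterSince(pipelines, start) {
		fmt.Println(p.ID, p.CreatedAt.Format(time.RFC3339))
	}
}
```

Dropping out-of-range pipelines before the workflow and job subtasks run would also avoid requesting workflows for them, which is where the 404 above comes from.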
   
   ### Version
   
   44c3ecb8194d008dd4e3914bb534d2a26eeefdae
   
   ### Are you willing to submit PR?
   
   - [X] Yes I am willing to submit a PR!
   
   ### Code of Conduct
   
   - [X] I agree to follow this project's [Code of 
Conduct](https://www.apache.org/foundation/policies/conduct)
   

