Nickcw6 opened a new issue, #7797: URL: https://github.com/apache/incubator-devlake/issues/7797
### Search before asking

- [X] I had searched in the [issues](https://github.com/apache/incubator-devlake/issues?q=is%3Aissue) and found no similar issues.

### What happened

When collecting CircleCI pipelines, the time range specified in the sync policy has no effect on the data collected: pipelines are collected from before the specified date.

E.g. sync policy settings set to collect from 1st June 2024.

Excerpt of the `data` JSON blob from the top row, which has a `created_at` and `updated_at` of 1st Feb 2024 (i.e. 180 days before today's date, 2024-07-30):

```
{
  "id" : "eae60b4c-7dcc-4293-8b00-45f18a494881",
  "updated_at" : "2024-02-01T14:11:31.988Z",
  "created_at" : "2024-02-01T14:11:31.988Z",
  ...
  "trigger" : {
    "received_at" : "2024-02-01T14:11:31.481Z",
    "type" : "webhook",
    ...
  },
  ...
}
```

### What do you expect to happen

No CircleCI pipelines, workflows or jobs are collected from before the time range start point.

### How to reproduce

1. Set the sync policy time range to start later than the start of your CircleCI data retention window (e.g. if the retention period is 90 days, set the time range to the last 30 days).
2. Run the DevLake pipeline.
3. Sort the `_raw_circleci_api_pipelines` table by the `created_at` JSON property of the `data` column:

```
SELECT *
FROM _raw_circleci_api_pipelines
ORDER BY STR_TO_DATE(
  JSON_UNQUOTE(JSON_EXTRACT(CONVERT(data USING utf8mb4), '$.created_at')),
  '%Y-%m-%dT%H:%i:%s.%fZ'
) ASC;
```

4. Compare the earliest `created_at` value to the start date set in the sync policy time range.

### Anything else

This is in part due to the recent pagination fix on the plugin (#7770). The pagination works, but as the CircleCI API does not offer any date range pagination controls, the collector now loops through the pages until `next_page_token` is `null`, which happens whenever the data retention limit is hit for the account (e.g. for me it is 180 days, but it could be less or more, see [here](https://support.circleci.com/hc/en-us/articles/5645222646939-Data-retention-policy)).
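As an illustration of how the pagination loop could be bounded by the sync policy start date, here is a minimal, language-agnostic sketch in Python. It is not DevLake's actual Go collector: `collect_pipelines`, `fetch_page` and `since` are hypothetical names, and the early stop assumes the API returns pipelines newest first. If that ordering does not hold, older items would need to be skipped rather than used as a stop signal.

```python
from datetime import datetime, timezone
from typing import Callable, Iterator, Optional


def parse_ts(value: str) -> datetime:
    # Timestamps in the issue look like "2024-02-01T14:11:31.988Z".
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def collect_pipelines(
    fetch_page: Callable[[Optional[str]], dict],  # hypothetical page fetcher
    since: datetime,                              # sync policy start (assumed UTC-aware)
) -> Iterator[dict]:
    """Yield pipelines created at or after `since`, then stop paginating."""
    page_token: Optional[str] = None
    while True:
        page = fetch_page(page_token)
        for item in page["items"]:
            if parse_ts(item["created_at"]) < since:
                # Assumption: results are newest first, so everything after
                # this item is older too and no further pages are needed.
                return
            yield item
        page_token = page.get("next_page_token")
        if not page_token:
            # No more pages: this is where the loop currently ends.
            return
```

With a cutoff like this, pipelines older than the time range never reach the raw table, so the later workflow and job collectors would not request them.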
When DevLake subsequently attempts to collect the workflows and jobs for these pipelines, the CircleCI API returns a 404 for any pipeline that falls outside the data retention range, which causes the DevLake pipeline to fail:

```
subtask collectWorkflows ended unexpectedly
Wraps: (2) Retry exceeded 3 times calling /v2/pipeline/6b7c4513-56bd-4e0c-ad72-d562df7513b1/workflow. The last error was: Http DoAsync error calling [method:GET path:/v2/pipeline/6b7c4513-56bd-4e0c-ad72-d562df7513b1/workflow query:map[]]. Response: {:message "Pipeline not found"} (404)
Error types: (1) *hintdetail.withDetail (2) *errors.errorString
```

There needs to be an additional check that the `created_at` property of the returned pipelines is not before the specified time range starting point.

### Version

44c3ecb8194d008dd4e3914bb534d2a26eeefdae

### Are you willing to submit PR?

- [X] Yes I am willing to submit a PR!

### Code of Conduct

- [X] I agree to follow this project's [Code of Conduct](https://www.apache.org/foundation/policies/conduct)