Repository: incubator-griffin-site Updated Branches: refs/heads/master 37962fb59 -> 65f78a8e1
user story init Project: http://git-wip-us.apache.org/repos/asf/incubator-griffin-site/repo Commit: http://git-wip-us.apache.org/repos/asf/incubator-griffin-site/commit/65f78a8e Tree: http://git-wip-us.apache.org/repos/asf/incubator-griffin-site/tree/65f78a8e Diff: http://git-wip-us.apache.org/repos/asf/incubator-griffin-site/diff/65f78a8e Branch: refs/heads/master Commit: 65f78a8e10d678d52abd4981bc0d4513613063fc Parents: 37962fb Author: William Guo <[email protected]> Authored: Fri May 4 14:10:38 2018 +0800 Committer: William Guo <[email protected]> Committed: Fri May 4 14:10:38 2018 +0800 ---------------------------------------------------------------------- source/_posts/userstory.md | 76 +++++++++++++++++++++++++++++++++++++++++ 1 file changed, 76 insertions(+) ---------------------------------------------------------------------- http://git-wip-us.apache.org/repos/asf/incubator-griffin-site/blob/65f78a8e/source/_posts/userstory.md ---------------------------------------------------------------------- diff --git a/source/_posts/userstory.md b/source/_posts/userstory.md new file mode 100644 index 0000000..e57dab2 --- /dev/null +++ b/source/_posts/userstory.md @@ -0,0 +1,76 @@ +--- +title: User Story +date: 2017-03-02 13:00:45 +tags: +--- + +# Crawler + + > The Crawler produces data according to passed in seed urls and extraction configuration. + > There could be a couple of dimensions of Data Quality which should be highlighted: + + * Data Completeness + * Data Freshness + +## Data Completeness(accuracy) + +### Problem + +Crawler needs a set of metrics to evaluate its data completeness. +Data Completeness is used to indicate whether every URL fed into crawler has its valid response in a reasonable duration. + * if an URL has its corresponding response, called mapped URL. + * Data Completeness is a ratio: the number unmapped URL compared with total number of URLs in a certain time window. + * The time window here is the time window of input, not the duration mentioned above. + +### Challenge + +The input/output could be more complicated in crawler: + * Batch & Streaming & batch-in-streaming-out: the input/output could be in HDFS & Kafka & Mongo + * Normal Crawling & Depth Crawling + + | normal crawling | depth crawling + | --------------- | -------------- +Batch | | +streaming | â | â +batch-streaming | | + +### Trouble Shooting + +Griffin will mark and record all the unmatched rows, so that the further trouble shooting could start with this. +Get back to streaming case, the retention time have to be extended so that trouble shooting can go with it. + +### Solution + +#### Streaming + normal crawling +In such a case, all the input URLs are submitted into Kafka, and all the responses are written back to Kafka (different topic). +So the Data Completeness of this case is pretty easy to define: + +1. for a given time window, +2. every URL submitted into kafka +3. should be able to have an response in a certain duration. +4. The ratio: number_of_URLs_without_response / number_of_all_URLs + +#### Griffin will work in this way: +1. create 2 connector, input/output consumer, which read data from kafka +2. in a certain time window, for any given URL of input, + * griffin searches the corresponding response in output topic of kafka + * if found, pick next one + * if not found, try again later, until a given time duration is out. + * calculate the ratio, markdown the unmapped URLs +3. output consumer will persist the response data to HDFS. This is to optimize the memory usage, since we donât want to keep a giant number of data in memory. +4. push the DQ results back to griffin DB, so that the service can offer retrieve service, and front end can show the metrics. + +#### Crawler integration + +Crawler should have its own dashboard, which call the griffin service and render the metrics. + + +DQ framework: Leveraging Apache Griffin - data quality solution for validating streaming data. Open sourced by Data Services CCOE team on 2017 Jan. +End-to-end data completeness metrics for crawler streaming jobs. Itâs calculated against input urls and corresponding output for each steaming job. +dq=#(output) / #(input) * 100% Output is: user receives output, Input is: the number of urls which user submitted Our Target DQ is: 95% + +Griffin compares input data with output data dumped from Kafka topics every 10 seconds, if the data is found, we mark it as matched data, if not found even after 24 hours, we mark it as unmatched data and stops the comparison for these input. + + + +
