[
https://issues.apache.org/jira/browse/BEAM-2728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16304089#comment-16304089
]
ASF GitHub Bot commented on BEAM-2728:
--------------------------------------
ArnaudFnr opened a new pull request #4328: [BEAM-2728] Add Count-Min Sketch in
sketching extension
URL: https://github.com/apache/beam/pull/4328
This pull request adds PTransforms to estimate the frequency of elements in
a stream using Stream-Lib's Count-Min sketch implementation.
The SketchFrequencies class is designed in the same manner as
ApproximateDistinct. It takes any encodable input and outputs a Count-Min
Sketch that can be queried.
Some points to discuss in particular:
- Stream-Lib's Count-Min sketch implementation is wrapped in an inner
class "Sketch", so the user doesn't have to pull in the Stream-Lib dependency.
- The elements are hashed using Google's 128-bit MurmurHash because
Stream-Lib's Count-Min sketch only supports Long or String types. This could be
simplified by using the Spark-sketch implementation, which also supports byte
arrays as an input type, but it would pull in more dependencies.
- The user has to provide the element coder in order to query an element's
estimated frequency from the resulting sketch (see the estimateCount() method
in the Sketch class).
This could be avoided if the coder were stored as an attribute of the inner
Sketch, but in that case it would have to be serialized by the Sketch coder.
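To make the points above concrete, here is a minimal, dependency-free sketch of the idea: elements are first encoded to bytes, the bytes are hashed to a long, and the long indexes one counter per row of a Count-Min table. This is an illustration only, not the code in this pull request. The class name `CountMinSketchDemo`, the `encode` helper (UTF-8 standing in for a Beam coder), and the FNV-1a hash (standing in for the 128-bit MurmurHash) are all assumptions made for the example.

```java
import java.nio.charset.StandardCharsets;

/** Illustrative Count-Min sketch over byte-encoded elements (hypothetical, not the PR's class). */
final class CountMinSketchDemo {
    private final int depth;      // number of hash rows
    private final int width;      // counters per row
    private final long[][] table; // depth x width counters

    CountMinSketchDemo(int depth, int width) {
        if (depth <= 0 || width <= 0) {
            throw new IllegalArgumentException("depth and width must be positive");
        }
        this.depth = depth;
        this.width = width;
        this.table = new long[depth][width];
    }

    /** Stand-in for a coder: turns an element into bytes (assumption: UTF-8 strings). */
    static byte[] encode(String element) {
        return element.getBytes(StandardCharsets.UTF_8);
    }

    /** 64-bit FNV-1a over the encoded bytes; a stand-in for the 128-bit MurmurHash. */
    static long hash64(byte[] data) {
        long h = 0xcbf29ce484222325L;
        for (byte b : data) {
            h ^= (b & 0xff);
            h *= 0x100000001b3L;
        }
        return h;
    }

    /** Derives the counter index for one row by mixing the item hash with a per-row offset. */
    private int bucket(long itemHash, int row) {
        long h = itemHash + (row + 1) * 0x9E3779B97F4A7C15L;
        h ^= h >>> 33;
        h *= 0xff51afd7ed558ccdL;
        h ^= h >>> 33;
        return (int) Math.floorMod(h, (long) width);
    }

    /** Adds {@code count} occurrences of the encoded element. */
    void add(byte[] encoded, long count) {
        if (count < 0) {
            throw new IllegalArgumentException("count must be non-negative");
        }
        long h = hash64(encoded);
        for (int row = 0; row < depth; row++) {
            table[row][bucket(h, row)] += count;
        }
    }

    /** Returns the minimum counter across rows; it never underestimates the true count. */
    long estimateCount(byte[] encoded) {
        long h = hash64(encoded);
        long min = Long.MAX_VALUE;
        for (int row = 0; row < depth; row++) {
            min = Math.min(min, table[row][bucket(h, row)]);
        }
        return min;
    }

    /** Merges another sketch of the same shape, as a combiner would do across bundles. */
    void merge(CountMinSketchDemo other) {
        if (other.depth != depth || other.width != width) {
            throw new IllegalArgumentException("sketch dimensions differ");
        }
        for (int row = 0; row < depth; row++) {
            for (int col = 0; col < width; col++) {
                table[row][col] += other.table[row][col];
            }
        }
    }

    public static void main(String[] args) {
        CountMinSketchDemo sketch = new CountMinSketchDemo(4, 256);
        for (int i = 0; i < 3; i++) {
            sketch.add(encode("a"), 1);
        }
        sketch.add(encode("b"), 1);
        System.out.println("estimate for a: " + sketch.estimateCount(encode("a")));
        System.out.println("estimate for b: " + sketch.estimateCount(encode("b")));
    }
}
```

Note that the caller must encode the queried element exactly as it was encoded on insertion, which is why the coder has to be supplied at query time unless it is stored with the sketch.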
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]
> Extension for sketch-based statistics
> -------------------------------------
>
> Key: BEAM-2728
> URL: https://issues.apache.org/jira/browse/BEAM-2728
> Project: Beam
> Issue Type: New Feature
> Components: sdk-java-extensions
> Reporter: Arnaud Fournier
> Assignee: Arnaud Fournier
> Priority: Minor
>
> Goal: Provide an extension library to compute approximate statistics on
> streams.
> Interest: Probabilistic data structures can build an approximation (sketch)
> of the current state of a stream without storing every element. Instead, they
> process each observation quickly to summarize that state and extract
> useful statistical insights.
> Implementation is here:
> https://github.com/ArnaudFnr/beam/tree/sketching/sdks/java/extensions/sketching
> More info:
> https://docs.google.com/document/d/1Xy6g5RPBYX_HadpIr_2WrUeusiwL0Jo2ACI5PEOP1kc/edit
--
This message was sent by Atlassian JIRA
(v6.4.14#64029)