[ 
https://issues.apache.org/jira/browse/BEAM-2728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16304089#comment-16304089
 ] 

ASF GitHub Bot commented on BEAM-2728:
--------------------------------------

ArnaudFnr opened a new pull request #4328: [BEAM-2728] Add Count-Min Sketch in 
sketching extension
URL: https://github.com/apache/beam/pull/4328
 
 
   This pull request adds PTransforms to estimate the frequency of elements in 
a stream using Stream-Lib's Count-Min sketch implementation. 
   The SketchFrequencies class is designed in the same manner as 
ApproximateDistinct. It takes any encodable input and outputs a Count-Min 
Sketch that can be queried.
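
   For readers unfamiliar with the data structure, the idea behind a Count-Min sketch can be illustrated with a minimal, dependency-free sketch in plain Java. This is not Stream-Lib's implementation and not the code in this pull request; the class name `MiniCountMin`, its constructor parameters, and the SplitMix64-style mixing function are all assumptions made for the example.

```java
import java.util.Random;

/** Illustrative Count-Min sketch over long keys (hypothetical, not Stream-Lib's class). */
final class MiniCountMin {
    private final int width;       // number of counters per row
    private final long[][] table;  // one row of counters per hash function
    private final long[] seeds;    // one seed per row

    MiniCountMin(int depth, int width, long seed) {
        this.width = width;
        this.table = new long[depth][width];
        this.seeds = new long[depth];
        Random rnd = new Random(seed);
        for (int row = 0; row < depth; row++) {
            seeds[row] = rnd.nextLong();
        }
    }

    /** Maps an item to a counter index in the given row (SplitMix64-style mixing). */
    private int bucket(long item, int row) {
        long h = item ^ seeds[row];
        h = (h ^ (h >>> 30)) * 0xBF58476D1CE4E5B9L;
        h = (h ^ (h >>> 27)) * 0x94D049BB133111EBL;
        h ^= h >>> 31;
        return (int) Long.remainderUnsigned(h, width);
    }

    /** Adds `count` occurrences of `item` by incrementing one counter in every row. */
    void add(long item, long count) {
        for (int row = 0; row < table.length; row++) {
            table[row][bucket(item, row)] += count;
        }
    }

    /** Returns the minimum counter over all rows; with non-negative counts it never underestimates. */
    long estimateCount(long item) {
        long min = Long.MAX_VALUE;
        for (int row = 0; row < table.length; row++) {
            min = Math.min(min, table[row][bucket(item, row)]);
        }
        return min;
    }
}
```

   Because collisions can only inflate a counter, the estimate is an upper bound on the true frequency; taking the minimum across rows limits how much a single collision can distort it.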
   
   Some points to discuss in particular:
   
   - Stream-Lib's Count-Min sketch implementation is wrapped in an inner 
class "Sketch", so the user doesn't have to pull in the Stream-Lib dependency.
   
   - The elements are hashed with Google's 128-bit MurmurHash because 
Stream-Lib's Count-Min sketch only supports Long or String types. This could be 
simplified by using the Spark sketch implementation, which also supports byte 
arrays as input, but it would pull in more dependencies.
   
   - The user has to provide the element coder in order to query an element's 
estimated frequency from the resulting sketch (see the estimateCount() method 
in the Sketch class).
   This could be avoided if the coder were stored as an attribute of the inner 
Sketch, but in that case it would have to be serialized by the Sketch coder.
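
   The last two points can be sketched together as follows, under stated assumptions. A `java.util.function.Function<T, byte[]>` stands in for a Beam coder, a 64-bit FNV-1a hash stands in for the 128-bit MurmurHash, and an exact `HashMap` stands in for the long-keyed Count-Min sketch. The class `HashedFrequencies` and its methods are hypothetical and only show why the same encoding must be supplied again at query time.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

/** Hypothetical illustration: elements are encoded to bytes, hashed to a long, then counted. */
final class HashedFrequencies {
    // Stand-in for a sketch that only accepts long keys.
    private final Map<Long, Long> counts = new HashMap<>();

    /** 64-bit FNV-1a, used here in place of MurmurHash to avoid external dependencies. */
    static long fnv1a64(byte[] bytes) {
        long h = 0xcbf29ce484222325L;   // FNV offset basis
        for (byte b : bytes) {
            h ^= (b & 0xff);
            h *= 0x100000001b3L;        // FNV prime
        }
        return h;
    }

    /** Encodes the element, hashes the bytes, and increments the counter for that hash. */
    <T> void add(T element, Function<T, byte[]> encoder) {
        counts.merge(fnv1a64(encoder.apply(element)), 1L, Long::sum);
    }

    /** The caller must pass the same encoder used in add(), since only hashes are stored. */
    <T> long estimateCount(T element, Function<T, byte[]> encoder) {
        return counts.getOrDefault(fnv1a64(encoder.apply(element)), 0L);
    }
}
```

   Storing the encoder inside the structure would remove the extra argument from `estimateCount`, at the cost of having to serialize the encoder along with the counters, which is the trade-off described above.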
   
   

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[email protected]


> Extension for sketch-based statistics
> -------------------------------------
>
>                 Key: BEAM-2728
>                 URL: https://issues.apache.org/jira/browse/BEAM-2728
>             Project: Beam
>          Issue Type: New Feature
>          Components: sdk-java-extensions
>            Reporter: Arnaud Fournier
>            Assignee: Arnaud Fournier
>            Priority: Minor
>
> Goal: Provide an extension library to compute approximate statistics on 
> streams.
> Interest: Probabilistic data structures build an approximation (sketch) of 
> the current state of a stream without storing every element. Instead, each 
> observation is processed quickly to update a summary from which useful 
> statistics can be derived.
> Implementation is here: 
> https://github.com/ArnaudFnr/beam/tree/sketching/sdks/java/extensions/sketching
> More info: 
> https://docs.google.com/document/d/1Xy6g5RPBYX_HadpIr_2WrUeusiwL0Jo2ACI5PEOP1kc/edit



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
