[jira] [Commented] (BEAM-1439) Beam Example(s) exploring public document datasets

SungJunyoung (JIRA) Sun, 19 Mar 2017 18:58:35 -0700

    [ 
https://issues.apache.org/jira/browse/BEAM-1439?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15932063#comment-15932063
 ]


SungJunyoung commented on BEAM-1439:
------------------------------------

The current Beam example counts word occurrences in Shakespeare's works. This
is, of course, a good illustration of how basic pipeline construction works in
Beam. However, the data is static, so the example does not show how Beam
handles streaming data. What about adding examples that read streaming data
from a source such as Kafka or Spark? For example, you could write your
computer's input log to Kafka, read it into a Beam pipeline, and then compute
statistics on your input habits. What do you think?
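
As a rough sketch of the kind of statistic I have in mind, here is key-press
counting per fixed time window. This is plain Python, not Beam or Kafka API
calls; the function name, the sample events, and the 60-second window are all
assumptions made only for illustration.

```python
from collections import Counter, defaultdict

WINDOW_SECONDS = 60  # assumed window size, chosen only for illustration


def count_keys_per_window(events, window_seconds=WINDOW_SECONDS):
    """Group (timestamp_seconds, key) events into fixed windows and count keys.

    Returns a dict mapping each window's start time to a Counter of keys.
    """
    counts = defaultdict(Counter)
    for timestamp, key in events:
        # A fixed window is identified by its start time.
        window_start = int(timestamp // window_seconds) * window_seconds
        counts[window_start][key] += 1
    return dict(counts)


# Hypothetical input log: (seconds since start, key pressed).
sample_events = [(1, "a"), (5, "b"), (59, "a"), (61, "a"), (130, "enter")]
per_window = count_keys_per_window(sample_events)
for start in sorted(per_window):
    print(start, per_window[start].most_common())
```

In a real Beam pipeline the events would arrive from a streaming source and
the grouping would be done by Beam's windowing rather than by hand.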

Of course, I will keep thinking about ideas for large-scale pipelines in
parallel, just like **Beam** :).

> Beam Example(s) exploring public document datasets
> --------------------------------------------------
>
>                 Key: BEAM-1439
>                 URL: https://issues.apache.org/jira/browse/BEAM-1439
>             Project: Beam
>          Issue Type: Wish
>          Components: examples-java
>            Reporter: Kenneth Knowles
>            Assignee: Kenneth Knowles
>            Priority: Minor
>              Labels: gsoc2017, java, mentor, python
>
> In Beam, we have examples illustrating counting the occurrences of words and
> performing a basic TF-IDF analysis on the works of Shakespeare (or whatever
> you point it at). It would be even cooler to do these analyses, and more, on
> a much larger data set that is really the subject of current investigations.
>
> In chatting with professors at the University of Washington, I've learned
> that scholars of many fields would really like to explore new and highly
> customized ways of processing the growing body of publicly available
> scholarly documents, such as PubMed Central, with queries like "show me
> documents where chemical compounds X and Y were both used in the 'method'
> section".
>
> So I propose a Google Summer of Code project wherein a student writes some
> large-scale Beam pipelines to perform analyses such as term frequency, bigram
> frequency, etc.
>
> Skills required:
>  - Java or Python
>  - (nice to have) Working through the Beam getting started materials



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
