Kenneth Knowles created BEAM-1439:
-------------------------------------

             Summary: Beam Example(s) exploring public document datasets
                 Key: BEAM-1439
                 URL: https://issues.apache.org/jira/browse/BEAM-1439
             Project: Beam
          Issue Type: Wish
          Components: examples-java
            Reporter: Kenneth Knowles
            Assignee: Kenneth Knowles
            Priority: Minor


In Beam, we have examples that count the occurrences of words and perform a
basic TF-IDF analysis on the works of Shakespeare (or whatever you point them
at). It would be even cooler to run these analyses, and more, on a much larger
data set that is the subject of current research.

In chatting with professors at the University of Washington, I've learned that 
scholars of many fields would really like to explore new and highly customized 
ways of processing the growing body of publicly-available scholarly documents, 
such as PubMed Central. They want to run queries like "show me documents where
chemical compounds X and Y were both used in the 'method' section".
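
As a rough illustration of that kind of query, here is a minimal Python sketch
of the per-document check. The document layout (a dict with an "id" and a
"sections" mapping), the function names, and the plain substring matching are
all assumptions made for illustration; real PubMed Central articles would
first need to be parsed into something like this.

```python
def mentions_all_in_section(document, section, terms):
    """Return True if every term appears in the named section of the document.

    `document` is a hypothetical dict: {"id": str, "sections": {name: text}}.
    Matching is a naive case-insensitive substring test, not real
    chemical-name recognition.
    """
    text = document.get("sections", {}).get(section, "").lower()
    return all(term.lower() in text for term in terms)


def matching_ids(documents, section, terms):
    """Return the ids of all documents that pass the check above."""
    return [
        doc["id"]
        for doc in documents
        if mentions_all_in_section(doc, section, terms)
    ]
```

In a pipeline, a predicate like this would be the body of a filtering step
applied to each parsed document.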

So I propose a Google Summer of Code project wherein a student writes some 
large-scale Beam pipelines to perform analyses such as term frequency, bigram 
frequency, etc.
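
To make the two named analyses concrete, here is a small sketch of term and
bigram counting in plain Python, using only the standard library. The
tokenizer regex and the function names are assumptions chosen for
illustration; this is not a Beam pipeline, only the per-document logic that
such a pipeline would distribute.

```python
import re
from collections import Counter

# Assumed tokenizer: lowercase runs of letters, digits and apostrophes.
TOKEN_RE = re.compile(r"[a-z0-9']+")


def tokenize(text):
    """Split text into lowercase tokens."""
    return TOKEN_RE.findall(text.lower())


def bigrams(tokens):
    """Return adjacent token pairs from a single document."""
    return list(zip(tokens, tokens[1:]))


def term_and_bigram_counts(documents):
    """Count terms and bigrams over an iterable of document strings.

    Bigrams are formed within each document only, so no pair spans
    the boundary between two documents.
    """
    term_counts = Counter()
    bigram_counts = Counter()
    for text in documents:
        tokens = tokenize(text)
        term_counts.update(tokens)
        bigram_counts.update(bigrams(tokens))
    return term_counts, bigram_counts
```

In a Beam pipeline the tokenizing and pairing steps would likely become
per-element transforms followed by a per-key count, but that wiring is left
out of this sketch.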

Skills required:
 - Java or Python
 - (nice to have) Having worked through the Beam getting started materials



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
