GitHub user chenlica created a discussion: CS290: Text Analytics in the Big Data Era
The content is from https://github.com/apache/texera/wiki/CS290-Spring-2017 (may be dangling)

======

# CS290: Text Analytics in the Big Data Era

Spring 2017, Department of Computer Science, UC Irvine

* Instructor: [Prof. Chen Li](http://chenli.ics.uci.edu/)
* Lecture time: Wednesday 3:30-5:00 pm, DBH 3011

**Goal**:

* Gain hands-on experience building a system to manage large amounts of text information
* Study research challenges related to text and data management
* Form teams to do a group project; learn tools and skills to manage a software project.

Schedule

| No. | Date | Topics | Todos |
| ------------- |:-------------:| :-----| :--------|
| 01 | 04/05/2017 | [Running GUI](https://github.com/Texera/texera/wiki/Running-Texera-GUI), [Use cases](https://github.com/Texera/texera/wiki/Data-Sets), [Task assignments](https://docs.google.com/spreadsheets/d/1kTUK-T_2w5J53YJACxj1WNN7Pc43BF-4q3Rczrv9yA4/edit#gid=803748785) | Make GUI work on your data; [Initial Design Google Doc linked on github issue](https://github.com/Texera/texera/issues) |
| 02 | 04/12/2017 | Status update | (1) Medline team: modify the backend to let DictionaryMatcher also accept a file as the input; (2) Twitter team: add a sentiment analysis module/operator (work with @zuozhiw); use Stanford NLP to split a document into sentences; (3) ProposalReport team: wrap up the current Chinese proposal data with a query plan and move on to the next dataset and task; (4) LegalDoc team: modify the Join operator (JoinDistancePredicate) to exclude joined spans that are completely contained by other spans, and implement a PDF-to-text operator; (5) SmartGui team: modify the RelationManager to expose the metadata to the texera-web server |
| 03 | 04/17/2017 | Status update | (1) SmartGui team: implement the autocomplete using the new GUI in the `zuozhi-demo-base` branch; (2) Medline team: implement the new file-based dictionary using the new engine (already in master); implement a PDF2Text operator; implement a regex operator using earlier labeled entities; (3) Twitter team: implement a NlpSentenceSplitter operator; (4) LegalDoc team: implement a regex operator using earlier labeled entities; (5) ProposalReport team: implement an operator to write results to an Excel file. |
| 04 | 04/26/2017 | Status update | (1) SmartGui team: implement an interface to upload dictionaries to the backend and persist them; (2) Medline team: continue developing an operator to support regex with labeled variables; (3) Twitter team: finish the NlpSentenceSplitter operator and look for other NLP packages for tweets; (4) LegalDoc team: design the regex operator with variables; (5) ProposalReport team: finish the ExcelFileSink operator, and implement an AsterixDB Sink operator. |
| 05 | 05/03/2017 | Status update | (1) Implement the SentenceSplitter operator with a flag (one tuple with a spanlist or multiple tuples); then talk to Prof. Huang to make similar changes to the RegexSplitter operator; (2) To support Python in Texera, implement a simple operator (e.g., "string-length()") using two different architectures, and evaluate the development experience and performance; (3) Finish the FileReader operator for different file formats; (4) Finish an operator to write results to an Excel/CSV file; (5) Finish the first implementation of RegexMatcher with variables, and think about how to improve its performance and expressive power; (6) SmartGUI: finish a PR of the backend with MetaData, and do another PR for the frontend autocomplete; (7) Implement an operator of sentiment analysis based on Emojis. |
| 06 | 05/10/2017 | Status update | (1) Finish the SentenceSplitter operator; then talk to Prof. Huang to make similar changes to the RegexSplitter operator; (2) Implement a simple operator based on NLTK (in Python) using two different architectures, and evaluate the development experience and performance; (3) Finish an AsterixDB reader and writer; (4) Finish the first implementation of RegexMatcher with variables; (5) Improve its performance and expressive power; (6) SmartGUI: finish a PR of the backend with MetaData, and do another PR for the frontend autocomplete; (7) Implement an operator of sentiment analysis based on Emojis. |
| 07 | 05/17/2017 | Status update | (1) Finish the SentenceSplitter operator; talk to Prof. Huang to make similar changes to the RegexSplitter operator; (2) Implement a simple operator based on NLTK (in Python) using two different architectures, and evaluate the development experience and performance; (3) Finish an AsterixDB writer; (4) Finish the first implementation of RegexMatcher with variables; (5) Improve its performance by evaluating a subclass of regexes without qualifiers without building an automaton; (6) Finish the PR for the frontend autocomplete; (7) Start implementing a UI to upload a dictionary; (8) Implement an operator of sentiment analysis based on Emojis. |
| 08 | 05/24/2017 | Status update | (1) Finish the SentenceSplitter operator; talk to Prof. Huang to make similar changes to the RegexSplitter operator; (2) Implement a simple operator based on NLTK (in Python) using two different architectures, and evaluate the development experience and performance; (3) Finish the first implementation of RegexMatcher with variables; (4) Improve its performance by evaluating a subclass of regexes without qualifiers without building an automaton; (5) Start implementing a UI to upload a dictionary; (6) Implement an operator of sentiment analysis based on Emojis. |
| 09 | 05/31/2017 | Status update | Finish the pending PRs, and prepare for the integration hackathon next Wednesday! |

**Prerequisites:**

* Desire to learn and build a real open source system;
* Familiar with Java;
* Hands-on system-building experience;
* Eager to solve open problems;
* (Optional but a big plus) Have taken CS222 or CS221.

**Software Tools**:

* Java
* Maven
* Git
* Wiki
* Issue tracking

**Project Protocol**:

* Do not add large files to git. Check [GitHub guidance](https://help.github.com/articles/what-is-my-disk-quota/) for details.
* Write high-quality code.
* Do high-quality peer reviews.
* Write good documentation using the GitHub wiki.
* Drawing diagrams: Use Google Drawings. Add diagram source files to [Google Drive](https://drive.google.com/folderview?id=0B_b7l2bhyZTuNzN1UlM2WjRiZlE&usp=sharing) and change the ownership to "texeraproject AT gmail.com". Add authors to each diagram, and include the source file link on the wiki. Here is an [example](https://github.com/chenlica/texera/wiki/Design-and-Architecture).
* Use the "sandbox/" folder on git for your own experiments. Use the format of "[firstname]-[lastname]" (all lower case) for the name of your folder under "sandbox/".
* Use [Github Issues](https://github.com/chenlica/texera/issues) to manage tasks and bugs.

# Project Lead:

[Chen Li](https://github.com/chenlica)

GitHub link: https://github.com/apache/texera/discussions/3955
