[D] CS290: Text Analytics in the Big Data Era [texera]

via GitHub Sun, 19 Oct 2025 23:02:18 -0700


GitHub user chenlica created a discussion: CS290: Text Analytics in the Big 
Data Era


The content is from https://github.com/apache/texera/wiki/CS290-Spring-2017 
(may be dangling)

======
# CS290: Text Analytics in the Big Data Era  
Spring 2017, Department of Computer Science, UC Irvine  

* Instructor: [Prof. Chen Li](http://chenli.ics.uci.edu/)
* Lecture time: Wednesday 3:30-5:00 pm, DBH 3011

**Goal**: 
* Gain hands-on experience building a system that manages large amounts of 
text information
* Study research challenges related to text and data management
* Form teams to do a group project; learn tools and skills to manage a software 
project.

Schedule

|    No.    | Date           | Topics  |  Todos |
| ------------- |:-------------:| :-----| :--------| 
| 01      | 04/05/2017 | [Running GUI](https://github.com/Texera/texera/wiki/Running-Texera-GUI), [Use cases](https://github.com/Texera/texera/wiki/Data-Sets), [Task assignments](https://docs.google.com/spreadsheets/d/1kTUK-T_2w5J53YJACxj1WNN7Pc43BF-4q3Rczrv9yA4/edit#gid=803748785) | Make GUI work on your data; [Initial Design Google Doc linked on github issue](https://github.com/Texera/texera/issues) |
| 02      | 04/12/2017 | Status update | (1) Medline team: Modify the backend to let DictionaryMatcher also accept a file as the input; (2) Twitter team: Add sentiment analysis module/operator (work with @zuozhiw); use Stanford NLP to split a document into sentences; (3) ProposalReport team: Wrap up the query plan for the current Chinese proposal data and move on to the next dataset and task; (4) LegalDoc team: modify the Join operator (JoinDistancePredicate) to exclude joined spans that are completely contained by other spans, and implement a PDF-to-text operator; (5) SmartGui team: modify the RelationManager to expose the metadata to texera-web server |
| 03      | 04/17/2017 | Status update | (1) SmartGui team: implement the autocomplete using the new GUI in the branch of `zuozhi-demo-base`; (2) Medline team: implement the new file-based dictionary using the new engine (already in master); implement a PDF2Text operator; implement a regex operator using earlier labeled entities; (3) Twitter team: implement a NlpSentenceSplitter operator; (4) LegalDoc team: implement a regex operator using earlier labeled entities; (5) ProposalReport team: implement an operator to write results to an Excel file. |
| 04      | 04/26/2017 | Status update | (1) SmartGui team: implement an interface to upload dictionaries to the backend for persistent storage; (2) Medline team: Continue the task of developing an operator to support regex with labeled variables; (3) Twitter team: finish the NlpSentenceSplitter operator and look for other NLP packages for tweets; (4) LegalDoc team: design the regex operator with variables; (5) ProposalReport team: finish the ExcelFileSink operator, and implement an AsterixDB Sink operator. |
| 05      | 05/03/2017 | Status update | (1) Implement the SentenceSplitter operator with a flag (one tuple with a spanlist or multiple tuples); then talk to Prof. Huang to make similar changes to the RegexSplitter operator; (2) To support Python in Texera, implement a simple operator (e.g., "string-length()") using two different architectures, and evaluate the development experience and performance; (3) Finish the FileReader operator for different file formats; (4) Finish an operator to write results to an Excel/CSV file; (5) Finish the first implementation of RegexMatcher with variables, and think about how to improve its performance and expressive power; (6) SmartGUI: finish a PR of the backend with MetaData, and do another PR for the frontend autocomplete; (7) Implement an operator of sentiment analysis based on Emojis. |
| 06      | 05/10/2017 | Status update | (1) Finish the SentenceSplitter operator; then talk to Prof. Huang to make similar changes to the RegexSplitter operator; (2) Implement a simple operator based on NLTK (in Python) using two different architectures, and evaluate the development experience and performance; (3) Finish an AsterixDB reader and writer; (4) Finish the first implementation of RegexMatcher with variables; (5) Improve its performance and expressive power; (6) SmartGUI: finish a PR of the backend with MetaData, and do another PR for the frontend autocomplete; (7) Implement an operator of sentiment analysis based on Emojis. |
| 07      | 05/17/2017 | Status update | (1) Finish the SentenceSplitter operator; talk to Prof. Huang to make similar changes to the RegexSplitter operator; (2) Implement a simple operator based on NLTK (in Python) using two different architectures, and evaluate the development experience and performance; (3) Finish an AsterixDB writer; (4) Finish the first implementation of RegexMatcher with variables; (5) Improve its performance by evaluating a subclass of regexes without qualifiers without building an automaton; (6) Finish the PR for the frontend autocomplete; (7) Start implementing a UI to upload a dictionary; (8) Implement an operator of sentiment analysis based on Emojis. |
| 08      | 05/24/2017 | Status update | (1) Finish the SentenceSplitter operator; talk to Prof. Huang to make similar changes to the RegexSplitter operator; (2) Implement a simple operator based on NLTK (in Python) using two different architectures, and evaluate the development experience and performance; (3) Finish the first implementation of RegexMatcher with variables; (4) Improve its performance by evaluating a subclass of regexes without qualifiers without building an automaton; (5) Start implementing a UI to upload a dictionary; (6) Implement an operator of sentiment analysis based on Emojis. |
| 09      | 05/31/2017 | Status update | Finish the pending PRs, and prepare for the integration hackathon next Wednesday! |
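
Item (1) in weeks 05 through 08 asks for a SentenceSplitter operator with a flag 
that chooses between one output tuple carrying a span list and one tuple per 
sentence. A minimal Java sketch of that flag follows. It assumes sentences are 
located with the JDK's `java.text.BreakIterator` (the schedule mentions Stanford 
NLP for this step; the JDK class is swapped in only to keep the sketch 
self-contained). The names `SentenceSplitSketch`, `Span`, and `split` are 
hypothetical and are not Texera's actual operator API.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Locale;

// Hypothetical sketch; not the Texera operator interface.
final class SentenceSplitSketch {

    /** Hypothetical span: [start, end) character offsets into the source text. */
    static final class Span {
        final int start;
        final int end;

        Span(int start, int end) {
            this.start = start;
            this.end = end;
        }
    }

    /** Finds sentence boundaries and returns one span per non-empty sentence. */
    static List<Span> sentenceSpans(String text) {
        BreakIterator it = BreakIterator.getSentenceInstance(Locale.US);
        it.setText(text);
        List<Span> spans = new ArrayList<>();
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            // Drop trailing whitespace so the span covers only the sentence itself.
            int trimmedEnd = end;
            while (trimmedEnd > start && Character.isWhitespace(text.charAt(trimmedEnd - 1))) {
                trimmedEnd--;
            }
            if (trimmedEnd > start) {
                spans.add(new Span(start, trimmedEnd));
            }
        }
        return spans;
    }

    /**
     * Each inner list stands in for one output tuple's span list.
     * oneTuplePerSentence = true  -> one tuple per sentence, each with one span.
     * oneTuplePerSentence = false -> a single tuple holding every sentence span.
     */
    static List<List<Span>> split(String text, boolean oneTuplePerSentence) {
        List<Span> spans = sentenceSpans(text);
        List<List<Span>> tuples = new ArrayList<>();
        if (oneTuplePerSentence) {
            for (Span s : spans) {
                tuples.add(Collections.singletonList(s));
            }
        } else {
            tuples.add(spans);
        }
        return tuples;
    }

    public static void main(String[] args) {
        String text = "Texera is fun. It splits text.";
        System.out.println("single-tuple mode: " + split(text, false).size() + " tuple(s)");
        System.out.println("multi-tuple mode: " + split(text, true).size() + " tuple(s)");
    }
}
```

Keeping both modes behind one boolean means downstream operators can either 
process each sentence independently or keep all sentence spans attached to the 
original document tuple.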


**Prerequisites:**

* Desire to learn and build a real open source system;
* Familiarity with Java;
* Hands-on system-building experience;
* Eagerness to solve open problems;
* (Optional but a big plus) Prior coursework in CS222 or CS221.

**Software Tools**:

* Java
* Maven
* Git
* Wiki
* Issue tracking

**Project Protocol**:

* Do not add large files to git.  Check [github 
guidance](https://help.github.com/articles/what-is-my-disk-quota/) for details.
* Write high-quality code.
* Do high-quality peer reviews.
* Write good documentation using the GitHub wiki.
* Drawing diagrams: Use Google Drawings. Add diagram source files to [Google 
Drive](https://drive.google.com/folderview?id=0B_b7l2bhyZTuNzN1UlM2WjRiZlE&usp=sharing)
 and change the ownership to "texeraproject AT gmail.com".  Add authors to each 
diagram, and include the source file link on the wiki.  Here is an 
[example](https://github.com/chenlica/texera/wiki/Design-and-Architecture).
* Use the "sandbox/" folder on git for your own experiments.  Use the format 
of "[firstname]-[lastname]" (all lower case) for the name of your folder under 
"sandbox/".
* Use [Github Issues](https://github.com/chenlica/texera/issues) to manage 
tasks and bugs.

# Project Lead:
![Chen 
Li](https://docs.google.com/drawings/d/1PIQwRDWhX66nWYO1hAGn7DA3T5KnARz5S-FKeiJzHvs/pub?w=200&h=200)
  
[Chen Li](https://github.com/chenlica)  


GitHub link: https://github.com/apache/texera/discussions/3955

