Hello everyone, I'm Giang Nguyen, a student at the University of Florida. I have been trying out Madlib as a user to find out where I could potentially contribute. While running CRF, one thing I noticed that could cause trouble for users (especially those unfamiliar with CRF and Postgres) is that they have to manually create the testing segment table test_segmenttbl(doc_id integer, start_pos integer, seg_text text) to feed into crf_test_fgen(). This can be tedious, especially for a large text corpus. I think it would be very helpful to write a Python script in Madlib that tokenizes the text, assigns the corresponding doc_id and start_pos to each token, and stores the result in the database. This would save users a lot of time when using CRF and let them conveniently run the CRF model on large testing data.
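To make the idea concrete, here is a rough sketch of the tokenizing step. It is only an illustration under a few assumptions of mine: start_pos is taken to be the zero-based index of the token within its document, tokens are split on whitespace with punctuation kept as separate tokens, and the names build_segment_rows and TOKEN_RE are hypothetical, not existing Madlib functions.

```python
import re

# Hypothetical helper, not part of Madlib. Assumes start_pos is the
# zero-based token index within each document.
# A token is either a run of word characters or one punctuation character.
TOKEN_RE = re.compile(r"\w+|[^\w\s]")


def build_segment_rows(docs):
    """Return (doc_id, start_pos, seg_text) tuples for a list of documents."""
    rows = []
    for doc_id, text in enumerate(docs):
        for start_pos, token in enumerate(TOKEN_RE.findall(text)):
            rows.append((doc_id, start_pos, token))
    return rows


# Two tiny example documents; doc_id is simply the position in the list.
rows = build_segment_rows(["Hello, world.", "CRF runs here"])
for row in rows:
    print(row)
```

The resulting tuples match the column order of test_segmenttbl, so they could then be loaded with whatever bulk-insert mechanism is preferred. The actual tokenization rules would need to agree with how the training data was tokenized.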
Best,
Giang Nguyen
University of Florida
