On 2/23/16 11:07 AM, Nguyen,Giang H wrote:
I think it could be very helpful if we wrote a Python script in MADlib to tokenize words, assign the doc_id and start_pos correspondingly, and store the result in the database. Users would then save a lot of time when using CRF, and could conveniently run the CRF model on large test data sets.
Perhaps the Postgres text search stuff could be used for this (maybe to_tsvector())?
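For comparison, here is a minimal Python sketch of the tokenizing step described above. It is only an illustration, not MADlib code: the function name, the regex, the (doc_id, start_pos, token) row layout, and the zero-based default position are all assumptions, and the layout the CRF module actually expects should be checked against the MADlib docs. One thing to keep in mind with to_tsvector() is that it normalizes words into lexemes and, depending on the text search configuration, drops stop words, so its output would not be the raw tokens.

```python
import re

# Assumption: a token is either a run of word characters or a single
# punctuation character. A real CRF pipeline may need different rules.
TOKEN_RE = re.compile(r"\w+|[^\w\s]")


def tokenize_documents(docs, first_pos=0):
    """Return (doc_id, start_pos, token) rows for an iterable of (doc_id, text).

    start_pos counts tokens within each document, beginning at first_pos.
    Both the name and the row layout are hypothetical.
    """
    rows = []
    for doc_id, text in docs:
        for pos, match in enumerate(TOKEN_RE.finditer(text), start=first_pos):
            rows.append((doc_id, pos, match.group()))
    return rows


if __name__ == "__main__":
    sample = [(1, "Hello, world!"), (2, "CRF rocks")]
    for row in tokenize_documents(sample):
        print(row)
```

The rows could then be bulk-loaded into a table with whatever client is at hand (for example COPY); that part is left out here because it depends on the target schema.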
--
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Experts in Analytics, Data Architecture and PostgreSQL
Data in Trouble? Get it in Treble! http://BlueTreble.com
