Hello, I've started implementing a system for faster headline generation. WIP patch is attached.
The idea is to add a new type, currently called hltext (better names welcome), that stores the text along with the result of lexizing it. Conceptually it stores an array of tuples like (word text, type int, lexemes text[]).

A console log is also attached. It shows roughly a 5x performance increase: about 75 ms per headline on the plain text column versus 15-17 ms on the hltext column, once the caches are warm. The problem is not academic: I have texts this long in an app, and making 20 headlines takes 3s+.

The patch lacks documentation, regression tests, and most auxiliary functions (especially the I/O functions).

I have a question about the I/O functions of the new type. What format should I choose?

1. The input function could read something like 'english: the text', where english is the name of the text search configuration. The input function would do the lexizing.

2. It could read some custom format containing the tokens, token types and lexemes. Can I use flex/bison for this, or is there a good reason not to, so that I should write a hand-made parser?

3. Finally, I could make the type an actual "create type hltext_element(word text, type int, lexemes text[])", by manually filling in the applicable catalogs, and have the user declare columns as hltext_element[]. Is there a nice way to manipulate objects of such a type from within the backend? Is there an example? I suppose that in this case storage would not be as efficient as in my current implementation.

Which one to choose? Other ideas?

Regards
Marcin Mańk
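To make the conceptual (word text, type int, lexemes text[]) tuple and the "custom format" option more concrete, here is a minimal standalone C sketch. It is not taken from the patch: the struct HlElement, the function hl_element_format and the textual form "(word,type,{lex1,lex2})" are all hypothetical names and choices made up for illustration, and the sketch does no quoting or escaping, which a real output function would need.

```c
#include <stdio.h>
#include <stddef.h>

/*
 * One element of the conceptual array: (word text, type int, lexemes text[]).
 * Hypothetical in-memory layout for illustration only; it says nothing
 * about how the patch actually stores the data.
 */
typedef struct HlElement
{
    const char  *word;      /* token text as it appears in the document */
    int          type;      /* token type reported by the parser */
    const char **lexemes;   /* lexemes for the token; may be empty */
    size_t       nlexemes;  /* number of entries in lexemes */
} HlElement;

/*
 * Write one element into buf as "(word,type,{lex1,lex2})".
 * Returns the number of characters written (excluding the terminating
 * NUL), or -1 if the buffer is too small. No escaping is performed.
 */
int
hl_element_format(const HlElement *el, char *buf, size_t buflen)
{
    size_t off;
    int    n;

    n = snprintf(buf, buflen, "(%s,%d,{", el->word, el->type);
    if (n < 0 || (size_t) n >= buflen)
        return -1;
    off = (size_t) n;

    for (size_t i = 0; i < el->nlexemes; i++)
    {
        /* separate lexemes with commas, no comma before the first one */
        n = snprintf(buf + off, buflen - off, "%s%s",
                     i > 0 ? "," : "", el->lexemes[i]);
        if (n < 0 || (size_t) n >= buflen - off)
            return -1;
        off += (size_t) n;
    }

    n = snprintf(buf + off, buflen - off, "})");
    if (n < 0 || (size_t) n >= buflen - off)
        return -1;
    off += (size_t) n;

    return (int) off;
}
```

For an element with word "Jackson", an example type of 1 and the single lexeme "jackson", this writes (Jackson,1,{jackson}). An input function for such a format would have to parse exactly this text back, which is where the flex/bison versus hand-made parser question comes in.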
$ psql -p 5454 postgres -c 'create table tmp(t text, hlt hltext);';
CREATE TABLE
$ bash -c 'echo "insert into tmp(t) values(\$CUTCUT\$" ; curl -d "" http://en.wikipedia.org/wiki/Michael_Jackson; echo "\$CUTCUT\$)"' | ./bin/psql -p 5454 postgres
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  636k  100  636k    0     0   224k      0  0:00:02  0:00:02 --:--:--  258k
INSERT 0 1
$ psql -p 5454 postgres
psql (9.0.5, server 9.3devel)
WARNING: psql version 9.0, server version 9.3.
         Some psql features might not work.
Type "help" for help.

postgres=# update tmp set hlt = to_hltext('english', t);
UPDATE 1
postgres=# \timing
Timing is on.
postgres=# select ts_headline('english', t, to_tsquery('janet & jackson'), 'MaxFragments=2 MinWords=5 MaxWords=15') from tmp;
 ts_headline
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 <b>Jackson</b>-Style. (video production; Michael and <b>Janet</b> <b>Jackson</b> video) . 29 . Theatre Crafts International ... Green Day Look Forward To <b>Janet</b> <b>Jackson</b>'s VMA Tribute To Michael" . MTV . September
(1 row)

Time: 414,588 ms
postgres=# select ts_headline('english', t, to_tsquery('janet & jackson'), 'MaxFragments=2 MinWords=5 MaxWords=15') from tmp;
 ts_headline
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 <b>Jackson</b>-Style. (video production; Michael and <b>Janet</b> <b>Jackson</b> video) . 29 . Theatre Crafts International ... Green Day Look Forward To <b>Janet</b> <b>Jackson</b>'s VMA Tribute To Michael" . MTV . September
(1 row)

Time: 75,912 ms
postgres=# select ts_headline('english', hlt, to_tsquery('janet & jackson'), 'MaxFragments=2 MinWords=5 MaxWords=15') from tmp;
 ts_headline
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 <b>Jackson</b>-Style. (video production; Michael and <b>Janet</b> <b>Jackson</b> video) . 29 . Theatre Crafts International ... Green Day Look Forward To <b>Janet</b> <b>Jackson</b>'s VMA Tribute To Michael" . MTV . September
(1 row)

Time: 17,539 ms
postgres=# select ts_headline('english', hlt, to_tsquery('janet & jackson'), 'MaxFragments=2 MinWords=5 MaxWords=15') from tmp;
 ts_headline
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 <b>Jackson</b>-Style. (video production; Michael and <b>Janet</b> <b>Jackson</b> video) . 29 . Theatre Crafts International ... Green Day Look Forward To <b>Janet</b> <b>Jackson</b>'s VMA Tribute To Michael" . MTV . September
(1 row)

Time: 15,526 ms
postgres=# select ts_headline('english', t, to_tsquery('janet & jackson'), 'MaxFragments=2 MinWords=5 MaxWords=15') from tmp;
 ts_headline
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 <b>Jackson</b>-Style. (video production; Michael and <b>Janet</b> <b>Jackson</b> video) . 29 . Theatre Crafts International ... Green Day Look Forward To <b>Janet</b> <b>Jackson</b>'s VMA Tribute To Michael" . MTV . September
(1 row)

Time: 74,807 ms
postgres=#
hltext.patch
Description: Binary data
-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers