On Fri, Feb 19, 2016 at 8:28 PM, Josh Berkus <j...@agliodbs.com> wrote:
> On 02/19/2016 05:49 AM, s d wrote:
>
>> On 19 February 2016 at 14:19, Bruce Momjian <br...@momjian.us
>> <mailto:br...@momjian.us>> wrote:
>>
>> I wonder if PLPerl could be used to extract the words from a PDF
>> document and create a tsvector column from it.
>>
>>
>> I don't know about PLPerl(I'm pretty sure it could be used for this
>> purpose, though.). On the other hand I've written code for this in
>> Python which should be easy to adapt for PLPython, if necessary.
>>
>
> I'd swear someone already built something to do this. All you need is a
> library which reads PDF and transforms it into text, and then you can FTS
> it. I know there's a module for OpenOffice docs somewhere as well, but
> heck if I can remember where.
>

I used pdftotext for that. I think it'd be useful to have extension{s},
which can be used to convert anything to text. I remember someone indexed
chemical formulae, TeX/LaTeX, DOC files.

> --
> --
> Josh Berkus
> Red Hat OSAS
> (any opinions are my own)
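
For illustration only, here is a minimal Python sketch of the pdftotext route mentioned above: run pdftotext on a file, then let PostgreSQL build the tsvector with to_tsvector(). It is not code from anyone in this thread. The table docs(path, body_tsv), the 'english' text search configuration, and the helper names are all assumptions, and the cursor is assumed to be a DB-API cursor that uses %s placeholders (psycopg2, for example).

```python
import subprocess


def pdftotext_command(pdf_path):
    """Build the argv for pdftotext; '-' as the output file means stdout."""
    return ["pdftotext", "-enc", "UTF-8", pdf_path, "-"]


def extract_text(pdf_path):
    """Run pdftotext (must be on PATH) and return the extracted text."""
    result = subprocess.run(
        pdftotext_command(pdf_path), check=True, capture_output=True
    )
    return result.stdout.decode("utf-8", errors="replace")


# Hypothetical table: CREATE TABLE docs (path text, body_tsv tsvector);
INSERT_SQL = (
    "INSERT INTO docs (path, body_tsv) "
    "VALUES (%s, to_tsvector('english', %s))"
)


def insert_params(pdf_path, text):
    """Prepare query parameters; PostgreSQL text cannot contain NUL bytes."""
    return (pdf_path, text.replace("\x00", ""))


def index_pdf(cursor, pdf_path):
    """Extract the text of one PDF and store its tsvector.

    cursor is assumed to be a DB-API cursor with %s placeholders.
    """
    text = extract_text(pdf_path)
    cursor.execute(INSERT_SQL, insert_params(pdf_path, text))
```

The same extract-then-to_tsvector() shape should carry over to other formats by swapping the converter command, which is roughly what a "convert anything to text" extension would wrap.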