Re: General interest question: PDF contents handling in PostgreSQL.

Fabián R. Breschi Wed, 27 Nov 2002 02:02:36 -0800

Rob Nagler wrote:

I agree with you about the overhead of the Oracle solution. 
In particular, using intermedia with the 'Internet File System' option of 
8i/9i, things get extremely complex in terms of manageability. On the 
other hand, the user-friendly interface that lets you drop a file into 
the DB and have it indexed on the fly has a high cost in terms of system


It isn't indexed on the fly in our version (8i).  Has this changed?
You have to run the indexer regularly, so in this it is no better than
external indexing solutions.  Indeed, one of the big problems is that
you can't qualify the query *prior* to index search afaik.  It seems
to search the entire index always.  In our case, this is extremely
costly, because our space naturally divides, and isolated indexes
would solve the problem much more efficiently.
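
A minimal sketch of the "isolated indexes" idea, assuming a simple in-memory inverted index kept separately per partition. The class name, the partition names, and the whitespace tokenizer are all hypothetical and are not how intermedia works; the point is only that a query qualified by partition never touches the other partitions' indexes.

```python
from collections import defaultdict


class PartitionedIndex:
    """One small inverted index per partition (hypothetical design)."""

    def __init__(self):
        # partition name -> word -> set of document ids
        self._partitions = defaultdict(lambda: defaultdict(set))

    def add(self, partition, doc_id, text):
        # Naive whitespace tokenizer, lowercased; an assumption for brevity.
        for word in text.lower().split():
            self._partitions[partition][word].add(doc_id)

    def search(self, partition, word):
        # Only the named partition's index is consulted.
        index = self._partitions.get(partition)
        if index is None:
            return set()
        return set(index.get(word.lower(), set()))


idx = PartitionedIndex()
idx.add("club_a", 1, "Taxes for clubs")
idx.add("club_b", 2, "taxes and fees")
```

With this layout, `idx.search("club_a", "taxes")` looks only at the `club_a` index, so the cost depends on the size of that partition rather than on the whole document set.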

Oracle claims that, from 8i onwards (Enterprise edition), file search within the DB through IFS takes a fraction of the time needed with MS flat files; obviously, that doesn't mean indexing performs well compared to an analogous solution. I didn't pay attention to the reindexing needed after dropping a doc into IFS, since I have definitively abandoned the idea due to performance issues. Looking back at the history, from Context to Intermedia, the solution has now become 'UltraSearch', and I personally still have to get acquainted with its improvements.

resources. From my point of view, this particular workflow did 
not scale well on existing systems that had only the RDBMS installed 
and no spare capacity, especially in terms of CPU/memory resources.


It scales enough, if you aren't trying to solve the google
problem. :-)  For our users, it's ok performance, even for the heavy
internal users.  Just being able to search message boards and file
areas (including Word docs) is a huge plus for us.

I tried IFS on a system that was doing the RDBMS job well, a low-to-mid sized and tuned configuration running Solaris 2.6 on Sun SPARC. IFS gave us a horrible first impression by bringing the system to its knees. Frankly, I didn't have the time or patience to find out whether it could be tuned a little further to scale; in my opinion, that would have been a waste of time in that particular situation without really scaling up the machine.

Following your suggestion, I could store the PDF text content 
extracted with pdftotext in a 'TEXT' column in PostgreSQL, then 
use a search engine to look inside it, giving functionality 
similar to intermedia.
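
The workflow described above could be sketched roughly as follows. This is only an illustration under stated assumptions: the `documents` table and its `filename` and `body` columns are hypothetical, and the `%s` placeholders assume a DB-API driver that uses that parameter style (psycopg2 does). The `pdftotext` tool must be installed for `extract_text` to work.

```python
import subprocess


def pdftotext_command(pdf_path):
    # "-" as the output file tells pdftotext to write to standard output.
    return ["pdftotext", "-enc", "UTF-8", pdf_path, "-"]


def extract_text(pdf_path):
    # Runs the external pdftotext tool and returns the extracted text.
    result = subprocess.run(
        pdftotext_command(pdf_path), capture_output=True, check=True
    )
    return result.stdout.decode("utf-8", errors="replace")


def insert_statement(filename, body):
    # Hypothetical table: documents(filename TEXT, body TEXT).
    # Returns the SQL and its parameters, ready for cursor.execute(sql, params).
    sql = "INSERT INTO documents (filename, body) VALUES (%s, %s)"
    return sql, (filename, body)
```

A caller would run `extract_text("report.pdf")` and pass the result of `insert_statement(...)` to its database cursor; keeping the SQL parameterized avoids quoting problems with arbitrary PDF text.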

Regarding the search engine, I guess it would need at least 
an unstructured text search algorithm, along with something 
like SOUNDEX in Oracle.
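
For reference, here is a small sketch of the classic American Soundex algorithm, the kind of phonetic matching that a SOUNDEX function provides. It is a plain reimplementation for illustration, not Oracle's code, and the function name is my own choice.

```python
# Letter groups that share a Soundex digit.
CODES = {}
for letters, digit in (
    ("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
    ("L", "4"), ("MN", "5"), ("R", "6"),
):
    for ch in letters:
        CODES[ch] = digit


def soundex(word):
    letters = [c for c in word.upper() if c.isalpha()]
    if not letters:
        return ""
    result = letters[0]                 # the first letter is kept as-is
    prev = CODES.get(letters[0], "")    # its code still suppresses repeats
    for ch in letters[1:]:
        if ch in "HW":
            continue                    # H and W do not separate equal codes
        code = CODES.get(ch, "")        # vowels (and Y) have no code
        if code and code != prev:
            result += code
        prev = code                     # a vowel resets prev, allowing a repeat
    return (result + "000")[:4]         # pad or truncate to four characters
```

Names that sound alike map to the same code, for example `soundex("Robert")` and `soundex("Rupert")` both give `R163`, so a search could compare codes instead of exact spellings.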


I don't think intermedia uses SOUNDEX.  It does pure keyword
matching.  It's particularly bad in my opinion.  It also doesn't learn
what people really want to know.  For example, if you search:

http://www.bivio.com/pub/search?s=taxes

You always get the IRS Pubs, but this is rarely what people are
looking for on our site (although they should read the publications,
they are more interested in what bivio can do for them in terms of
taxes).  Note the performance on the search.  The data set you are
searching in the public case is very small in comparison to the whole
document database which is multi-GB.

Hope this helps.

Rob

I'm not sure what intermedia uses to search text; it certainly doesn't learn anything about searches (I don't know what 'UltraSearch' is capable of, despite all the hype Oracle is putting into this technology, as usual). Regarding the search on bivio.com, it's quite okay in terms of human-perceived response, but it should probably do better considering that the indexed data set is only 12 pages.

Thanks a lot for your valuable suggestions. I will let you know if there is any further development on what we've been talking about.

All the best.

Fabian.
