On Apr 04, Mark Armendariz <[EMAIL PROTECTED]> wrote:

> I was trying to figure out the best way to make a minimal change to
> the code.  All I really wanted to do was use a MEDIUMTEXT instead
> of a plain TEXT field so it could fit the data properly.

Your solution makes sense: any value between 2**16 and 2**24-1
will be mapped to a MEDIUMTEXT, in MySQL.

Anyway, right now, I'm wondering why your database server doesn't
just truncate a too long value.  Is it the default behavior (in
MySQL and other db servers)?
Mine (MySQL 5.0.32), just issues a warning.

I'm still a bit confused about the whole issue, and when I'm
confused some of my worse habits come out: as everyone, as a child,
I used to lie sometimes - now I'm a grownup I... WRITE BENCHMARKS! ;-)

Using the attached simple script, I got these results on an old system:

==================================================================
mysql 5.0.32 UnicodeCol (it uses the TEXT type)
11.08 sec 64012 KB

mysql 5.0.32 BLOBCol (it uses the MEDIUMTEXT type)
14.47 sec 65344 KB

postgresql 8.1.2 UnicodeCol (it uses the TEXT type)
8.93 sec 1336 KB - not sure if this is the effectively used disc space.

postgresql 8.1.2 BLOBCol (it uses the BYTEA type)
5.23 sec 1336 KB

sqlite 3.3.8 UnicodeCol (it uses the TEXT type)
6.94 sec 66084 KB

sqlite 3.3.8 BLOBCol (it uses the VARCHAR(65536) type) - THAT'S AN ABOMINATION!!
11.95 sec 89104 KB

==================================================================

Some of the results are a bit unexpected, but it's not our business. ;-)
SQLite, with BLOBCol, uses a VARCHAR of the given length, and that's
a very bad thing.  Everything else looks promising, even if it takes
over 30% longer to populate a BLOBCol in MySQL, and it's not exactly
a brilliant result. :-/

Oh, by the way: I've not seen the Jesper's reply (was it directed to
this list, too?), so maybe I've not taken something in consideration.

Talking about the "truncate the string in the code" solution: it
has some impact on performances, too - even if it's probably not
too much noticeable given that there are a lot of other, slower,
operations in the code and that a lot of strings are very shorts.

I'm also wondering the best place to cut it: it's possible to
replace line #1419 of imdbpy2sql.py with:
  if v: sqldata.add((mopid, theid, v[:65535], note))

Another option: cut only person's biographies in line #128
of the parser.local.personParser.py file.

The second solution is obviously less generic but faster, and has another
funny (while probably unnoticeable, in practice) pro: here strings
are unicode object, and you can safely cut them everywhere you
want.  The imdbpy2sql.py script, on the other hand, internally
manages (for a lot of SURELY GOOD reasons I actually forgot ;-)
everything as strings encoded in utf8, and you must pay a bit
of attention slicing them, or you'll end up cutting a non-ASCII
char in two, and some databases will complain at insert time.

Hmmm.... yes: I'm _so much_ confused. ;-)
Options:
1. write IF IT HAPPENS IT'S YOUR FAULT in the documentation. ;-)
2. use BLOBCol.
3. cut the strings somewhere.


> I'm using Mysql 5.0.45-community with InnoDB tables.

I have encountered some problems running the benchmarks on a InnoDB
table - I'll try to fix it later.
In the first run, have you used the '--mysql-innodb' option of
the imdbpy2sql.py script?  With it, data insertion is done on a
MyISAM tables, later converted to InnoDB for performances reasons.


> 77754      Susan Atkins, a.k.a. Sadie Mae Glutz, is one of th...

I'm about to vote '1' to every single movie in which Susan Atkins
ever appeared in... ;-)



-- 
Davide Alberani <[EMAIL PROTECTED]> [PGP KeyID: 0x465BFD47]
http://erlug.linux.it/~da/
#!/usr/bin/env python
# -*- coding: utf-8 -*-

from sqlobject import *

URI = 'mysql://USER:[EMAIL PROTECTED]/DB'
URI = 'postgres://USER:[EMAIL PROTECTED]/DB'
URI = 'sqlite:///tmp/test.db'


TEST_TXT = 'Lorem ipsum dolor sit amet ' \
            'consectetur adipisicing elit, sed do eiusmod tempor ' \
            'incididunt ut labore et dolore magna ' \
            'aliqua. Ut enim ad minim veniam, quis nostrud exercitation ' \
            'ullamco laboris nisi ut aliquip ex ea ' \
            'commodo consequat. Duis aute irure dolor ' \
            'in reprehenderit in voluptate velit esse cillum dolore eu ' \
            'fugiat nulla pariatur. Excepteur sint occaecat ' \
            'cupidatat non proident, sunt in culpa qui officia ' \
            'deserunt mollit anim id est laborum.\n'

# 66900 chars.
LONG_TEXT = TEST_TXT*150

class PersonInfo(SQLObject):
    #info = UnicodeCol(notNone=True)
    info = BLOBCol(notNone=True, varchar=True, length=2**16)

conn = connectionForURI(URI)
#conn.debug = True
PersonInfo.setConnection(conn)
PersonInfo.sqlmeta.cacheValues = False
PersonInfo._cacheValue = False
PersonInfo.dropTable(ifExists=True)
PersonInfo.createTable()

for count in xrange(1000):
    PersonInfo(info=LONG_TEXT)


-------------------------------------------------------------------------
This SF.net email is sponsored by the 2008 JavaOne(SM) Conference 
Register now and save $200. Hurry, offer ends at 11:59 p.m., 
Monday, April 7! Use priority code J8TLD2. 
http://ad.doubleclick.net/clk;198757673;13503038;p?http://java.sun.com/javaone
_______________________________________________
Imdbpy-devel mailing list
Imdbpy-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/imdbpy-devel

Reply via email to