(i forgot to add the list address to this reply - Jesper, thanks for inspiring me to make all these queries - i hope you don't think I was trying to shut your comment down by any means as you hold a solid point about the enormity of the field) - Mark

Jesper Noehr wrote:
Hi Mark,

It makes sense to resize the column to fit the data, although I don't think 65536 bytes of binary data is appropriate. This would put a more serious impact on performance. I think as of MySQL 4.1, varchars are not limited to 255 any longer, do you know if this is correct? If so, I think the column could do with being resized to ~500 or so.


j
Hi Jesper,

According to the mysql manual, it seems that a VARCHAR can be over 255 chars (from my understanding of the description). But it doesn't seem to specify the actual limit.
http://dev.mysql.com/doc/refman/5.0/en/storage-requirements.html
Under the heading "Storage Requirements for String Types"

But the code I proposed, though it does specify varchar=True, is not actually creating a VARCHAR column (as described in my last reply), but rather a MEDIUMTEXT.

As for the length of the field for the application, I've little say. I was using this specifically for the imdb -> sql conversion and plan to use the data moreso than the imdbpy library (for now at least). But, in the interest of the thread, I ran a few queries to get a better understanding of what's in the person_info table.

SELECT AVG(LENGTH(info)) FROM person_info
returns 88.4113, so you definitely have a point about truncating, but

SELECT COUNT(id) FROM person_info WHERE LENGTH(info) > 500
returns 40781, which is about 2% of the data

SELECT COUNT(id) FROM person_info WHERE LENGTH(info) > 1000
returns 20224, which is about 1% of the data.

1500 is about 1/2 a percent (.006)

Here's some general statistics about the table
SELECT t.info, AVG(LENGTH(p.info)) AS AVGLength, MAX(LENGTH(p.info)) AS MaxLength, COUNT(p.id) AS Records FROM person_info p LEFT JOIN info_type t ON p.info_type_id = t.id GROUP BY p.info_type_id ORDER BY AVG(LENGTH(p.info)) DESC

info               AVGLength     MaxLength     Records
mini biography     1328.7544         77754       41996
quotes              199.1172         10402       42137
articles            105.4514           994       71346
trivia              104.6574          6197      403739
interviews          102.0143           390       33992
books                98.4247           423       14018
pictorials           93.6796           659       42350
other works          90.6187          6658      199107
where now            87.1480          1008       21929
trademarks           53.4940          1121        7892
spouse               47.0573           124       82573
biographical movies  45.4442           199        4268
salary history       43.9208           159        4469
magazine covers      41.3896           151       46944
portrayed            36.2904           123        8496
death notes          29.9600           150       65228
birth notes          22.7949           112      236797
birth name           21.4084           218       83942
death date           13.4271            95       77152
birth date           12.5768            39      258693
nick names            7.9800           129       34466
height                5.5751            13       71154


Here's what's being cut off over 1000 chars
SELECT t.info, COUNT(p.id) AS Records FROM person_info p LEFT JOIN info_type t ON p.info_type_id = t.id WHERE LENGTH(p.info) > 1000 GROUP BY p.info_type_id ORDER BY Records DESC

info             Records
mini biography     19238
trivia               438
other works          344
quotes               202
trademarks             1
where now              1


And over 2000 chars
info             Records
mini biography      7616
other works           50
trivia                48
quotes                12


Hope this helps...

Have a great weekend!

Mark


On Apr 3, 2008, at 7:48 PM, Mark Armendariz wrote:
Hello,

I've just joined this list. I found this app this morning and downloaded the imdb database last night. While converting the data, I was getting an error stating that some data was too long for the person_info.info field in the biographies list. This patch changes the person_info.info field from a text to a mediumtext field.

After making this change, I was able to successfully convert the text files into a mysql database.

Thanks!!

Mark Armendariz

=== modified file 'imdbpy/imdb/parser/sql/dbschema.py'
--- imdbpy/imdb/parser/sql/dbschema.py    2008-04-02 06:48:00 +0000
+++ imdbpy/imdb/parser/sql/dbschema.py    2008-04-02 06:49:41 +0000
@@ -128,7 +128,7 @@
class PersonInfo(SQLObject):
    personID = IntCol(notNone=True)
    infoTypeID = IntCol(notNone=True)
-    info = UnicodeCol(notNone=True)
+    info = BLOBCol(notNone=True, varchar=True, length=2**16)
    note = UnicodeCol(default=None)

class RoleType(SQLObject):



-------------------------------------------------------------------------
Check out the new SourceForge.net Marketplace.
It's the best place to buy or sell services for
just about anything Open Source.
http://ad.doubleclick.net/clk;164216239;13503038;w?http://sf.net/marketplace_______________________________________________
Imdbpy-devel mailing list
Imdbpy-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/imdbpy-devel




-------------------------------------------------------------------------
Check out the new SourceForge.net Marketplace.
It's the best place to buy or sell services for
just about anything Open Source.
http://ad.doubleclick.net/clk;164216239;13503038;w?http://sf.net/marketplace
_______________________________________________
Imdbpy-devel mailing list
Imdbpy-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/imdbpy-devel

Reply via email to