On Oct 18, Jay Klein <[EMAIL PROTECTED]> wrote:

> First I created a movies.list.gz file that contained only the
> titles for Futurama and its episodes, and ran imdbpy2sql.py with
> only that file.

Very good test!  You have spotted the bug.

> I have a hunch that the multiple records for the series have
> something to do with the "TOO MANY DATA (100000 items), SPLITTING
> (run #1)..." type messages

Yes; the bug must be there. :-/

As I've said, unfortunately I won't have a fresh copy of the plain
text data files for some time; would you be so kind as to run a
simple test?  Here it is: edit the imdbpy2sql.py file and search for
this line:

  def __init__(self, d=None, flushEvery=100000):

It should be line 397 (it defines the initialization of a _BaseCache
object).  Now change 100000 to something smaller, say 50006, and run
the script again on the whole data set.
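After the change, the definition should look more or less like this
(the class body below is just my reconstruction from memory, so don't
take anything but the flushEvery default literally):

  # imdbpy2sql.py, around line 397 -- a sketch from memory; only the
  # flushEvery default matters for this test.
  class _BaseCache(dict):
      """Collect items in memory, flushing them to the db in batches."""
      def __init__(self, d=None, flushEvery=50006):  # was: flushEvery=100000
          dict.__init__(self)
          # how many items to accumulate before a flush to the database
          self.flushEvery = flushEvery
          if d is not None:
              self.update(d)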
If the "TOO MANY DATA" message still appears, interrupt the script
and reduce the value once again (if it appears with a value different
from the one you've set, ignore it: it refers to a different data
set.  I've picked the "strange number" 50006 so that it can't be
mistaken for other values I've set in the script).  My guess is that
in a run that never needs to split the data flushed to the db,
everything will be ok.

> I haven't examined the code closely enough to really comment
> intelligently on it,

Well... let's say that the code is designed for high performance
(that's the polite way of saying "it's almost crap!" ;-)  Seriously:
unfortunately it's a task that requires a lot of computation, and _a
lot_ of dirty tricks are used; they make the code somewhat ugly, but
at least it's fast. [1]

> but I think the idea here is when you're flushing some of the
> list entries to the database, you sometimes divide the set into
> smaller chunks.

Right.  We collect the data (variously processed) in Python
structures (lists, dictionaries, tuples, ...) and from time to time,
when we've collected a lot of items, we flush the current set to the
database.  It works this way for performance reasons.  Unfortunately
MySQL has a strict limit on the amount of data it can receive in a
single shot (the max_allowed_packet setting, I believe), and even
though these caches are sized so that the data shouldn't normally
need to be split, the plain text data files are always changing (and
there are different MySQL configurations out there), so I've assumed
that sometimes you can hit a bad situation: you're trying to send
more data than the db can accept.  In that case the data is split in
two pieces, which are flushed separately.  This portion of the code
was veeery poorly tested, believe me! ;-)
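To make the idea concrete, here's a tiny self-contained sketch of the
pattern (NOT the real imdbpy2sql.py code: every name and the fake
size limit are invented for illustration):

  class TooMuchData(Exception):
      """Stands in for MySQL rejecting an oversized INSERT."""

  def send_to_db(items):
      # pretend the db refuses any batch above an arbitrary size
      if len(items) > 100000:
          raise TooMuchData()
      print('inserted %d rows' % len(items))

  def flush(items, run=1):
      """Flush a batch; if the db refuses it, split it in two
      disjoint halves and retry each of them separately."""
      try:
          send_to_db(items)
      except TooMuchData:
          print('TOO MANY DATA (%d items), SPLITTING (run #%d)...'
                % (len(items), run))
          half = len(items) // 2
          flush(items[:half], run + 1)
          flush(items[half:], run + 1)
          # Danger zone: if the two slices above overlapped, or if a
          # partially-sent batch were re-sent whole, every affected
          # item would end up in the db twice -- which is exactly
          # what duplicated series records would look like.

Again, this is just an illustration of the failure mode, not a claim
about where the real bug is.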
> These lines in the _toDB() method of the MoviesCache class seem
> like they could play a part in producing this behavior (line 578
> of imdbpy2sql.py):

Maybe you're right.  I have to take a closer look at that: I haven't
modified the script in a long time (and it's poorly commented).

> Thanks again for the help.

Thanks to you!  Obviously you've won an entry in the CREDITS file. :-)

+++
[1] JMDB [2], a good interface to the plain text data files which
populates a MySQL database using a Java program and _without_ a
single data manipulation (in contrast to the even excessive
manipulation done by imdbpy2sql.py), takes more or less the same
amount of time to complete.

[2] http://www.jmdb.de/

--
Davide Alberani <[EMAIL PROTECTED]> [PGP KeyID: 0x465BFD47]
http://erlug.linux.it/~da/