On Oct 18, Jay Klein <[EMAIL PROTECTED]> wrote:

> First I created a movies.list.gz file that contained only the
> titles for Futurama and its episodes, and ran imdbpy2sql.py with
> only that file.

Very good test!  You have spotted the bug.

> I have a hunch that the multiple records for the series have
> something to do with the "TOO MANY DATA (100000 items), SPLITTING
> (run #1)..." type messages

Yes; the bug must be there. :-/

As I've said, unfortunately I won't have a fresh copy of the plain
text data files for some time; would you be so kind as to run a
simple test?

Here it is: edit the imdbpy2sql.py file, and search for this line:
  def __init__(self, d=None, flushEvery=100000):

it should be line 397 (it defines the initialization of a _BaseCache
object).

Now: change 100000 to something smaller, maybe 50006, and run the
script again on the whole data set.
If the "TOO MANY DATA" message still appears, interrupt the script and
reduce the value again (if it appears with a value different from the
one you've set, ignore it: it should refer to a different data set -
I chose the "strange" number 50006 so that it can't be mistaken for
other values I've set in the script).
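
To be explicit, the whole edit is just the default value of
flushEvery (the exact line number may differ in your copy):

  # original:
  def __init__(self, d=None, flushEvery=100000):
  # test version, using the "strange" value:
  def __init__(self, d=None, flushEvery=50006):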

My guess is that in a run where the flushed data never needs to be
split, everything will be OK.

> I haven't examined the code closely enough to really comment
> intelligently on it,

Well... let's say that the code is designed for high performance
(that's the nice way of saying "it's almost crap!" ;-)

Seriously: unfortunately it's a task that requires a lot of computation,
and _a lot_ of dirty tricks are used; they make the code somewhat ugly,
but at least it's fast [1].

> but I think the idea here is when you're flushing some of the
> list entries to the database, you sometimes divide the set into
> smaller chunks.

Right.
We collect the data (variously processed) in Python structures (lists,
dictionaries, tuples, ...) and from time to time, when we've collected
a lot of items, we flush the current set to the database.
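
Very schematically, the pattern is something like this (just a sketch
of the idea, not the real code; apart from _BaseCache, flushEvery and
_toDB, the names here are made up):

  class _BaseCache:
      """Collect items in memory, flushing them to the db in batches."""
      def __init__(self, d=None, flushEvery=100000):
          self._cache = {}
          self.flushEvery = flushEvery
          if d is not None:
              self._cache.update(d)

      def add(self, key, value):
          """Store an item; flush when the cache grows too large."""
          self._cache[key] = value
          if len(self._cache) >= self.flushEvery:
              self.flush()

      def flush(self):
          """Send the collected items to the db and empty the cache."""
          self._toDB()
          self._cache.clear()

      def _toDB(self):
          # the actual INSERTs; implemented by the subclasses
          # (MoviesCache and friends).
          raise NotImplementedError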

It works this way for performance reasons.
Unfortunately MySQL has a strict limit on the amount of data it can
receive in a single shot, and even if these caches are sized so that
the data doesn't need to be split, the plain text data files are
always changing (and there are different MySQL configurations out
there), so I've assumed that sometimes you can hit a bad situation:
you're trying to send more data than the db can accept.
In this situation the data is split in two pieces, which are
flushed separately.
This portion of code was veeery poorly tested, believe me! ;-)
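
Roughly, the idea is something like this (a sketch only: the real
code in _toDB() differs, and flush_data, sql and curs are just
placeholder names):

  from MySQLdb import OperationalError

  def flush_data(curs, sql, items, _run=1):
      """Send a set of rows with a single executemany(); if the db
      refuses it (too much data in one shot), split the set in two
      pieces and flush them separately."""
      try:
          curs.executemany(sql, items)
      except OperationalError:
          if len(items) < 2:
              raise
          print('TOO MANY DATA (%d items), SPLITTING (run #%d)...' %
                (len(items), _run))
          half = len(items) // 2
          flush_data(curs, sql, items[:half], _run + 1)
          flush_data(curs, sql, items[half:], _run + 1)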

> These lines in the _toDB() method of the MoviesCache class seem
> like they could play a part in producing this behavior (line 578
> of imdbpy2sql.py):

Maybe you're right.  I have to take a closer look at that: it's been
a long time since I last modified the script (and it's poorly
commented).

> Thanks again for the help.

Thanks to you!
Obviously you've won an entry in the CREDITS file. :-)


+++
[1] JMDB [2], a good interface to the plain text data files which
    populates a MySQL database using a Java program and _without_ a
    single data manipulation (compared to the arguably excessive
    manipulation done by imdbpy2sql.py), takes more or less the same
    amount of time to complete.
[2] http://www.jmdb.de/
-- 
Davide Alberani <[EMAIL PROTECTED]> [PGP KeyID: 0x465BFD47]
http://erlug.linux.it/~da/
